# LangChain Processing of Meta's 2023 10-K

- Google Form with [instructions](https://docs.google.com/forms/d/e/1FAIpQLSfRHORtHFiPUGCiYNt2NfapWtgUQWbv5V75kUPwUAkx20r9Eg/viewform)

## 1. Setup

In [1]:
import nest_asyncio

nest_asyncio.apply()

import logging
import sys
import os
from dotenv import find_dotenv, load_dotenv

load_dotenv(find_dotenv())

True

In [2]:
DEFAULT_QUESTION1 = "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"
DEFAULT_QUESTION2 = "Who are Meta's 'Directors' (i.e., members of the Board of Directors)?"

## 2. Loading Document

In [3]:
from langchain_community.document_loaders import PyMuPDFLoader
loader = PyMuPDFLoader(
    "../data/meta-10k-2023.pdf",
)

# from langchain_community.document_loaders import UnstructuredPDFLoader
# loader = UnstructuredPDFLoader(
#     file_path="../data/meta-10k-2023.pdf",
#     mode="elements"
# )

documents = loader.load()
len(documents)

147

In [4]:
documents[0].metadata

{'source': '../data/meta-10k-2023.pdf',
 'file_path': '../data/meta-10k-2023.pdf',
 'page': 0,
 'total_pages': 147,
 'format': 'PDF 1.4',
 'title': '0001326801-24-000012',
 'author': 'EDGAR® Online LLC, a subsidiary of OTC Markets Group',
 'subject': 'Form 10-K filed on 2024-02-02 for the period ending 2023-12-31',
 'keywords': '0001326801-24-000012; ; 10-K',
 'creator': 'EDGAR Filing HTML Converter',
 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0',
 'creationDate': "D:20240202060356-05'00'",
 'modDate': "D:20240202060413-05'00'",
 'trapped': '',
 'encryption': 'Standard V2 R3 128-bit RC4'}

## 3. Transforming Data

In [5]:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_core.prompts import ChatPromptTemplate

text_splitter = RecursiveCharacterTextSplitter(
    chunk_size=1024,
    chunk_overlap=64,
)

docs = text_splitter.split_documents(documents)
len(docs)

621
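To make the roles of `chunk_size` and `chunk_overlap` concrete, here is a minimal plain-Python sketch of a fixed-window splitter. It is an illustration only, not how `RecursiveCharacterTextSplitter` works internally: the real splitter tries separators such as paragraph breaks, newlines, and spaces before falling back to raw characters, so its chunks are usually shorter than the limit. The `chunk_text` helper and the synthetic `sample` string are assumptions made up for this example.

```python
def chunk_text(text: str, chunk_size: int, chunk_overlap: int) -> list[str]:
    """Split text into windows of chunk_size characters that share chunk_overlap characters."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap  # how far each window advances
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current window already reaches the end of the text
    return chunks


# Synthetic 2,500-character input standing in for one page of text.
sample = "".join(str(i % 10) for i in range(2500))
chunks = chunk_text(sample, chunk_size=1024, chunk_overlap=64)

print([len(c) for c in chunks])
# The tail of one chunk is repeated at the head of the next one.
print(chunks[0][-64:] == chunks[1][:64])
```

The overlap means a sentence that straddles a chunk boundary still appears whole in at least one chunk, at the cost of embedding some text twice.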

## 4. Embedding & Vector Storage

In [6]:
from langchain_openai import ChatOpenAI
chat_model = ChatOpenAI(model="gpt-3.5-turbo", temperature=0.0)

# from langchain_community.chat_models import ChatOllama
# chat_model = ChatOllama(model="llama3")

In [7]:
from langchain_openai import OpenAIEmbeddings
embeddings = OpenAIEmbeddings(
    model="text-embedding-3-small"
)

# from langchain_voyageai import VoyageAIEmbeddings
# EMBEDDING_MODEL = "voyage-2"  # Alternative: "voyage-lite-02-instruct"
# embeddings = VoyageAIEmbeddings(model=EMBEDDING_MODEL, batch_size=12)


In [8]:
from langchain_community.vectorstores import Qdrant

qdrant_vectorstore = Qdrant.from_documents(
    docs,
    embeddings,
    path="../data",
    # location=":memory:",
    collection_name="meta10k",
)

In [9]:
qdrant_retriever = qdrant_vectorstore.as_retriever()

In [10]:
from langchain.retrievers.multi_query import MultiQueryRetriever

mquery_retriever = MultiQueryRetriever.from_llm(
    retriever=qdrant_retriever, llm=chat_model
)
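The idea behind `MultiQueryRetriever` is that the LLM rewrites the question into several variants, each variant is sent to the underlying retriever, and the unique union of the hits is returned. The sketch below mimics that flow in plain Python under stated assumptions: `fake_variants` stands in for the LLM, `fake_retrieve` stands in for the Qdrant retriever, and the page labels in `CORPUS` are invented for illustration.

```python
from typing import Callable

# Hypothetical keyword index; the page labels are made up for this sketch.
CORPUS = {
    "cash": ["balance sheet page", "liquidity note"],
    "directors": ["signatures page", "governance page"],
    "board": ["governance page", "proxy reference"],
}


def fake_retrieve(query: str) -> list[str]:
    """Stand-in retriever: return pages whose keyword occurs in the query."""
    hits = []
    for keyword, pages in CORPUS.items():
        if keyword in query.lower():
            hits.extend(pages)
    return hits


def fake_variants(question: str) -> list[str]:
    """Stand-in for the LLM call that rephrases the question."""
    return [question, "Who sits on the board?", "List the directors"]


def multi_query_retrieve(
    question: str,
    generate_variants: Callable[[str], list[str]],
    retrieve: Callable[[str], list[str]],
) -> list[str]:
    """Retrieve for every query variant and keep each document once, in first-seen order."""
    seen = set()
    merged = []
    for query in generate_variants(question):
        for doc in retrieve(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


result = multi_query_retrieve("Who are the directors?", fake_variants, fake_retrieve)
print(result)
```

Here the rephrased query about the board surfaces a page that the original wording alone would have missed, which is the motivation for using multiple queries.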

## 5. LCEL



In [11]:
RAG_PROMPT = """
CONTEXT:
{context}

QUERY:
{question}

You should only respond to user's query if the context is related to the query.  If not, please reply "I don't know".
"""

rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)

In [12]:
from operator import itemgetter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough

retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question": "<<SOME USER QUESTION>>"}
    # "question" : copied from the "question" key of the input
    # "context"  : the "question" value piped into mquery_retriever, which returns the retrieved documents
    {"context": itemgetter("question") | mquery_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes every key through unchanged and re-assigns "context"
    #              to the value it already holds, so this step does not alter the data
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : "context" and "question" fill rag_prompt, which is piped into chat_model;
    #              the resulting message is stored under the "response" key
    # "context"  : carried over from the previous step so the caller can inspect the retrieved documents
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)
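The data flow through the chain can be traced with ordinary functions. The following is a rough plain-Python analogue, not LangChain code: `fake_retriever`, `fake_prompt`, and `fake_llm` are hypothetical stand-ins for `mquery_retriever`, `rag_prompt`, and `chat_model`, and each step of `run_chain` mirrors one stage of the pipe.

```python
def fake_retriever(question: str) -> list[str]:
    """Stand-in for mquery_retriever: returns a list of retrieved passages."""
    return [f"passage about: {question}"]


def fake_prompt(inputs: dict) -> str:
    """Stand-in for rag_prompt: fills the template from the 'context' and 'question' keys."""
    return f"CONTEXT:\n{inputs['context']}\n\nQUERY:\n{inputs['question']}"


def fake_llm(prompt_text: str) -> str:
    """Stand-in for chat_model: returns a string instead of a chat message."""
    return f"answer based on {len(prompt_text)} characters of prompt"


def run_chain(inputs: dict) -> dict:
    # Stage 1: build a dict with the retrieved context and the original question.
    state = {
        "context": fake_retriever(inputs["question"]),
        "question": inputs["question"],
    }
    # Stage 2: analogue of RunnablePassthrough.assign(context=itemgetter("context")):
    # all keys pass through and "context" is re-assigned to its own value.
    state = {**state, "context": state["context"]}
    # Stage 3: format the prompt, call the model, and keep the context alongside the answer.
    return {"response": fake_llm(fake_prompt(state)), "context": state["context"]}


result = run_chain({"question": "What is X?"})
print(result["response"])
print(result["context"])
```

In the notebook the `"response"` value is a chat message object rather than a string, which is why the test cells read `response["response"].content`.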

## Testing with basic RAG

In [13]:
response1 = retrieval_augmented_qa_chain.invoke({"question": DEFAULT_QUESTION1})
print(response1["response"].content)

The total value of 'Cash and cash equivalents' as of December 31, 2023, was $65.40 billion.


In [14]:
response2 = retrieval_augmented_qa_chain.invoke({"question": DEFAULT_QUESTION2})
print(response2["response"].content)

The Directors of Meta Platforms, Inc. mentioned in the document are:
- Robert M. Kimmitt
- Sheryl K. Sandberg
- Tracey T. Travis
- Tony Xu


In [15]:
response3 = retrieval_augmented_qa_chain.invoke(
    {"question": "Who are the 'Directors' (i.e., members of the Board of Directors)?"})
print(response3["response"].content)

The Directors mentioned in the context are Robert M. Kimmitt, Sheryl K. Sandberg, Tracey T. Travis, Tony Xu, Mark Zuckerberg, Susan Li, Aaron Anderson, Peggy Alford, Marc L. Andreessen, Andrew W. Houston, Nancy Killefer.


## Semantic Chunking

In [18]:
from langchain_experimental.text_splitter import SemanticChunker

semantic_chunker = SemanticChunker(
    OpenAIEmbeddings(model="text-embedding-3-large"), 
    breakpoint_threshold_type="percentile"
)
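With `breakpoint_threshold_type="percentile"`, the chunker compares the embeddings of neighbouring sentences and starts a new chunk wherever the distance between them is unusually large. The sketch below illustrates that idea in plain Python under several assumptions: the two-dimensional vectors are hand-written stand-ins for real embeddings, the sentences are invented, the 95th-percentile cutoff is chosen for the example, and the library's actual sentence splitting and buffering are not reproduced.

```python
import math


def cosine_distance(a: list[float], b: list[float]) -> float:
    """1 - cosine similarity of two vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norms = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / norms


def percentile(values: list[float], pct: float) -> float:
    """Percentile with linear interpolation between the two nearest ranks."""
    ordered = sorted(values)
    rank = (len(ordered) - 1) * pct / 100
    lo, hi = math.floor(rank), math.ceil(rank)
    return ordered[lo] + (ordered[hi] - ordered[lo]) * (rank - lo)


def split_on_breakpoints(sentences: list[str], vectors: list[list[float]], pct: float = 95) -> list[str]:
    """Group sentences into chunks, cutting where neighbour distance exceeds the percentile threshold."""
    if len(sentences) < 2:
        return [" ".join(sentences)]
    distances = [cosine_distance(vectors[i], vectors[i + 1]) for i in range(len(vectors) - 1)]
    threshold = percentile(distances, pct)
    chunks, current = [], [sentences[0]]
    for sentence, distance in zip(sentences[1:], distances):
        if distance > threshold:  # large jump in meaning: close the current chunk
            chunks.append(" ".join(current))
            current = []
        current.append(sentence)
    chunks.append(" ".join(current))
    return chunks


# Toy data: two sentences about revenue, then two about governance.
sentences = [
    "Revenue grew in 2023.",
    "Advertising drove most revenue.",
    "The board has several committees.",
    "Directors are elected annually.",
]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]

semantic_groups = split_on_breakpoints(sentences, vectors)
for group in semantic_groups:
    print(group)
```

Because the cut points depend on content rather than a character budget, chunk sizes vary, which is consistent with the semantic run producing a different number of chunks than the fixed-size splitter.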

In [21]:
len(documents)

147

In [22]:
documents[0]

Document(page_content='UNITED STATES\nSECURITIES AND EXCHANGE COMMISSION\nWashington, D.C.\xa020549\n__________________________\nFORM 10-K\n__________________________\n(Mark One)\n☒\xa0\xa0\xa0\xa0ANNUAL REPORT PURSUANT TO SECTION 13 OR 15(d)\xa0OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the fiscal year ended December\xa031, 2023\nor\n☐\xa0\xa0\xa0\xa0TRANSITION REPORT PURSUANT TO SECTION 13 OR 15(d)\xa0OF THE SECURITIES EXCHANGE ACT OF 1934\nFor the transition period from\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0to\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\xa0\nCommission File Number:\xa0001-35551\n__________________________\nMeta Platforms, Inc.\n(Exact name of registrant as specified in its charter)\n__________________________\nDelaware\n20-1665019\n(State or other jurisdiction of incorporation or organization)\n(I.R.S. Employer Identification Number)\n1 Meta Way, Menlo Park, California 94025\n(Address of principal executive offices and Zip Code)\n(650)\xa0543-4800\n(Regist

In [23]:
semantic_chunks = semantic_chunker.create_documents([d.page_content for d in documents])

In [24]:
len(semantic_chunks)

346

## Creating a RAG Pipeline using Semantic Chunks

In [26]:
semantic_chunk_vectorstore = Qdrant.from_documents(
    semantic_chunks,
    embeddings,
    path="../data/semantic-chunks",
    # location=":memory:",
    collection_name="meta10k-semantic",
)

In [None]:
semantic_chunk_vectorstore

In [27]:
semantic_chunk_retriever = semantic_chunk_vectorstore.as_retriever()

In [28]:
from langchain.retrievers.multi_query import MultiQueryRetriever

semantic_mquery_retriever = MultiQueryRetriever.from_llm(
    retriever=semantic_chunk_retriever, 
    llm=chat_model
)

In [32]:
semantic_retrieval_augmented_qa_chain = (
    # INVOKE CHAIN WITH: {"question": "<<SOME USER QUESTION>>"}
    # "question" : copied from the "question" key of the input
    # "context"  : the "question" value piped into semantic_mquery_retriever, which returns the retrieved documents
    {"context": itemgetter("question") | semantic_mquery_retriever, "question": itemgetter("question")}
    # RunnablePassthrough.assign passes every key through unchanged and re-assigns "context"
    #              to the value it already holds, so this step does not alter the data
    | RunnablePassthrough.assign(context=itemgetter("context"))
    # "response" : "context" and "question" fill rag_prompt, which is piped into chat_model;
    #              the resulting message is stored under the "response" key
    # "context"  : carried over from the previous step so the caller can inspect the retrieved documents
    | {"response": rag_prompt | chat_model, "context": itemgetter("context")}
)

In [36]:
type(semantic_retrieval_augmented_qa_chain)

langchain_core.runnables.base.RunnableSequence

## Testing with Semantic Chunk RAG

In [33]:
semantic_response1 = semantic_retrieval_augmented_qa_chain.invoke({"question": DEFAULT_QUESTION1})
print(semantic_response1["response"].content)

The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41.862 billion.


In [34]:
semantic_response2 = semantic_retrieval_augmented_qa_chain.invoke({"question": DEFAULT_QUESTION2})
print(semantic_response2["response"].content)

The Directors of Meta, as mentioned in the provided documents, include:
- Andrew W. Houston
- Nancy Killefer
- Robert M. Kimmitt
- Sheryl K. Sandberg
- Tracey T. Travis
- Tony Xu


In [35]:
semantic_response3 = semantic_retrieval_augmented_qa_chain.invoke(
    {"question": "Who are the 'Directors' (i.e., members of the Board of Directors)?"})
print(semantic_response3["response"].content)

The members of the Board of Directors mentioned in the provided documents are:
- Andrew W. Houston
- Nancy Killefer
- Robert M. Kimmitt
- Sheryl K. Sandberg
- Tracey T. Travis
- Tony Xu
- Mark Zuckerberg
- Susan Li
- Aaron Anderson
- Peggy Alford
- Marc L. Andreessen
