Spaces: JulsdL / DeepPDF_AI

JulsdL committed on May 1, 2024

Commit 99cc165

1 Parent(s): 43f4eb8

Add initial deepPDF notebook and changelog for v0.1.0

- Implement environment setup, data loading, chunking, embedding, vector storing, RAG prompt, RAG chain, and response generation in deepPDF.ipynb
- Install necessary packages including langchain, qdrant-client, tiktoken, and pymupdf
- Configure OpenAI API key setup and import necessary modules for processing and querying data
- Define and utilize text splitter, embeddings, vector store, and retriever for document processing
- Create RAG prompt template and chain for retrieval-augmented question answering
- Generate responses to example queries demonstrating the functionality
- Initialize CHANGELOG.md with details of the notebook creation for version 0.1.0
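
The chunking step listed above relies on a chunk size, an overlap, and a length function. As a rough illustration of how those three settings interact, here is a minimal standard-library sketch. It is not `RecursiveCharacterTextSplitter`: it slides a fixed window over whitespace-separated words, and `word_len` and `sliding_chunks` are hypothetical names in which a word count stands in for the notebook's tiktoken-based token count.

```python
from typing import List


def word_len(text: str) -> int:
    """Hypothetical stand-in for a tokenizer-based length function (counts words)."""
    return len(text.split())


def sliding_chunks(text: str, chunk_size: int = 200, chunk_overlap: int = 50) -> List[str]:
    """Split text into windows of at most chunk_size words that overlap by chunk_overlap words."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    words = text.split()
    step = chunk_size - chunk_overlap  # how far each new window advances
    chunks: List[str] = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the current window already reached the end of the text
    return chunks


# Toy input: 500 distinct "words" (assumption, not the 10-K text).
sample = " ".join(f"w{i}" for i in range(500))
chunks = sliding_chunks(sample, chunk_size=200, chunk_overlap=50)
print(len(chunks), [word_len(c) for c in chunks])
```

With a size of 200 and an overlap of 50, each window starts 150 words after the previous one, so neighbouring chunks share 50 words. The real splitter additionally prefers to break on separators such as paragraph and line boundaries, which this sketch ignores.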

Files changed (2)

CHANGELOG.md +5 -0
deepPDF.ipynb +317 -0

CHANGELOG.md ADDED

	@@ -0,0 +1,5 @@

+## v0.1.0 (2024-05-01)
+### Added
+- Introduced a Jupyter notebook for PDF RAG QA application, including environment setup, data loading, chunking, embedding, vector storing, and response generation using langchain, qdrant-client, tiktoken, pymupdf, and OpenAI's GPT models.

deepPDF.ipynb ADDED

	@@ -0,0 +1,317 @@

+{
+ "cells": [
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Setting up the environment"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 1,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -qU langchain langchain-core langchain-community langchain-openai"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 2,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -qU qdrant-client"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 3,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "!pip install -qU tiktoken pymupdf"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 4,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "import os\n",
+    "import getpass\n",
+    "\n",
+    "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 17,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_openai import ChatOpenAI\n",
+    "\n",
+    "openai_chat_model = ChatOpenAI(model=\"gpt-3.5-turbo\")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Loading the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 5,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.document_loaders import PyMuPDFLoader\n",
+    "\n",
+    "docs = PyMuPDFLoader(\"https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf\").load()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Chunking the data"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 9,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+    "import tiktoken\n",
+    "\n",
+    "def tiktoken_len(text):\n",
+    "    tokens = tiktoken.encoding_for_model(\"gpt-3.5-turbo\").encode(\n",
+    "        text,\n",
+    "    )\n",
+    "    return len(tokens)\n"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 10,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "text_splitter = RecursiveCharacterTextSplitter(\n",
+    "    chunk_size = 200,\n",
+    "    chunk_overlap = 50,\n",
+    "    length_function = tiktoken_len,\n",
+    ")\n",
+    "\n",
+    "split_chunks = text_splitter.split_documents(docs)"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 11,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "765"
+      ]
+     },
+     "execution_count": 11,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "len(split_chunks)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Embedding and vector storing"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 12,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_openai.embeddings import OpenAIEmbeddings\n",
+    "\n",
+    "embedding_model = OpenAIEmbeddings(model=\"text-embedding-3-small\")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 13,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_community.vectorstores import Qdrant\n",
+    "\n",
+    "qdrant_vectorstore = Qdrant.from_documents(\n",
+    "    split_chunks,\n",
+    "    embedding_model,\n",
+    "    location=\":memory:\",\n",
+    "    collection_name=\"Meta 10-K Filings\",\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 14,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "qdrant_retriever = qdrant_vectorstore.as_retriever()"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## RAG Prompt"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 15,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from langchain_core.prompts import ChatPromptTemplate"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 16,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "RAG_PROMPT = \"\"\"\n",
+    "CONTEXT:\n",
+    "{context}\n",
+    "\n",
+    "QUERY:\n",
+    "{question}\n",
+    "\n",
+    "Answer the query if the context is related to it; otherwise, answer: 'Sorry, the context is unrelated to the query, I can't answer.'\n",
+    "\"\"\"\n",
+    "\n",
+    "rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## RAG Chain"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 18,
+   "metadata": {},
+   "outputs": [],
+   "source": [
+    "from operator import itemgetter\n",
+    "from langchain_core.output_parsers import StrOutputParser\n",
+    "from langchain_core.runnables import RunnablePassthrough\n",
+    "\n",
+    "retrieval_augmented_qa_chain = (\n",
+    "    # INVOKE CHAIN WITH: {\"question\" : \"<<SOME USER QUESTION>>\"}\n",
+    "    # \"question\" : populated by getting the value of the \"question\" key\n",
+    "    # \"context\"  : populated by getting the value of the \"question\" key and piping it into qdrant_retriever\n",
+    "    {\"context\": itemgetter(\"question\") | qdrant_retriever, \"question\": itemgetter(\"question\")}\n",
+    "    # \"context\"  : re-assigned to its own value by RunnablePassthrough.assign, which passes the\n",
+    "    #              rest of the input dict (here, \"question\") through unchanged\n",
+    "    | RunnablePassthrough.assign(context=itemgetter(\"context\"))\n",
+    "    # \"response\" : the \"context\" and \"question\" values are used to format our prompt object and then piped\n",
+    "    #              into the LLM and stored in a key called \"response\"\n",
+    "    # \"context\"  : populated by getting the value of the \"context\" key from the previous step\n",
+    "    | {\"response\": rag_prompt | openai_chat_model, \"context\": itemgetter(\"context\")}\n",
+    ")"
+   ]
+  },
+  {
+   "cell_type": "markdown",
+   "metadata": {},
+   "source": [
+    "## Response generation"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 19,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "\"The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862.\""
+      ]
+     },
+     "execution_count": 19,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "response_1 = retrieval_augmented_qa_chain.invoke({\"question\" : \"What was the total value of 'Cash and cash equivalents' as of December 31, 2023?\"})\n",
+    "response_1[\"response\"].content"
+   ]
+  },
+  {
+   "cell_type": "code",
+   "execution_count": 20,
+   "metadata": {},
+   "outputs": [
+    {
+     "data": {
+      "text/plain": [
+       "\"Sorry, the context is unrelated to the query, I can't answer.\""
+      ]
+     },
+     "execution_count": 20,
+     "metadata": {},
+     "output_type": "execute_result"
+    }
+   ],
+   "source": [
+    "response_2 = retrieval_augmented_qa_chain.invoke({\"question\" : \"Who are Meta's 'Directors' (i.e., members of the Board of Directors)?\"})\n",
+    "response_2[\"response\"].content"
+   ]
+  }
+ ],
+ "metadata": {
+  "kernelspec": {
+   "display_name": "AIMakerSpace",
+   "language": "python",
+   "name": "python3"
+  },
+  "language_info": {
+   "codemirror_mode": {
+    "name": "ipython",
+    "version": 3
+   },
+   "file_extension": ".py",
+   "mimetype": "text/x-python",
+   "name": "python",
+   "nbconvert_exporter": "python",
+   "pygments_lexer": "ipython3",
+   "version": "3.11.3"
+  }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+}
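
To make the data flow of the notebook's RAG chain easier to follow without an API key, here is a minimal standard-library sketch of the same shape: retrieve context for a question, fill the prompt template, call a model, and return both the response and the context. Everything except the prompt text is an assumption for illustration: `bag_of_words` is a toy substitute for the OpenAI embeddings, `retrieve` replaces the Qdrant retriever, `echo_llm` is a hypothetical stub in place of `ChatOpenAI`, and the sample chunks are invented.

```python
from collections import Counter
from math import sqrt
from typing import Dict, List, Union

RAG_PROMPT = """
CONTEXT:
{context}

QUERY:
{question}

Answer the query if the context is related to it; otherwise, answer: 'Sorry, the context is unrelated to the query, I can't answer.'
"""


def bag_of_words(text: str) -> Counter:
    # Toy "embedding": lowercase word counts (assumption, not an OpenAI embedding).
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    # Cosine similarity between two sparse word-count vectors.
    dot = sum(a[term] * b[term] for term in a)
    norm = sqrt(sum(v * v for v in a.values())) * sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def retrieve(question: str, chunks: List[str], k: int = 2) -> List[str]:
    # Rank chunks by similarity to the question and keep the top k.
    query_vec = bag_of_words(question)
    ranked = sorted(chunks, key=lambda c: cosine(query_vec, bag_of_words(c)), reverse=True)
    return ranked[:k]


def echo_llm(prompt: str) -> str:
    # Hypothetical model stub: a real chain would send the prompt to a chat model.
    return f"[stub answer based on {len(prompt)} prompt characters]"


def rag_chain(question: str, chunks: List[str]) -> Dict[str, Union[str, List[str]]]:
    # Same output shape as the notebook's chain: a "response" plus the "context" used.
    context = retrieve(question, chunks)
    prompt = RAG_PROMPT.format(context="\n".join(context), question=question)
    return {"response": echo_llm(prompt), "context": context}


# Invented sample chunks, for illustration only.
toy_chunks = [
    "cash and cash equivalents were reported on the balance sheet",
    "the board of directors met in december",
    "advertising revenue grew year over year",
]
result = rag_chain("what were cash and cash equivalents", toy_chunks)
print(result["context"][0])
```

Returning the retrieved context alongside the response, as the notebook's chain does, makes it possible to check whether an "unrelated context" answer comes from retrieval missing the relevant passage rather than from the model.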