JulsdL committed on
Commit
99cc165
·
1 Parent(s): 43f4eb8

Add initial deepPDF notebook and changelog for v0.1.0


- Implement environment setup, data loading, chunking, embedding, vector storage, a RAG prompt, a RAG chain, and response generation in deepPDF.ipynb (condensed in the sketch below)
- Install necessary packages including langchain, qdrant-client, tiktoken, and pymupdf
- Configure the OpenAI API key and import the modules needed for processing and querying the data
- Define and use a text splitter, embeddings, a vector store, and a retriever for document processing
- Create a RAG prompt template and chain for retrieval-augmented question answering
- Generate responses to example queries demonstrating the functionality
- Initialize CHANGELOG.md with details of the notebook creation for version 0.1.0
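
For reviewers skimming the diff, the committed cells condense to roughly the script below. The model names, splitter parameters, prompt, and 10-K URL come straight from the notebook; the rest is incidental glue, so treat this as a sketch of the pipeline rather than additional committed code.

```python
# Condensed sketch of the pipeline in deepPDF.ipynb (same models and parameters
# as the committed cells; the PDF is the Meta 10-K filing the notebook loads).
import os, getpass
from operator import itemgetter

import tiktoken
from langchain.document_loaders import PyMuPDFLoader
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.schema.runnable import RunnablePassthrough
from langchain_community.vectorstores import Qdrant
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
from langchain_openai.embeddings import OpenAIEmbeddings

os.environ["OPENAI_API_KEY"] = getpass.getpass("OpenAI API Key:")

# Load the PDF and split it into ~200-token chunks with 50-token overlap,
# measuring length with the gpt-3.5-turbo tokenizer.
docs = PyMuPDFLoader("https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf").load()
tokenizer = tiktoken.encoding_for_model("gpt-3.5-turbo")
splitter = RecursiveCharacterTextSplitter(
    chunk_size=200,
    chunk_overlap=50,
    length_function=lambda text: len(tokenizer.encode(text)),
)
chunks = splitter.split_documents(docs)

# Embed the chunks into an in-memory Qdrant collection and expose a retriever.
retriever = Qdrant.from_documents(
    chunks,
    OpenAIEmbeddings(model="text-embedding-3-small"),
    location=":memory:",
    collection_name="Meta 10-K Filings",
).as_retriever()

prompt = ChatPromptTemplate.from_template(
    "CONTEXT:\n{context}\n\nQUERY:\n{question}\n\n"
    "Answer the query if the context is related to it; otherwise, answer: "
    "'Sorry, the context is unrelated to the query, I can't answer.'"
)

# Retrieve context for the question, format the prompt, call the model,
# and return both the model response and the retrieved documents.
chain = (
    {"context": itemgetter("question") | retriever, "question": itemgetter("question")}
    | RunnablePassthrough.assign(context=itemgetter("context"))
    | {"response": prompt | ChatOpenAI(model="gpt-3.5-turbo"), "context": itemgetter("context")}
)

result = chain.invoke({"question": "What was the total value of 'Cash and cash equivalents' as of December 31, 2023?"})
print(result["response"].content)
```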

Files changed (2)
  1. CHANGELOG.md +5 -0
  2. deepPDF.ipynb +317 -0
CHANGELOG.md ADDED
@@ -0,0 +1,5 @@
+ ## v0.1.0 (2024-05-01)
+
+ ### Added
+
+ - Introduced a Jupyter notebook for a PDF RAG QA application, including environment setup, data loading, chunking, embedding, vector storage, and response generation using langchain, qdrant-client, tiktoken, pymupdf, and OpenAI's GPT models.
deepPDF.ipynb ADDED
@@ -0,0 +1,317 @@
+ {
+ "cells": [
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Setting up the environment"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 1,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install -qU langchain langchain-core langchain-community langchain-openai"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 2,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install -qU qdrant-client"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 3,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "!pip install -qU tiktoken pymupdf"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 4,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "import os\n",
+ "import getpass\n",
+ "\n",
+ "os.environ[\"OPENAI_API_KEY\"] = getpass.getpass(\"OpenAI API Key:\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 17,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_openai import ChatOpenAI\n",
+ "\n",
+ "openai_chat_model = ChatOpenAI(model=\"gpt-3.5-turbo\")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Loading the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 5,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.document_loaders import PyMuPDFLoader\n",
+ "\n",
+ "docs = PyMuPDFLoader(\"https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf\").load()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Chunking the data"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 9,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain.text_splitter import RecursiveCharacterTextSplitter\n",
+ "import tiktoken\n",
+ "\n",
+ "def tiktoken_len(text):\n",
+ "    tokens = tiktoken.encoding_for_model(\"gpt-3.5-turbo\").encode(\n",
+ "        text,\n",
+ "    )\n",
+ "    return len(tokens)\n"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 10,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "text_splitter = RecursiveCharacterTextSplitter(\n",
+ "    chunk_size = 200,\n",
+ "    chunk_overlap = 50,\n",
+ "    length_function = tiktoken_len,\n",
+ ")\n",
+ "\n",
+ "split_chunks = text_splitter.split_documents(docs)"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 11,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "765"
+ ]
+ },
+ "execution_count": 11,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "len(split_chunks)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Embedding and vector storing"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 12,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_openai.embeddings import OpenAIEmbeddings\n",
+ "\n",
+ "embedding_model = OpenAIEmbeddings(model=\"text-embedding-3-small\")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 13,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_community.vectorstores import Qdrant\n",
+ "\n",
+ "qdrant_vectorstore = Qdrant.from_documents(\n",
+ "    split_chunks,\n",
+ "    embedding_model,\n",
+ "    location=\":memory:\",\n",
+ "    collection_name=\"Meta 10-K Filings\",\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 14,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "qdrant_retriever = qdrant_vectorstore.as_retriever()"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## RAG Prompt"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 15,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from langchain_core.prompts import ChatPromptTemplate"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 16,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "RAG_PROMPT = \"\"\"\n",
+ "CONTEXT:\n",
+ "{context}\n",
+ "\n",
+ "QUERY:\n",
+ "{question}\n",
+ "\n",
+ "Answer the query if the context is related to it; otherwise, answer: 'Sorry, the context is unrelated to the query, I can't answer.'\n",
+ "\"\"\"\n",
+ "\n",
+ "rag_prompt = ChatPromptTemplate.from_template(RAG_PROMPT)"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## RAG Chain"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 18,
+ "metadata": {},
+ "outputs": [],
+ "source": [
+ "from operator import itemgetter\n",
+ "from langchain.schema.output_parser import StrOutputParser\n",
+ "from langchain.schema.runnable import RunnablePassthrough\n",
+ "\n",
+ "retrieval_augmented_qa_chain = (\n",
+ "    # INVOKE CHAIN WITH: {\"question\" : \"<<SOME USER QUESTION>>\"}\n",
+ "    # \"question\" : populated by getting the value of the \"question\" key\n",
+ "    # \"context\" : populated by piping the value of the \"question\" key into qdrant_retriever\n",
+ "    {\"context\": itemgetter(\"question\") | qdrant_retriever, \"question\": itemgetter(\"question\")}\n",
+ "    # RunnablePassthrough.assign passes the dict through while re-assigning \"context\"\n",
+ "    # from the previous step's \"context\" key\n",
+ "    | RunnablePassthrough.assign(context=itemgetter(\"context\"))\n",
+ "    # \"response\" : the \"context\" and \"question\" values format the prompt, which is piped\n",
+ "    # into the LLM; the model output is stored under the \"response\" key\n",
+ "    # \"context\" : carried over from the previous step so the retrieved documents are returned too\n",
+ "    | {\"response\": rag_prompt | openai_chat_model, \"context\": itemgetter(\"context\")}\n",
+ ")"
+ ]
+ },
+ {
+ "cell_type": "markdown",
+ "metadata": {},
+ "source": [
+ "## Response generation"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 19,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862.\""
+ ]
+ },
+ "execution_count": 19,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "response_1 = retrieval_augmented_qa_chain.invoke({\"question\" : \"What was the total value of 'Cash and cash equivalents' as of December 31, 2023?\"})\n",
+ "response_1[\"response\"].content"
+ ]
+ },
+ {
+ "cell_type": "code",
+ "execution_count": 20,
+ "metadata": {},
+ "outputs": [
+ {
+ "data": {
+ "text/plain": [
+ "\"Sorry, the context is unrelated to the query, I can't answer.\""
+ ]
+ },
+ "execution_count": 20,
+ "metadata": {},
+ "output_type": "execute_result"
+ }
+ ],
+ "source": [
+ "response_2 = retrieval_augmented_qa_chain.invoke({\"question\" : \"Who are Meta's 'Directors' (i.e., members of the Board of Directors)?\"})\n",
+ "response_2[\"response\"].content"
+ ]
+ }
+ ],
+ "metadata": {
+ "kernelspec": {
+ "display_name": "AIMakerSpace",
+ "language": "python",
+ "name": "python3"
+ },
+ "language_info": {
+ "codemirror_mode": {
+ "name": "ipython",
+ "version": 3
+ },
+ "file_extension": ".py",
+ "mimetype": "text/x-python",
+ "name": "python",
+ "nbconvert_exporter": "python",
+ "pygments_lexer": "ipython3",
+ "version": "3.11.3"
+ }
+ },
+ "nbformat": 4,
+ "nbformat_minor": 2
+ }