Enhance the deepPDF notebook: integrate MultiQueryRetriever to upgrade the retrieval strategy, and set the ChatOpenAI temperature parameter to 0 for more deterministic outputs.
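To illustrate the idea behind the retrieval upgrade, here is a minimal, library-free sketch of multi-query retrieval as I understand it: several rephrasings of the question are each sent to a base retriever, and the unique union of the results is returned. Everything in it is an assumption made for illustration: the toy `CHUNKS`, the keyword-overlap `keyword_retriever` (a stand-in for the Qdrant retriever), and `fake_query_variants` (a stand-in for the LLM that would normally generate the rephrasings). It is not the notebook's code or LangChain's implementation.

```python
import re
from typing import Callable, List

# Hypothetical toy chunks standing in for the indexed 10-K; not taken from the notebook.
CHUNKS = [
    "Cash and cash equivalents were reported on the consolidated balance sheet.",
    "Our board of directors has oversight of strategic risk management.",
    "The following persons serve as directors of the registrant.",
    "We operate offices in approximately 90 cities around the globe.",
]


def tokenize(text: str) -> set:
    """Lowercase the text and split it into a set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def keyword_retriever(query: str, k: int = 1) -> List[str]:
    """Assumed stand-in for a vector store: rank chunks by word overlap with the query."""
    query_words = tokenize(query)
    scored = [(len(query_words & tokenize(chunk)), idx) for idx, chunk in enumerate(CHUNKS)]
    # Highest overlap first; ties are broken by original chunk order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [CHUNKS[idx] for score, idx in scored[:k] if score > 0]


def fake_query_variants(question: str) -> List[str]:
    """Assumed stand-in for the LLM step: return the question plus one fixed rephrasing."""
    return [question, "persons serving as directors of the registrant"]


def multi_query_retrieve(
    question: str,
    generate_queries: Callable[[str], List[str]],
    retriever: Callable[[str], List[str]],
) -> List[str]:
    """Run the retriever once per query variant and return the unique union, in first-seen order."""
    seen = set()
    merged = []
    for query in generate_queries(question):
        for doc in retriever(query):
            if doc not in seen:
                seen.add(doc)
                merged.append(doc)
    return merged


question = "Who are Meta's directors?"
single = keyword_retriever(question)
merged = multi_query_retrieve(question, fake_query_variants, keyword_retriever)
print(single)
print(merged)
```

With this toy data, the single query only surfaces the board-oversight chunk, while the rephrased variant also pulls in the chunk that talks about persons serving as directors. In the notebook, the analogous step would presumably be to invoke `retrieval_augmented_qa_chain_3`, since that is the chain wired to `multiquery_retriever`.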
- deepPDF.ipynb +94 -2
deepPDF.ipynb
CHANGED
@@ -54,7 +54,7 @@
|
|
54 |
"source": [
|
55 |
"from langchain_openai import ChatOpenAI\n",
|
56 |
"\n",
|
57 |
-
"openai_chat_model = ChatOpenAI(model=\"gpt-3.5-turbo\")"
|
58 |
]
|
59 |
},
|
60 |
{
|
@@ -455,7 +455,7 @@
|
|
455 |
"metadata": {},
|
456 |
"source": [
|
457 |
"## Second pipeline results analysis\n",
|
458 |
-
"- The second pipeline
|
459 |
]
|
460 |
},
|
461 |
{
|
@@ -465,6 +465,98 @@
|
|
465 |
"## Upgrading the retrieval strategy\n",
|
466 |
"- While conserving the same chunking strategy as in the first pipeline, I will try to upgrade the retrieval strategy by using the MultiQueryRetriever."
|
467 |
]
|
468 |
}
|
469 |
],
|
470 |
"metadata": {
|
|
|
54 |
"source": [
|
55 |
"from langchain_openai import ChatOpenAI\n",
|
56 |
"\n",
|
57 |
+
"openai_chat_model = ChatOpenAI(model=\"gpt-3.5-turbo\", temperature=0)"
|
58 |
]
|
59 |
},
|
60 |
{
|
|
|
455 |
"metadata": {},
|
456 |
"source": [
|
457 |
"## Second pipeline results analysis\n",
|
458 |
+
"- The second pipeline shows the same results as the first one. The context retrieval is still not working properly, despite the larger chunking size."
|
459 |
]
|
460 |
},
|
461 |
{
|
|
|
465 |
"## Upgrading the retrieval strategy\n",
|
466 |
"- While conserving the same chunking strategy as in the first pipeline, I will try to upgrade the retrieval strategy by using the MultiQueryRetriever."
|
467 |
]
|
468 |
+
},
|
469 |
+
{
|
470 |
+
"cell_type": "code",
|
471 |
+
"execution_count": 36,
|
472 |
+
"metadata": {},
|
473 |
+
"outputs": [],
|
474 |
+
"source": [
|
475 |
+
"from langchain.retrievers import MultiQueryRetriever\n",
|
476 |
+
"\n",
|
477 |
+
"multiquery_retriever = MultiQueryRetriever.from_llm(retriever=qdrant_retriever_2, llm=openai_chat_model)"
|
478 |
+
]
|
479 |
+
},
|
480 |
+
{
|
481 |
+
"cell_type": "code",
|
482 |
+
"execution_count": 37,
|
483 |
+
"metadata": {},
|
484 |
+
"outputs": [],
|
485 |
+
"source": [
|
486 |
+
"retrieval_augmented_qa_chain_3 = (\n",
|
487 |
+
"\n",
|
488 |
+
" {\"context\": itemgetter(\"question\") | multiquery_retriever, \"question\": itemgetter(\"question\")}\n",
|
489 |
+
"\n",
|
490 |
+
" | RunnablePassthrough.assign(context=itemgetter(\"context\"))\n",
|
491 |
+
"\n",
|
492 |
+
" | {\"response\": rag_prompt | openai_chat_model, \"context\": itemgetter(\"context\")}\n",
|
493 |
+
")"
|
494 |
+
]
|
495 |
+
},
|
496 |
+
{
|
497 |
+
"cell_type": "code",
|
498 |
+
"execution_count": 38,
|
499 |
+
"metadata": {},
|
500 |
+
"outputs": [
|
501 |
+
{
|
502 |
+
"data": {
|
503 |
+
"text/plain": [
|
504 |
+
"\"The total value of 'Cash and cash equivalents' as of December 31, 2023, was $41,862.\""
|
505 |
+
]
|
506 |
+
},
|
507 |
+
"execution_count": 38,
|
508 |
+
"metadata": {},
|
509 |
+
"output_type": "execute_result"
|
510 |
+
}
|
511 |
+
],
|
512 |
+
"source": [
|
513 |
+
"response_1c = retrieval_augmented_qa_chain.invoke({\"question\" : \"What was the total value of 'Cash and cash equivalents' as of December 31, 2023?\"})\n",
|
514 |
+
"response_1c[\"response\"].content"
|
515 |
+
]
|
516 |
+
},
|
517 |
+
{
|
518 |
+
"cell_type": "code",
|
519 |
+
"execution_count": 39,
|
520 |
+
"metadata": {},
|
521 |
+
"outputs": [
|
522 |
+
{
|
523 |
+
"data": {
|
524 |
+
"text/plain": [
|
525 |
+
"\"Sorry, the context is unrelated to the query, I can't answer.\""
|
526 |
+
]
|
527 |
+
},
|
528 |
+
"execution_count": 39,
|
529 |
+
"metadata": {},
|
530 |
+
"output_type": "execute_result"
|
531 |
+
}
|
532 |
+
],
|
533 |
+
"source": [
|
534 |
+
"response_2c = retrieval_augmented_qa_chain.invoke({\"question\" : \"Who are Meta's 'Directors' (i.e., members of the Board of Directors)?\"})\n",
|
535 |
+
"response_2c[\"response\"].content"
|
536 |
+
]
|
537 |
+
},
|
538 |
+
{
|
539 |
+
"cell_type": "code",
|
540 |
+
"execution_count": 40,
|
541 |
+
"metadata": {},
|
542 |
+
"outputs": [
|
543 |
+
{
|
544 |
+
"data": {
|
545 |
+
"text/plain": [
|
546 |
+
"[Document(page_content='to having a skilled, inclusive and diverse workforce because we believe cognitive diversity fuels innovation. To aid in this effort, we have taken steps to reduce\\nbias from our hiring processes and performance management systems, as well as offering learning and development courses for our employees.\\nCorporate Information\\nWe were incorporated in Delaware in July 2004. We completed our initial public offering in May 2012 and our Class\\xa0A common stock is currently listed\\non the Nasdaq Global Select Market under the symbol \"META.\" Our principal executive offices are located at 1 Meta Way, Menlo Park, California 94025, and\\nour telephone number is (650) 543-4800.\\nMeta, the Meta logo, Meta Quest, Meta Horizon, Facebook, FB, Instagram, Oculus, WhatsApp, Reels, and our other registered or common law', metadata={'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'file_path': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'page': 13, 'total_pages': 147, 'format': 'PDF 1.4', 'title': '0001326801-24-000012', 'author': 'EDGAR® Online LLC, a subsidiary of OTC Markets Group', 'subject': 'Form 10-K filed on 2024-02-02 for the period ending 2023-12-31', 'keywords': '0001326801-24-000012; ; 10-K', 'creator': 'EDGAR Filing HTML Converter', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creationDate': \"D:20240202060356-05'00'\", 'modDate': \"D:20240202060413-05'00'\", 'trapped': '', 'encryption': 'Standard V2 R3 128-bit RC4', '_id': '18b4e9ef498c49b5867839caa18d2e9a', '_collection_name': 'Meta 10-k Fillings'}),\n",
|
547 |
+
" Document(page_content=\"(Exact name of registrant as specified in its charter)\\n__________________________\\nDelaware\\n20-1665019\\n(State or other jurisdiction of incorporation or organization)\\n(I.R.S. Employer Identification Number)\\n1 Meta Way, Menlo Park, California 94025\\n(Address of principal executive offices and Zip Code)\\n(650)\\xa0543-4800\\n(Registrant's telephone number, including area code)\\n__________________________\\nSecurities registered pursuant to Section 12(b) of the Act:\\nTitle of each class\\nTrading symbol(s)\\nName of each exchange on which registered\\nClass A Common Stock, $0.000006 par value\\nMETA\\nThe Nasdaq Stock Market LLC\\nSecurities registered pursuant to Section 12(g) of the Act: None\", metadata={'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'file_path': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'page': 0, 'total_pages': 147, 'format': 'PDF 1.4', 'title': '0001326801-24-000012', 'author': 'EDGAR® Online LLC, a subsidiary of OTC Markets Group', 'subject': 'Form 10-K filed on 2024-02-02 for the period ending 2023-12-31', 'keywords': '0001326801-24-000012; ; 10-K', 'creator': 'EDGAR Filing HTML Converter', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creationDate': \"D:20240202060356-05'00'\", 'modDate': \"D:20240202060413-05'00'\", 'trapped': '', 'encryption': 'Standard V2 R3 128-bit RC4', '_id': '91fb05632758461883fe050d26afeae1', '_collection_name': 'Meta 10-k Fillings'}),\n",
|
548 |
+
" Document(page_content='making to real-time optimizations to post-campaign analytics. We work directly with these advertisers, as well as through advertising agencies and resellers.\\nWe operate offices in approximately 90\\xa0cities around the globe, the majority of which have a sales presence. We also invest in and rely on self-service tools to\\nprovide direct customer support to our users and partners.\\nFor our RL products, our sales and operations efforts utilize third-party sales channels such as retailers, resellers, and our direct-to-consumer channel,\\nMeta.com. These efforts are focused on driving consumer and enterprise sales and adoption of our Meta Quest portfolio of products and Ray-Ban Meta smart\\nglasses.\\nMarketing\\nHistorically, our communities have generally grown organically with people inviting their friends to connect with them, supported by internal efforts to\\nstimulate awareness and interest. In addition, we have invested and will continue to invest in marketing our products and services to grow our brand and help', metadata={'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'file_path': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'page': 9, 'total_pages': 147, 'format': 'PDF 1.4', 'title': '0001326801-24-000012', 'author': 'EDGAR® Online LLC, a subsidiary of OTC Markets Group', 'subject': 'Form 10-K filed on 2024-02-02 for the period ending 2023-12-31', 'keywords': '0001326801-24-000012; ; 10-K', 'creator': 'EDGAR Filing HTML Converter', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creationDate': \"D:20240202060356-05'00'\", 'modDate': \"D:20240202060413-05'00'\", 'trapped': '', 'encryption': 'Standard V2 R3 128-bit RC4', '_id': 'a474117d06bd422387c0fa613b547ee5', '_collection_name': 'Meta 10-k Fillings'}),\n",
|
549 |
+
" Document(page_content='decision-making and prioritization of cybersecurity countermeasures and risk mitigation strategies. Our risk mitigation strategies include a broad variety of\\ntechnical and operational measures, as well as annual cybersecurity and privacy training for all of our employees.\\nIn addition, we maintain specific policies and practices governing our third-party security risks, including our third-party assessment (TPA) process.\\nUnder our TPA process, we gather information from certain third parties who contract with Meta and share or receive data, or have access to or integrate with\\nour systems, in order to help us assess potential risks associated with their security controls. We also generally require third parties to, among other things,\\nmaintain security controls to protect our confidential information and data, and notify us of material data breaches that may impact our data.\\nOur board of directors has oversight of our strategic and business risk management and has delegated cybersecurity risk management oversight to the', metadata={'source': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'file_path': 'https://d18rn0p25nwr6d.cloudfront.net/CIK-0001326801/c7318154-f6ae-4866-89fa-f0c589f2ee3d.pdf', 'page': 51, 'total_pages': 147, 'format': 'PDF 1.4', 'title': '0001326801-24-000012', 'author': 'EDGAR® Online LLC, a subsidiary of OTC Markets Group', 'subject': 'Form 10-K filed on 2024-02-02 for the period ending 2023-12-31', 'keywords': '0001326801-24-000012; ; 10-K', 'creator': 'EDGAR Filing HTML Converter', 'producer': 'EDGRpdf Service w/ EO.Pdf 22.0.40.0', 'creationDate': \"D:20240202060356-05'00'\", 'modDate': \"D:20240202060413-05'00'\", 'trapped': '', 'encryption': 'Standard V2 R3 128-bit RC4', '_id': '5565a5de93004129a382c31ab609c975', '_collection_name': 'Meta 10-k Fillings'})]"
|
550 |
+
]
|
551 |
+
},
|
552 |
+
"execution_count": 40,
|
553 |
+
"metadata": {},
|
554 |
+
"output_type": "execute_result"
|
555 |
+
}
|
556 |
+
],
|
557 |
+
"source": [
|
558 |
+
"response_2c[\"context\"]"
|
559 |
+
]
|
560 |
}
|
561 |
],
|
562 |
"metadata": {
|