yash bhaskar committed on
Commit 386f2a5 · 1 Parent(s): d2dbe42

Updated README.md
Files changed (1): README.md (+129 -10)
README.md CHANGED

---
title: MultiAgent QnA ChatBot
emoji: 🏆
colorFrom: gray
colorTo: green
sdk: gradio
sdk_version: 5.6.0
app_file: app.py
pinned: false
short_description: Multi-Agent Open-Domain QnA with Cross-Source Reranking
  ---

Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference

# Multi-Agent Open-Domain QnA with Cross-Source Reranking

---

## **Introduction**

The objective of this project is to develop a multi-agent open-domain question-answering (ODQA) system capable of retrieving and synthesizing information from diverse sources. These sources include web searches, large language models (LLMs) such as **Llama 3**, and vision models for multi-modal retrieval. Leveraging datasets such as **KILT**, **Natural Questions**, **HotpotQA**, **TriviaQA**, and **ELI5**, the system incorporates a **cross-source reranking model** to improve the selection of the most accurate answers. The project emphasizes scalability and reliability by addressing both context-free and context-based scenarios, even as the volume of irrelevant documents grows.

---

## **Project Overview**

- **Pipeline Development**: Built a multi-agent ODQA pipeline that integrates specialized retrieval agents.
- **Source Diversity**: Retrieved information from web searches, LLMs, and vision models.
- **Cross-Source Reranking**: Applied methods such as Reciprocal Rank Fusion (RRF) to improve answer selection.
- **Scalability Evaluation**: Tested the system on datasets with varying ratios of relevant to irrelevant documents.

---

## **Pipeline**

### **Dataset Construction**

- **Mini-Wiki Collection**:
  A condensed version of the Wikipedia dump, created by selecting the subset of documents relevant to the validation sets of **Natural Questions**, **HotpotQA**, **TriviaQA**, and **ELI5**.

- **Document Ratio Variants**:
  To evaluate retrieval scalability, multiple corpora with different relevant-to-irrelevant document ratios were constructed (see the sketch after this list):
  - **1:0**: 1000 relevant documents for 1000 queries, with no distractors.
  - **1:1**: 1000 relevant documents and 1000 irrelevant documents.
  - **1:2**: 1000 relevant documents and 2000 irrelevant documents.
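
A minimal sketch of how such ratio variants can be assembled, assuming the relevant documents and a distractor pool are held in plain Python lists (the function and variable names are illustrative, not the project's actual code):

```python
import random

def build_ratio_corpus(relevant_docs, distractor_docs, ratio, seed=42):
    """Mix relevant documents with `ratio` times as many sampled distractors.

    relevant_docs:   documents known to answer the queries (the 1:0 core).
    distractor_docs: pool of irrelevant documents to sample from.
    ratio:           distractors per relevant document (0, 1, or 2 here).
    """
    rng = random.Random(seed)
    sampled = rng.sample(distractor_docs, len(relevant_docs) * ratio)
    corpus = relevant_docs + sampled
    rng.shuffle(corpus)
    return corpus

# The three variants described above:
# corpus_1_0 = build_ratio_corpus(relevant, distractors, ratio=0)
# corpus_1_1 = build_ratio_corpus(relevant, distractors, ratio=1)
# corpus_1_2 = build_ratio_corpus(relevant, distractors, ratio=2)
```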

---

## **Retrieval Models**

To ensure robust and efficient retrieval, the project combines sparse and dense methods; minimal sketches of each family follow the corresponding lists.

### **Sparse Retrieval Models**

1. **TF-IDF**:
   Measures the importance of a term in a document relative to the entire collection.
   - Effective for small datasets.
   - Serves as a lightweight and interpretable baseline.

2. **BM25**:
   - Extends TF-IDF with term-frequency saturation and document-length normalization.
   - Handles query-document term overlap better than TF-IDF.

3. **Bag of Words (BOW)**:
   - A simple vector-space model built from term-frequency vectors.
   - Acts as a baseline for comparison with more advanced methods.
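
A minimal sketch of the TF-IDF and BM25 baselines, assuming scikit-learn and the rank_bm25 package are installed; the corpus and query below are placeholders:

```python
from rank_bm25 import BM25Okapi
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

corpus = [
    "The Eiffel Tower is located in Paris.",
    "BM25 is a ranking function used by search engines.",
    "Llama 3 is a large language model released by Meta.",
]
query = "Which city is the Eiffel Tower in?"

# TF-IDF: place documents and query in the same weighted term space, rank by cosine.
vectorizer = TfidfVectorizer()
doc_matrix = vectorizer.fit_transform(corpus)
tfidf_scores = cosine_similarity(vectorizer.transform([query]), doc_matrix)[0]

# BM25: token-overlap scoring with term-frequency saturation and length normalization.
bm25 = BM25Okapi([doc.lower().split() for doc in corpus])
bm25_scores = bm25.get_scores(query.lower().split())

print("TF-IDF:", tfidf_scores.round(3))
print("BM25:  ", bm25_scores.round(3))
```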

---

### **Dense Retrieval Models**

1. **Text Embeddings (all-MiniLM-L6-v2)**:
   - A pre-trained sentence-transformer for generating compact, high-quality embeddings.
   - Captures semantic relationships between queries and documents.
   - Lightweight and suitable for large-scale datasets.

2. **Vision Embeddings (ViT)**:
   - Generates embeddings for image-based data, enabling multi-modal information retrieval.
   - Complements text-based retrieval for answering questions requiring visual context.
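
A minimal sketch of the dense text branch with sentence-transformers; the vision branch would follow the same encode-then-compare pattern with a ViT image encoder instead of the text model (the example data are placeholders):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

corpus = [
    "The Eiffel Tower is located in Paris.",
    "Photosynthesis converts light energy into chemical energy.",
]
query = "Where is the Eiffel Tower?"

# Encode the corpus once and cache it; encode each query at question time.
doc_embeddings = model.encode(corpus, convert_to_tensor=True, normalize_embeddings=True)
query_embedding = model.encode(query, convert_to_tensor=True, normalize_embeddings=True)

# Cosine similarity between the query and every document.
scores = util.cos_sim(query_embedding, doc_embeddings)[0]
best = int(scores.argmax())
print(corpus[best], float(scores[best]))
```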

---

## **Agents for Context Generation**

### **Query Modification Agent**
Rewrites user queries into forms better suited to retrieval, increasing the chance that relevant documents are identified.

### **Keyword Extraction Agent**
Extracts key terms from the query and passes them to a **Wiki Agent**, which uses n-grams over those terms to retrieve relevant Wikipedia pages (see the sketch below).
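
A rough sketch of how keyword n-grams could drive the Wikipedia lookup, assuming the `wikipedia` package for search; the stop-word list and helper names are illustrative rather than the project's actual implementation:

```python
import wikipedia  # pip install wikipedia

STOPWORDS = {"the", "a", "an", "of", "in", "on", "is", "was", "who", "what", "where"}

def extract_ngrams(query, max_n=3):
    """Return candidate keyword n-grams from a query, longest first."""
    tokens = [t for t in query.lower().split() if t not in STOPWORDS]
    ngrams = []
    for n in range(min(max_n, len(tokens)), 0, -1):
        for i in range(len(tokens) - n + 1):
            ngrams.append(" ".join(tokens[i:i + n]))
    return ngrams

def wiki_agent(query, top_k=3):
    """Search Wikipedia with the extracted n-grams and collect candidate page titles."""
    titles = []
    for ngram in extract_ngrams(query):
        for title in wikipedia.search(ngram, results=top_k):
            if title not in titles:
                titles.append(title)
        if len(titles) >= top_k:
            break
    return titles[:top_k]

# Example: wiki_agent("Who designed the Eiffel Tower?") -> candidate Wikipedia page titles.
```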

### **Llama 3 Agent**
Synthesizes context directly related to the user query, enriching the system’s ability to answer complex questions.
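
A hedged sketch of such a context-synthesis agent using the Hugging Face transformers text-generation pipeline; the model ID, system prompt, and generation settings are assumptions, not the project's exact configuration:

```python
from transformers import pipeline

# Llama 3 as a generative context agent (gated model; any instruct LLM fits this pattern).
generator = pipeline(
    "text-generation",
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    device_map="auto",
)

def llama_context_agent(query, max_new_tokens=256):
    """Ask the LLM for background context about the query rather than a direct answer."""
    messages = [
        {"role": "system", "content": "Write a short, factual background passage that "
                                      "would help answer the question. Do not answer it directly."},
        {"role": "user", "content": query},
    ]
    output = generator(messages, max_new_tokens=max_new_tokens, do_sample=False)
    # The pipeline returns the whole chat; the synthesized context is the last assistant turn.
    return output[0]["generated_text"][-1]["content"]

# context = llama_context_agent("Why does the Eiffel Tower grow taller in summer?")
```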

---

## **Post-Retrieval Process**

1. **Top-Ranked Document as Context**
   The highest-ranked document was used directly as context for QnA tasks.

2. **Iterative Use of Ranked Documents**
   Explored answers using documents ranked in descending order of relevance.

3. **Rank Fusion (RRF)**
   Combined rankings from multiple retrieval methods (e.g., BM25, TF-IDF, MiniLM) to improve robustness and accuracy, as sketched after this list.
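
A minimal sketch of Reciprocal Rank Fusion over the per-method rankings, assuming each retriever returns a list of document IDs ordered best-first (k=60 is the commonly used constant, not necessarily the project's value):

```python
from collections import defaultdict

def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several rankings into one.

    rankings: dict mapping method name -> list of doc IDs, best first.
    Each document scores sum(1 / (k + rank)) over the methods that return it.
    """
    fused = defaultdict(float)
    for method, ranked_docs in rankings.items():
        for rank, doc_id in enumerate(ranked_docs, start=1):
            fused[doc_id] += 1.0 / (k + rank)
    return sorted(fused.items(), key=lambda item: item[1], reverse=True)

# Example with three retrievers that disagree on the ordering:
rankings = {
    "bm25":   ["d3", "d1", "d7"],
    "tfidf":  ["d1", "d3", "d2"],
    "minilm": ["d1", "d7", "d3"],
}
print(reciprocal_rank_fusion(rankings))  # documents ranked highly by several methods win
```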

---

## **Results and Evaluation**

### **Retrieval Model Scores**

| **Method**        | **Query Type** | **Ranking Score Range** |
|-------------------|----------------|-------------------------|
| **BOW**           | Modified       | 13.82 - 33.39           |
| **BM25**          | Modified       | 736.74 - 785.09         |
| **TF-IDF**        | Modified       | 730.61 - 788.87         |
| **Vision**        | Modified       | 0.03 - 5.08             |
| **MiniLM (Open)** | Modified       | 827.92 - 849.79         |

### **Question Answering Model Scores**

- **ROUGE Score**: Demonstrated improvements with RRF across most datasets.
- **Cosine Similarity Score**: Highlighted semantic alignment in dense methods.
- **BERT F1 Score**: Dense embeddings outperformed sparse methods.
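
For reference, a minimal sketch of how these three metrics can be computed for a single prediction/reference pair, assuming the rouge-score, bert-score, and sentence-transformers packages (the example strings are placeholders):

```python
from bert_score import score as bert_score
from rouge_score import rouge_scorer
from sentence_transformers import SentenceTransformer, util

prediction = "The Eiffel Tower is in Paris, France."
reference = "The Eiffel Tower is located in Paris."

# ROUGE-L: longest-common-subsequence overlap between prediction and reference.
rouge = rouge_scorer.RougeScorer(["rougeL"], use_stemmer=True)
rouge_l = rouge.score(reference, prediction)["rougeL"].fmeasure

# Cosine similarity of sentence embeddings (semantic agreement).
embedder = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")
emb = embedder.encode([prediction, reference], convert_to_tensor=True)
cosine = float(util.cos_sim(emb[0], emb[1]))

# BERTScore F1: token-level matching with contextual embeddings.
_, _, f1 = bert_score([prediction], [reference], lang="en")
print(f"ROUGE-L: {rouge_l:.3f}  Cosine: {cosine:.3f}  BERT-F1: {float(f1[0]):.3f}")
```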

---

## **Analysis**

1. **Sparse Models**:
   - Sparse methods like **BM25** and **TF-IDF** performed well on context-free datasets but struggled with context-based tasks.
   - **BOW** and **Vision** models were ineffective, worsening LLM performance compared to zero-shot baselines.

2. **Dense Models**:
   - Dense retrieval methods showed significant improvements in relevance and answer accuracy, especially when combining results with RRF.

3. **Cross-Source Reranking**:
   - RRF combining **BM25**, **TF-IDF**, and **MiniLM** yielded the best results.
   - Using LLMs as rerankers was less reliable, with a bias toward zero-shot outputs.

---

## **Conclusion**

The multi-agent ODQA system successfully integrates sparse and dense retrieval methods, leveraging RRF for cross-source reranking. Dense methods and generative agents like **Llama 3** significantly enhance the system’s capability in open-domain settings. Future work can focus on improving multi-modal integration and reducing biases in LLM-based reranking.