swaroop-uddandarao committed
Commit 923b896 · 1 Parent(s): 8f24b96

added reports
README.md CHANGED
@@ -1,13 +1,118 @@
- ---
- title: RagBenchCapstone10
- emoji: 📉
- colorFrom: green
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.16.0
- app_file: app.py
- pinned: false
- short_description: RagBench Dataset development by Saiteja
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # RAG Benchmark Evaluation System
+
+ ## Overview
+
+ This project implements a Retrieval-Augmented Generation (RAG) system for evaluating different language models and reranking strategies. It provides a user-friendly interface for querying documents and analyzing the performance of various models.
+
+ ## Features
+
+ - Multiple LLM support (LLaMA 3.3, Mistral 7B)
+ - Various reranking models:
+   - MS MARCO MiniLM
+   - MS MARCO TinyBERT
+   - MonoT5 Base
+   - MonoT5 Small
+   - MonoT5 3B
+ - Vector similarity search using Milvus (see the retrieval sketch after this list)
+ - Automatic document chunking and retrieval
+ - Performance metrics calculation
+ - Interactive Gradio interface
+
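+ As a rough illustration of the retrieval path, the sketch below embeds a query and runs a similarity search against a Milvus collection. It assumes a running Milvus server, and the collection and field names (`rag_chunks`, `embedding`, `text`) plus the encoder model are illustrative assumptions, not the project's actual configuration:
+
+ ```python
+ # Hedged sketch: query embedding + vector similarity search in Milvus.
+ # Collection/field names and the encoder are illustrative assumptions.
+ from pymilvus import connections, Collection
+ from sentence_transformers import SentenceTransformer
+
+ connections.connect(host="localhost", port="19530")
+ collection = Collection("rag_chunks")          # hypothetical collection name
+
+ encoder = SentenceTransformer("all-MiniLM-L6-v2")
+ query_vec = encoder.encode(["What does the benchmark measure?"]).tolist()
+
+ hits = collection.search(
+     data=query_vec,
+     anns_field="embedding",                    # hypothetical vector field
+     param={"metric_type": "IP", "params": {"nprobe": 10}},
+     limit=5,
+     output_fields=["text"],                    # hypothetical chunk-text field
+ )
+ for hit in hits[0]:
+     print(hit.distance, hit.entity.get("text"))
+ ```
+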
+ ## Prerequisites
+
+ - Python 3.8+
+ - CUDA-compatible GPU (optional, for faster processing)
+
+ ## Installation
+
+ 1. Clone the repository:
+
+    ```bash
+    git clone https://github.com/yourusername/rag-benchmark.git
+    cd rag-benchmark
+    ```
+
+ 2. Install dependencies:
+
+    - `pip install -r requirements.txt`
+
+ 3. Configure the models:
+
+    - Create a `models` directory and add your language model files.
+    - Create a `rerankers` directory and add your reranking model files.
+
+ 4. Run the application:
+
+    - `python app.py`
+
+ ## Usage
+
+ 1. Start the application (see Installation step 4).
+
+ 2. Access the web interface at `http://localhost:7860`.
+
+ 3. Enter your question and select:
+
+    - LLM Model (LLaMA 3.3 or Mistral 7B)
+    - Reranking Model (MS MARCO or MonoT5 variants)
+
+ 4. Click "Evaluate Model" to get results (a minimal interface sketch follows below).
+
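+ As a rough illustration only (the actual logic lives in `app.py`), an evaluation interface with these inputs could be wired up with Gradio as sketched below; the `evaluate` function body is a placeholder, not the project's pipeline:
+
+ ```python
+ # Hedged sketch: a Gradio interface with the inputs described above.
+ import gradio as gr
+
+ def evaluate(question, llm_name, reranker_name):
+     # Placeholder: the real app retrieves, reranks, generates, and scores.
+     return f"{llm_name} + {reranker_name}: answer to '{question}'"
+
+ demo = gr.Interface(
+     fn=evaluate,
+     inputs=[
+         gr.Textbox(label="Question"),
+         gr.Dropdown(["LLaMA 3.3", "Mistral 7B"], label="LLM Model"),
+         gr.Dropdown(
+             ["MS MARCO MiniLM", "MS MARCO TinyBERT",
+              "MonoT5 Base", "MonoT5 Small", "MonoT5 3B"],
+             label="Reranking Model",
+         ),
+     ],
+     outputs=gr.Textbox(label="Result"),
+ )
+
+ if __name__ == "__main__":
+     demo.launch()  # serves on http://localhost:7860 by default
+ ```
+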
+ ## Metrics
+
+ The system calculates several performance metrics (a small computation sketch follows the list):
+
+ - RMSE Context Relevance
+ - RMSE Context Utilization
+ - AUCROC Adherence
+ - Processing Time
+
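+ As an illustration of how such scores can be computed (toy numbers, not project data, assuming per-query predicted scores alongside RAGBench-style ground-truth annotations):
+
+ ```python
+ # Hedged sketch: RMSE for the context metrics, AUROC for adherence.
+ import numpy as np
+ from sklearn.metrics import roc_auc_score
+
+ def rmse(pred, true):
+     pred, true = np.asarray(pred, float), np.asarray(true, float)
+     return float(np.sqrt(np.mean((pred - true) ** 2)))
+
+ # Toy per-query scores in [0, 1]; adherence ground truth is binary.
+ pred_relevance, true_relevance = [0.8, 0.4, 0.9], [1.0, 0.5, 0.7]
+ pred_adherence, true_adherence = [0.9, 0.2, 0.7], [1, 0, 1]
+
+ print("RMSE context relevance:", rmse(pred_relevance, true_relevance))
+ print("AUCROC adherence:", roc_auc_score(true_adherence, pred_adherence))
+ ```
+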
+ ## Reranking Models Comparison
+
+ ### MS MARCO Models
+
+ - **MiniLM**: Fast and efficient, good general performance (usage sketch after this list)
+ - **TinyBERT**: Lightweight, slightly lower accuracy but faster
+
+ ### MonoT5 Models
+
+ - **Small**: Compact and fast, suitable for limited resources
+ - **Base**: Balanced performance and speed
+ - **3B**: Highest accuracy, requires more computational resources
+
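+ For reference, a cross-encoder reranker such as MS MARCO MiniLM can be applied roughly as sketched below (using `sentence-transformers`; the model ID and passages are illustrative, and the MonoT5 variants use a different, seq2seq-style interface):
+
+ ```python
+ # Hedged sketch: reranking retrieved passages with an MS MARCO cross-encoder.
+ from sentence_transformers import CrossEncoder
+
+ reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
+
+ query = "What metrics does the benchmark report?"
+ passages = [
+     "The system reports RMSE for context relevance and utilization.",
+     "Gradio provides the web interface on port 7860.",
+ ]
+
+ scores = reranker.predict([(query, p) for p in passages])
+ ranked = [p for _, p in sorted(zip(scores, passages), reverse=True)]
+ print(ranked[0])  # highest-scoring passage first
+ ```
+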
+ ## Error Handling
+
+ - Automatic fallback to fewer documents if token limits are exceeded (sketched after this list)
+ - Graceful handling of API timeouts
+ - Comprehensive error logging
+
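+ A minimal sketch of what such a token-limit fallback could look like; the function names, prompt template, and token budget are illustrative assumptions, not the actual `app.py` implementation:
+
+ ```python
+ # Hedged sketch: drop the lowest-ranked documents until the prompt fits.
+ import tiktoken
+
+ MAX_PROMPT_TOKENS = 6000  # assumed budget, not a project constant
+ enc = tiktoken.get_encoding("cl100k_base")
+
+ def build_prompt(question, docs):
+     context = "\n\n".join(docs)
+     return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
+
+ def prompt_with_fallback(question, docs):
+     """Assumes docs are sorted best-first; trims from the tail until it fits."""
+     while docs:
+         prompt = build_prompt(question, docs)
+         if len(enc.encode(prompt)) <= MAX_PROMPT_TOKENS:
+             return prompt, docs
+         docs = docs[:-1]
+     raise ValueError("Even a single document exceeds the token budget")
+ ```
+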
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
+
+ ## Dependencies
+
+ - gradio
+ - torch
+ - transformers
+ - sentence-transformers
+ - pymilvus
+ - numpy
+ - pandas
+ - scikit-learn
+ - tiktoken
+ - groq
+ - huggingface_hub
+
+ ## License
+
+ [Your License Here]
+
+ ## Acknowledgments
+
+ - RAGBench dataset
+ - Hugging Face Transformers
+ - Milvus Vector Database
+ - Groq API
report/Scores for RAGBenchCapstone.xlsx ADDED
Binary file (39.5 kB)
 
report/analyze_scores.py ADDED
@@ -0,0 +1,147 @@
+ import os
+
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import numpy as np
+
+ def load_and_preprocess_data(file_path):
+     # Read Excel file, skipping the first 2 rows
+     df = pd.read_excel(file_path, skiprows=2)
+
+     # Extract data for each configuration using column letters
+     milvus_llama = df.iloc[:, 2:8].copy()  # Columns C to H
+     milvus_llama.columns = ['RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC',
+                             'Retrieval_Time', 'Context_Relevance', 'Context_Utilization']
+
+     weaviate_mistral = df.iloc[:, 9:16].copy()  # Columns J to P
+     weaviate_mistral.columns = ['Retrieval_Time', 'Context_Rel', 'Util',
+                                 'Adherence', 'RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']
+
+     milvus_mistral = df.iloc[:, 17:24].copy()  # Columns R to X
+     milvus_mistral.columns = ['Retrieval_Time', 'Context_Rel', 'Util',
+                               'Adherence', 'RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']
+
+     # Replace 'na' with NaN and convert to float
+     milvus_llama = milvus_llama.replace('na', np.nan).astype(float)
+     weaviate_mistral = weaviate_mistral.replace('na', np.nan).astype(float)
+     milvus_mistral = milvus_mistral.replace('na', np.nan).astype(float)
+
+     return milvus_llama, weaviate_mistral, milvus_mistral
+
+ def create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral):
+     plt.style.use('default')  # Using default style instead of seaborn
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+
+     # Retrieval Time Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['Retrieval_Time'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['Retrieval_Time'].dropna(),
+         'Milvus + Mistral': milvus_mistral['Retrieval_Time'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[0, 0])
+     axes[0, 0].set_title('Retrieval Time Comparison')
+     axes[0, 0].set_ylabel('Time (seconds)')
+     axes[0, 0].tick_params(axis='x', rotation=45)
+
+     # RMSE Context Relevance Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['RMSE_Context_Rel'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['RMSE_Context_Rel'].dropna(),
+         'Milvus + Mistral': milvus_mistral['RMSE_Context_Rel'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[0, 1])
+     axes[0, 1].set_title('RMSE Context Relevance')
+     axes[0, 1].tick_params(axis='x', rotation=45)
+
+     # RMSE Context Utilization Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['RMSE_Context_Util'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['RMSE_Context_Util'].dropna(),
+         'Milvus + Mistral': milvus_mistral['RMSE_Context_Util'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[1, 0])
+     axes[1, 0].set_title('RMSE Context Utilization')
+     axes[1, 0].tick_params(axis='x', rotation=45)
+
+     # AUROC Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['AUCROC'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['AUCROC'].dropna(),
+         'Milvus + Mistral': milvus_mistral['AUCROC'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[1, 1])
+     axes[1, 1].set_title('AUROC Scores')
+     axes[1, 1].tick_params(axis='x', rotation=45)
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/performance_comparison.png', dpi=300, bbox_inches='tight')
+     plt.close()
+
+ def create_correlation_heatmaps(milvus_llama, weaviate_mistral, milvus_mistral):
+     plt.figure(figsize=(20, 6))
+
+     # Create correlation heatmaps for each configuration
+     plt.subplot(1, 3, 1)
+     sns.heatmap(milvus_llama.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
+     plt.title('Milvus + LLaMA Correlations')
+
+     plt.subplot(1, 3, 2)
+     sns.heatmap(weaviate_mistral.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
+     plt.title('Weaviate + Mistral Correlations')
+
+     plt.subplot(1, 3, 3)
+     sns.heatmap(milvus_mistral.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
+     plt.title('Milvus + Mistral Correlations')
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/correlation_heatmaps.png', dpi=300, bbox_inches='tight')
+     plt.close()
+
+ def create_violin_plots(milvus_llama, weaviate_mistral, milvus_mistral):
+     metrics = ['RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']
+
+     plt.figure(figsize=(15, 5))
+     for i, metric in enumerate(metrics, 1):
+         plt.subplot(1, 3, i)
+         data = {
+             'Milvus + LLaMA': milvus_llama[metric].dropna(),
+             'Weaviate + Mistral': weaviate_mistral[metric].dropna(),
+             'Milvus + Mistral': milvus_mistral[metric].dropna()
+         }
+         sns.violinplot(data=pd.DataFrame(data))
+         plt.title(f'{metric} Distribution')
+         plt.xticks(rotation=45)
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/metric_distributions.png', dpi=300, bbox_inches='tight')
+     plt.close()
+
+ def print_summary_statistics(milvus_llama, weaviate_mistral, milvus_mistral):
+     print("\nSummary Statistics:")
+
+     print("\nMilvus + LLaMA:")
+     print(milvus_llama.describe().round(4))
+
+     print("\nWeaviate + Mistral:")
+     print(weaviate_mistral.describe().round(4))
+
+     print("\nMilvus + Mistral:")
+     print(milvus_mistral.describe().round(4))
+
+ def main():
+     # Create visualizations directory
+     os.makedirs("report/visualizations", exist_ok=True)
+
+     # Load data
+     milvus_llama, weaviate_mistral, milvus_mistral = load_and_preprocess_data("report/Scores for RAGBenchCapstone.xlsx")
+
+     # Create visualizations
+     create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral)
+     create_correlation_heatmaps(milvus_llama, weaviate_mistral, milvus_mistral)
+     create_violin_plots(milvus_llama, weaviate_mistral, milvus_mistral)
+
+     # Print statistics
+     print_summary_statistics(milvus_llama, weaviate_mistral, milvus_mistral)
+
+ if __name__ == "__main__":
+     main()
report/finalreport.md ADDED
@@ -0,0 +1,63 @@
+ Performance Analysis Report
+ ===========================
+
+ 1. **Retrieval Time**:
+    - Milvus + LLaMA: 0.132s
+    - Weaviate + Mistral: 0.157s
+    - Milvus + Mistral: NaN
+
+ 2. **Context Relevance** (higher is better):
+    - Milvus + LLaMA: 0.640
+    - Weaviate + Mistral: 0.591
+    - Milvus + Mistral: 0.518
+
+ 3. **Context Utilization** (higher is better):
+    - Milvus + LLaMA: 0.673
+    - Weaviate + Mistral: 0.619
+    - Milvus + Mistral: 0.614
+
+ 4. **AUCROC** (Area Under ROC Curve):
+    - Milvus + LLaMA: 0.912
+    - Weaviate + Mistral: 0.750
+    - Milvus + Mistral: 0.844
+
+ 5. **RMSE** (Root Mean Square Error):
+    - Milvus + LLaMA:
+      * Context Relevance RMSE: 0.179
+      * Context Utilization RMSE: 0.302
+    - Weaviate + Mistral:
+      * Context Relevance RMSE: 0.414
+      * Context Utilization RMSE: 0.482
+    - Milvus + Mistral:
+      * Context Relevance RMSE: 0.167
+      * Context Utilization RMSE: 0.258
+
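+ For reference, RMSE here denotes the usual root mean square deviation between the system's predicted score $\hat{s}_i$ and the annotated score $s_i$ over the $n$ evaluated queries:
+
+ $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{s}_i - s_i\right)^2}$$
+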
+ Analysis
+ --------
+ 1. **Best Overall Performance: Milvus + LLaMA**
+    - Highest AUCROC score (0.912)
+    - Best context relevance (0.640) and utilization (0.673)
+    - Fast retrieval time (0.132s)
+    - Moderate RMSE scores
+
+ 2. **Runner-up: Milvus + Mistral**
+    - Second-best AUCROC (0.844)
+    - Lowest RMSE scores overall
+    - Lower context relevance and utilization
+    - Retrieval time data unavailable
+
+ 3. **Third Place: Weaviate + Mistral**
+    - Lowest AUCROC (0.750)
+    - Highest RMSE scores
+    - Slowest retrieval time (0.157s)
+    - Moderate context metrics
+
+ Recommendation
+ --------------
+ Based on the comprehensive analysis of all metrics, Milvus + LLaMA emerges as the optimal choice for overall performance. It demonstrates:
+ - Superior accuracy (highest AUCROC)
+ - Better context handling capabilities
+ - Efficient retrieval speed
+ - Reasonable error rates
+
+ However, if minimizing error (RMSE) is the primary objective, Milvus + Mistral could be a viable alternative due to its lower error rates in both context relevance and utilization metrics.
report/visualizations/correlation_heatmaps.png ADDED
report/visualizations/metric_distributions.png ADDED
report/visualizations/performance_comparison.png ADDED
report/visualizations/summary_statistics.txt ADDED
@@ -0,0 +1,34 @@
+ Summary Statistics:
+
+ Milvus + LLaMA:
+       RMSE_Context_Rel RMSE_Context_Util AUCROC Retrieval_Time Context_Relevance Context_Utilization
+ count 19.0000 19.0000 17.0000 19.0000 19.0000 19.0000
+ mean 0.1786 0.3022 0.9118 0.1322 0.6402 0.6729
+ std 0.2014 0.3444 0.1965 0.0288 0.2923 0.2889
+ min 0.0000 0.0008 0.5000 0.1008 0.0000 0.0000
+ 25% 0.0211 0.0559 1.0000 0.1145 0.4083 0.4583
+ 50% 0.1160 0.1033 1.0000 0.1233 0.6667 0.6667
+ 75% 0.2826 0.5837 1.0000 0.1348 0.9500 1.0000
+ max 0.5625 0.9654 1.0000 0.1954 1.0000 1.0000
+
+ Weaviate + Mistral:
+       Retrieval_Time Context_Rel Util Adherence RMSE_Context_Rel RMSE_Context_Util AUCROC
+ count 19.0000 18.0000 18.0000 18.0000 16.0000 16.0000 14.0000
+ mean 0.1565 0.5913 0.6190 0.3889 0.4139 0.4824 0.7500
+ std 0.0286 0.4559 0.4324 0.5016 0.3231 0.3404 0.2594
+ min 0.1085 0.0000 0.0000 0.0000 0.0035 0.0035 0.5000
+ 25% 0.1254 0.0357 0.1905 0.0000 0.1426 0.1702 0.5000
+ 50% 0.1685 0.8333 0.8333 0.0000 0.3212 0.5028 0.7500
+ 75% 0.1720 1.0000 1.0000 1.0000 0.7199 0.8521 1.0000
+ max 0.2132 1.0000 1.0000 1.0000 0.9481 0.9481 1.0000
+
+ Milvus + Mistral:
+       Retrieval_Time Context_Rel Util Adherence RMSE_Context_Rel RMSE_Context_Util AUCROC
+ count 0.0 19.0000 19.0000 18.0000 18.0000 18.0000 16.0000
+ mean NaN 0.5176 0.6144 0.7847 0.1665 0.2575 0.8438
+ std NaN 0.3481 0.3511 0.3878 0.1819 0.2662 0.2394
+ min NaN 0.0000 0.0000 0.0000 0.0017 0.0031 0.5000
+ 25% NaN 0.2917 0.3542 0.7188 0.0408 0.0625 0.5000
+ 50% NaN 0.5000 0.6111 1.0000 0.0808 0.1627 1.0000
+ 75% NaN 0.8397 1.0000 1.0000 0.2500 0.3448 1.0000
+ max NaN 1.0000 1.0000 1.0000 0.6049 0.8711 1.0000
requirements.txt CHANGED
@@ -5,4 +5,10 @@ huggingface_hub
  pymilvus
  nltk
  sentence-transformers
- Groq
+ Groq
+ pandas
+ openpyxl
+ matplotlib
+ seaborn
+ numpy
+