swaroop-uddandarao committed
Commit 923b896 · 1 Parent(s): 8f24b96

added reports
README.md CHANGED
@@ -1,13 +1,118 @@
- ---
- title: RagBenchCapstone10
- emoji: 📉
- colorFrom: green
- colorTo: yellow
- sdk: gradio
- sdk_version: 5.16.0
- app_file: app.py
- pinned: false
- short_description: RagBench Dataset development by Saiteja
- ---
-
- Check out the configuration reference at https://huggingface.co/docs/hub/spaces-config-reference
+ # RAG Benchmark Evaluation System
+
+ ## Overview
+
+ This project implements a Retrieval-Augmented Generation (RAG) system for evaluating different language models and reranking strategies. It provides a user-friendly interface for querying documents and analyzing the performance of various models.
+
+ ## Features
+
+ - Multiple LLM support (LLaMA 3.3, Mistral 7B)
+ - Various reranking models:
+   - MS MARCO MiniLM
+   - MS MARCO TinyBERT
+   - MonoT5 Base
+   - MonoT5 Small
+   - MonoT5 3B
+ - Vector similarity search using Milvus (see the retrieval sketch after this list)
+ - Automatic document chunking and retrieval
+ - Performance metrics calculation
+ - Interactive Gradio interface
+
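+ As a rough illustration of the retrieval path, the sketch below embeds a query and runs a similarity search against a Milvus collection. It assumes a running Milvus server, and the collection and field names (`rag_chunks`, `embedding`, `text`) plus the encoder model are illustrative assumptions, not the project's actual configuration:
+
+ ```python
+ # Hedged sketch: query embedding + vector similarity search in Milvus.
+ # Collection/field names and the encoder are illustrative assumptions.
+ from pymilvus import connections, Collection
+ from sentence_transformers import SentenceTransformer
+
+ connections.connect(host="localhost", port="19530")
+ collection = Collection("rag_chunks")          # hypothetical collection name
+
+ encoder = SentenceTransformer("all-MiniLM-L6-v2")
+ query_vec = encoder.encode(["What does the benchmark measure?"]).tolist()
+
+ hits = collection.search(
+     data=query_vec,
+     anns_field="embedding",                    # hypothetical vector field
+     param={"metric_type": "IP", "params": {"nprobe": 10}},
+     limit=5,
+     output_fields=["text"],                    # hypothetical chunk-text field
+ )
+ for hit in hits[0]:
+     print(hit.distance, hit.entity.get("text"))
+ ```
+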
+ ## Prerequisites
+
+ - Python 3.8+
+ - CUDA-compatible GPU (optional, for faster processing)
+
+ ## Installation
+
+ 1. Clone the repository:
+
+    ```bash
+    git clone https://github.com/yourusername/rag-benchmark.git
+    cd rag-benchmark
+    ```
+
+ 2. Install dependencies:
+
+    - `pip install -r requirements.txt`
+
+ 3. Configure the models:
+
+    - Create a `models` directory and add your language model files.
+    - Create a `rerankers` directory and add your reranking model files.
+
+ 4. Run the application:
+
+    - `python app.py`
+
+ ## Usage
+
+ 1. Start the application (see Installation step 4).
+
+ 2. Access the web interface at `http://localhost:7860`.
+
+ 3. Enter your question and select:
+
+    - LLM Model (LLaMA 3.3 or Mistral 7B)
+    - Reranking Model (MS MARCO or MonoT5 variants)
+
+ 4. Click "Evaluate Model" to get results (a minimal interface sketch follows below).
+
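+ As a rough illustration only (the actual logic lives in `app.py`), an evaluation interface with these inputs could be wired up with Gradio as sketched below; the `evaluate` function body is a placeholder, not the project's pipeline:
+
+ ```python
+ # Hedged sketch: a Gradio interface with the inputs described above.
+ import gradio as gr
+
+ def evaluate(question, llm_name, reranker_name):
+     # Placeholder: the real app retrieves, reranks, generates, and scores.
+     return f"{llm_name} + {reranker_name}: answer to '{question}'"
+
+ demo = gr.Interface(
+     fn=evaluate,
+     inputs=[
+         gr.Textbox(label="Question"),
+         gr.Dropdown(["LLaMA 3.3", "Mistral 7B"], label="LLM Model"),
+         gr.Dropdown(
+             ["MS MARCO MiniLM", "MS MARCO TinyBERT",
+              "MonoT5 Base", "MonoT5 Small", "MonoT5 3B"],
+             label="Reranking Model",
+         ),
+     ],
+     outputs=gr.Textbox(label="Result"),
+ )
+
+ if __name__ == "__main__":
+     demo.launch()  # serves on http://localhost:7860 by default
+ ```
+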
+ ## Metrics
+
+ The system calculates several performance metrics (a small computation sketch follows the list):
+
+ - RMSE Context Relevance
+ - RMSE Context Utilization
+ - AUCROC Adherence
+ - Processing Time
+
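+ As an illustration of how such scores can be computed (toy numbers, not project data, assuming per-query predicted scores alongside RAGBench-style ground-truth annotations):
+
+ ```python
+ # Hedged sketch: RMSE for the context metrics, AUROC for adherence.
+ import numpy as np
+ from sklearn.metrics import roc_auc_score
+
+ def rmse(pred, true):
+     pred, true = np.asarray(pred, float), np.asarray(true, float)
+     return float(np.sqrt(np.mean((pred - true) ** 2)))
+
+ # Toy per-query scores in [0, 1]; adherence ground truth is binary.
+ pred_relevance, true_relevance = [0.8, 0.4, 0.9], [1.0, 0.5, 0.7]
+ pred_adherence, true_adherence = [0.9, 0.2, 0.7], [1, 0, 1]
+
+ print("RMSE context relevance:", rmse(pred_relevance, true_relevance))
+ print("AUCROC adherence:", roc_auc_score(true_adherence, pred_adherence))
+ ```
+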
+ ## Reranking Models Comparison
+
+ ### MS MARCO Models
+
+ - **MiniLM**: Fast and efficient, good general performance (usage sketch after this list)
+ - **TinyBERT**: Lightweight, slightly lower accuracy but faster
+
+ ### MonoT5 Models
+
+ - **Small**: Compact and fast, suitable for limited resources
+ - **Base**: Balanced performance and speed
+ - **3B**: Highest accuracy, requires more computational resources
+
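+ For reference, a cross-encoder reranker such as MS MARCO MiniLM can be applied roughly as sketched below (using `sentence-transformers`; the model ID and passages are illustrative, and the MonoT5 variants use a different, seq2seq-style interface):
+
+ ```python
+ # Hedged sketch: reranking retrieved passages with an MS MARCO cross-encoder.
+ from sentence_transformers import CrossEncoder
+
+ reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")
+
+ query = "What metrics does the benchmark report?"
+ passages = [
+     "The system reports RMSE for context relevance and utilization.",
+     "Gradio provides the web interface on port 7860.",
+ ]
+
+ scores = reranker.predict([(query, p) for p in passages])
+ ranked = [p for _, p in sorted(zip(scores, passages), reverse=True)]
+ print(ranked[0])  # highest-scoring passage first
+ ```
+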
+ ## Error Handling
+
+ - Automatic fallback to fewer documents if token limits are exceeded (sketched after this list)
+ - Graceful handling of API timeouts
+ - Comprehensive error logging
+
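+ A minimal sketch of what such a token-limit fallback could look like; the function names, prompt template, and token budget are illustrative assumptions, not the actual `app.py` implementation:
+
+ ```python
+ # Hedged sketch: drop the lowest-ranked documents until the prompt fits.
+ import tiktoken
+
+ MAX_PROMPT_TOKENS = 6000  # assumed budget, not a project constant
+ enc = tiktoken.get_encoding("cl100k_base")
+
+ def build_prompt(question, docs):
+     context = "\n\n".join(docs)
+     return f"Answer using only this context:\n{context}\n\nQuestion: {question}"
+
+ def prompt_with_fallback(question, docs):
+     """Assumes docs are sorted best-first; trims from the tail until it fits."""
+     while docs:
+         prompt = build_prompt(question, docs)
+         if len(enc.encode(prompt)) <= MAX_PROMPT_TOKENS:
+             return prompt, docs
+         docs = docs[:-1]
+     raise ValueError("Even a single document exceeds the token budget")
+ ```
+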
+ ## Contributing
+
+ 1. Fork the repository
+ 2. Create your feature branch (`git checkout -b feature/AmazingFeature`)
+ 3. Commit your changes (`git commit -m 'Add some AmazingFeature'`)
+ 4. Push to the branch (`git push origin feature/AmazingFeature`)
+ 5. Open a Pull Request
+
+ ## Dependencies
+
+ - gradio
+ - torch
+ - transformers
+ - sentence-transformers
+ - pymilvus
+ - numpy
+ - pandas
+ - scikit-learn
+ - tiktoken
+ - groq
+ - huggingface_hub
+
+ ## License
+
+ [Your License Here]
+
+ ## Acknowledgments
+
+ - RAGBench dataset
+ - Hugging Face Transformers
+ - Milvus Vector Database
+ - Groq API
report/Scores for RAGBenchCapstone.xlsx ADDED
Binary file (39.5 kB)
 
report/analyze_scores.py ADDED
@@ -0,0 +1,147 @@
+ import os
+
+ import pandas as pd
+ import matplotlib.pyplot as plt
+ import seaborn as sns
+ import numpy as np
+
+ def load_and_preprocess_data(file_path):
+     # Read Excel file, skipping the first 2 rows
+     df = pd.read_excel(file_path, skiprows=2)
+
+     # Extract data for each configuration using column letters
+     milvus_llama = df.iloc[:, 2:8].copy()  # Columns C to H
+     milvus_llama.columns = ['RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC',
+                             'Retrieval_Time', 'Context_Relevance', 'Context_Utilization']
+
+     weaviate_mistral = df.iloc[:, 9:16].copy()  # Columns J to P
+     weaviate_mistral.columns = ['Retrieval_Time', 'Context_Rel', 'Util',
+                                 'Adherence', 'RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']
+
+     milvus_mistral = df.iloc[:, 17:24].copy()  # Columns R to X
+     milvus_mistral.columns = ['Retrieval_Time', 'Context_Rel', 'Util',
+                               'Adherence', 'RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']
+
+     # Replace 'na' with NaN and convert to float
+     milvus_llama = milvus_llama.replace('na', np.nan).astype(float)
+     weaviate_mistral = weaviate_mistral.replace('na', np.nan).astype(float)
+     milvus_mistral = milvus_mistral.replace('na', np.nan).astype(float)
+
+     return milvus_llama, weaviate_mistral, milvus_mistral
+
+ def create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral):
+     plt.style.use('default')  # Using default style instead of seaborn
+     fig, axes = plt.subplots(2, 2, figsize=(15, 12))
+
+     # Retrieval Time Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['Retrieval_Time'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['Retrieval_Time'].dropna(),
+         'Milvus + Mistral': milvus_mistral['Retrieval_Time'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[0, 0])
+     axes[0, 0].set_title('Retrieval Time Comparison')
+     axes[0, 0].set_ylabel('Time (seconds)')
+     axes[0, 0].tick_params(axis='x', rotation=45)
+
+     # RMSE Context Relevance Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['RMSE_Context_Rel'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['RMSE_Context_Rel'].dropna(),
+         'Milvus + Mistral': milvus_mistral['RMSE_Context_Rel'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[0, 1])
+     axes[0, 1].set_title('RMSE Context Relevance')
+     axes[0, 1].tick_params(axis='x', rotation=45)
+
+     # RMSE Context Utilization Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['RMSE_Context_Util'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['RMSE_Context_Util'].dropna(),
+         'Milvus + Mistral': milvus_mistral['RMSE_Context_Util'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[1, 0])
+     axes[1, 0].set_title('RMSE Context Utilization')
+     axes[1, 0].tick_params(axis='x', rotation=45)
+
+     # AUROC Comparison
+     data = {
+         'Milvus + LLaMA': milvus_llama['AUCROC'].dropna(),
+         'Weaviate + Mistral': weaviate_mistral['AUCROC'].dropna(),
+         'Milvus + Mistral': milvus_mistral['AUCROC'].dropna()
+     }
+     sns.boxplot(data=pd.DataFrame(data), ax=axes[1, 1])
+     axes[1, 1].set_title('AUROC Scores')
+     axes[1, 1].tick_params(axis='x', rotation=45)
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/performance_comparison.png', dpi=300, bbox_inches='tight')
+     plt.close()
+
+ def create_correlation_heatmaps(milvus_llama, weaviate_mistral, milvus_mistral):
+     plt.figure(figsize=(20, 6))
+
+     # Create correlation heatmaps for each configuration
+     plt.subplot(1, 3, 1)
+     sns.heatmap(milvus_llama.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
+     plt.title('Milvus + LLaMA Correlations')
+
+     plt.subplot(1, 3, 2)
+     sns.heatmap(weaviate_mistral.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
+     plt.title('Weaviate + Mistral Correlations')
+
+     plt.subplot(1, 3, 3)
+     sns.heatmap(milvus_mistral.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
+     plt.title('Milvus + Mistral Correlations')
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/correlation_heatmaps.png', dpi=300, bbox_inches='tight')
+     plt.close()
+
+ def create_violin_plots(milvus_llama, weaviate_mistral, milvus_mistral):
+     metrics = ['RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']
+
+     plt.figure(figsize=(15, 5))
+     for i, metric in enumerate(metrics, 1):
+         plt.subplot(1, 3, i)
+         data = {
+             'Milvus + LLaMA': milvus_llama[metric].dropna(),
+             'Weaviate + Mistral': weaviate_mistral[metric].dropna(),
+             'Milvus + Mistral': milvus_mistral[metric].dropna()
+         }
+         sns.violinplot(data=pd.DataFrame(data))
+         plt.title(f'{metric} Distribution')
+         plt.xticks(rotation=45)
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/metric_distributions.png', dpi=300, bbox_inches='tight')
+     plt.close()
+
+ def print_summary_statistics(milvus_llama, weaviate_mistral, milvus_mistral):
+     print("\nSummary Statistics:")
+
+     print("\nMilvus + LLaMA:")
+     print(milvus_llama.describe().round(4))
+
+     print("\nWeaviate + Mistral:")
+     print(weaviate_mistral.describe().round(4))
+
+     print("\nMilvus + Mistral:")
+     print(milvus_mistral.describe().round(4))
+
+ def main():
+     # Create visualizations directory
+     os.makedirs("report/visualizations", exist_ok=True)
+
+     # Load data
+     milvus_llama, weaviate_mistral, milvus_mistral = load_and_preprocess_data("report/Scores for RAGBenchCapstone.xlsx")
+
+     # Create visualizations
+     create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral)
+     create_correlation_heatmaps(milvus_llama, weaviate_mistral, milvus_mistral)
+     create_violin_plots(milvus_llama, weaviate_mistral, milvus_mistral)
+
+     # Print statistics
+     print_summary_statistics(milvus_llama, weaviate_mistral, milvus_mistral)
+
+ if __name__ == "__main__":
+     main()
report/finalreport.md ADDED
@@ -0,0 +1,63 @@
+ Performance Analysis Report
+ ===========================
+
+ 1. **Retrieval Time**:
+    - Milvus + LLaMA: 0.132s
+    - Weaviate + Mistral: 0.157s
+    - Milvus + Mistral: NaN
+
+ 2. **Context Relevance** (higher is better):
+    - Milvus + LLaMA: 0.640
+    - Weaviate + Mistral: 0.591
+    - Milvus + Mistral: 0.518
+
+ 3. **Context Utilization** (higher is better):
+    - Milvus + LLaMA: 0.673
+    - Weaviate + Mistral: 0.619
+    - Milvus + Mistral: 0.614
+
+ 4. **AUCROC** (Area Under ROC Curve):
+    - Milvus + LLaMA: 0.912
+    - Weaviate + Mistral: 0.750
+    - Milvus + Mistral: 0.844
+
+ 5. **RMSE** (Root Mean Square Error):
+    - Milvus + LLaMA:
+      * Context Relevance RMSE: 0.179
+      * Context Utilization RMSE: 0.302
+    - Weaviate + Mistral:
+      * Context Relevance RMSE: 0.414
+      * Context Utilization RMSE: 0.482
+    - Milvus + Mistral:
+      * Context Relevance RMSE: 0.167
+      * Context Utilization RMSE: 0.258
+
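+ For reference, RMSE here denotes the usual root mean square deviation between the system's predicted score $\hat{s}_i$ and the annotated score $s_i$ over the $n$ evaluated queries:
+
+ $$\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}\left(\hat{s}_i - s_i\right)^2}$$
+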
+ Analysis
+ --------
+ 1. **Best Overall Performance: Milvus + LLaMA**
+    - Highest AUCROC score (0.912)
+    - Best context relevance (0.640) and utilization (0.673)
+    - Fast retrieval time (0.132s)
+    - Moderate RMSE scores
+
+ 2. **Runner-up: Milvus + Mistral**
+    - Second-best AUCROC (0.844)
+    - Lowest RMSE scores overall
+    - Lower context relevance and utilization
+    - Retrieval time data unavailable
+
+ 3. **Third Place: Weaviate + Mistral**
+    - Lowest AUCROC (0.750)
+    - Highest RMSE scores
+    - Slowest retrieval time (0.157s)
+    - Moderate context metrics
+
+ Recommendation
+ --------------
+ Based on the comprehensive analysis of all metrics, Milvus + LLaMA emerges as the optimal choice for overall performance. It demonstrates:
+ - Superior accuracy (highest AUCROC)
+ - Better context handling capabilities
+ - Efficient retrieval speed
+ - Reasonable error rates
+
+ However, if minimizing error (RMSE) is the primary objective, Milvus + Mistral could be a viable alternative due to its lower error rates in both context relevance and utilization metrics.
report/visualizations/correlation_heatmaps.png ADDED
report/visualizations/metric_distributions.png ADDED
report/visualizations/performance_comparison.png ADDED
report/visualizations/summary_statistics.txt ADDED
@@ -0,0 +1,34 @@
+ Summary Statistics:
+
+ Milvus + LLaMA:
+       RMSE_Context_Rel RMSE_Context_Util AUCROC Retrieval_Time Context_Relevance Context_Utilization
+ count 19.0000 19.0000 17.0000 19.0000 19.0000 19.0000
+ mean 0.1786 0.3022 0.9118 0.1322 0.6402 0.6729
+ std 0.2014 0.3444 0.1965 0.0288 0.2923 0.2889
+ min 0.0000 0.0008 0.5000 0.1008 0.0000 0.0000
+ 25% 0.0211 0.0559 1.0000 0.1145 0.4083 0.4583
+ 50% 0.1160 0.1033 1.0000 0.1233 0.6667 0.6667
+ 75% 0.2826 0.5837 1.0000 0.1348 0.9500 1.0000
+ max 0.5625 0.9654 1.0000 0.1954 1.0000 1.0000
+
+ Weaviate + Mistral:
+       Retrieval_Time Context_Rel Util Adherence RMSE_Context_Rel RMSE_Context_Util AUCROC
+ count 19.0000 18.0000 18.0000 18.0000 16.0000 16.0000 14.0000
+ mean 0.1565 0.5913 0.6190 0.3889 0.4139 0.4824 0.7500
+ std 0.0286 0.4559 0.4324 0.5016 0.3231 0.3404 0.2594
+ min 0.1085 0.0000 0.0000 0.0000 0.0035 0.0035 0.5000
+ 25% 0.1254 0.0357 0.1905 0.0000 0.1426 0.1702 0.5000
+ 50% 0.1685 0.8333 0.8333 0.0000 0.3212 0.5028 0.7500
+ 75% 0.1720 1.0000 1.0000 1.0000 0.7199 0.8521 1.0000
+ max 0.2132 1.0000 1.0000 1.0000 0.9481 0.9481 1.0000
+
+ Milvus + Mistral:
+       Retrieval_Time Context_Rel Util Adherence RMSE_Context_Rel RMSE_Context_Util AUCROC
+ count 0.0 19.0000 19.0000 18.0000 18.0000 18.0000 16.0000
+ mean NaN 0.5176 0.6144 0.7847 0.1665 0.2575 0.8438
+ std NaN 0.3481 0.3511 0.3878 0.1819 0.2662 0.2394
+ min NaN 0.0000 0.0000 0.0000 0.0017 0.0031 0.5000
+ 25% NaN 0.2917 0.3542 0.7188 0.0408 0.0625 0.5000
+ 50% NaN 0.5000 0.6111 1.0000 0.0808 0.1627 1.0000
+ 75% NaN 0.8397 1.0000 1.0000 0.2500 0.3448 1.0000
+ max NaN 1.0000 1.0000 1.0000 0.6049 0.8711 1.0000
requirements.txt CHANGED
@@ -5,4 +5,10 @@ huggingface_hub
  pymilvus
  nltk
  sentence-transformers
- Groq
+ Groq
+ pandas
+ openpyxl
+ matplotlib
+ seaborn
+ numpy
+