swaroop-uddandarao committed
Commit fed116a · 1 Parent(s): 408ab70

modified reports

report/Scores for RAGBenchCapstone.xlsx CHANGED
Binary files a/report/Scores for RAGBenchCapstone.xlsx and b/report/Scores for RAGBenchCapstone.xlsx differ
 
report/analyze_scores.py CHANGED
@@ -38,7 +38,7 @@ def create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral
        'Milvus + Mistral': milvus_mistral['Retrieval_Time'].dropna()
    }
    sns.boxplot(data=pd.DataFrame(data), ax=axes[0,0])
-     axes[0,0].set_title('Retrieval Time Comparison')
+     axes[0,0].set_title('VectorDB Retrieval Time Comparison')
    axes[0,0].set_ylabel('Time (seconds)')
    axes[0,0].tick_params(axis='x', rotation=45)

@@ -76,26 +76,121 @@ def create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral
    plt.savefig('report/visualizations/performance_comparison.png', dpi=300, bbox_inches='tight')
    plt.close()

- def create_correlation_heatmaps(milvus_llama, weaviate_mistral, milvus_mistral):
-     plt.figure(figsize=(20, 6))
-
-     # Create correlation heatmaps for each configuration
-     plt.subplot(1, 3, 1)
-     sns.heatmap(milvus_llama.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
-     plt.title('Milvus + LLaMA Correlations')
-
-     plt.subplot(1, 3, 2)
-     sns.heatmap(weaviate_mistral.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
-     plt.title('Weaviate + Mistral Correlations')
-
-     plt.subplot(1, 3, 3)
-     sns.heatmap(milvus_mistral.corr(), annot=True, cmap='coolwarm', fmt='.2f', square=True)
-     plt.title('Milvus + Mistral Correlations')
+ def create_correlation_plots(milvus_llama, weaviate_mistral, milvus_mistral):
+     # Create separate plots for each model
+
+     # 1. Milvus + LLaMA
+     plt.figure(figsize=(15, 10))
+
+     # Relevance comparison
+     plt.subplot(2, 1, 1)
+     plt.plot(range(len(milvus_llama)), milvus_llama['RMSE_Context_Rel'], 'o--',
+              color='red', label='RMSE Context Relevance', linewidth=2, alpha=0.7)
+     plt.plot(range(len(milvus_llama)), milvus_llama['Context_Relevance'], 'o-',
+              color='darkred', label='Context Relevance', linewidth=2, alpha=0.7)
+     plt.title('Milvus + LLaMA: Context Relevance vs RMSE')
+     plt.xlabel('Data Points')
+     plt.ylabel('Score')
+     plt.grid(True, linestyle='--', alpha=0.7)
+     plt.legend()
+
+     # Utilization comparison
+     plt.subplot(2, 1, 2)
+     plt.plot(range(len(milvus_llama)), milvus_llama['RMSE_Context_Util'], 'o--',
+              color='blue', label='RMSE Context Utilization', linewidth=2, alpha=0.7)
+     plt.plot(range(len(milvus_llama)), milvus_llama['Context_Utilization'], 'o-',
+              color='darkblue', label='Context Utilization', linewidth=2, alpha=0.7)
+     plt.title('Milvus + LLaMA: Context Utilization vs RMSE')
+     plt.xlabel('Data Points')
+     plt.ylabel('Score')
+     plt.grid(True, linestyle='--', alpha=0.7)
+     plt.legend()
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/milvus_llama_plots.png', bbox_inches='tight', dpi=300)
+     plt.close()
+
+     # 2. Weaviate + Mistral
+     plt.figure(figsize=(15, 10))
+
+     # Relevance comparison
+     plt.subplot(2, 1, 1)
+     plt.plot(range(len(weaviate_mistral)), weaviate_mistral['RMSE_Context_Rel'], 'o--',
+              color='red', label='RMSE Context Relevance', linewidth=2, alpha=0.7)
+     plt.plot(range(len(weaviate_mistral)), weaviate_mistral['Context_Rel'], 'o-',
+              color='darkred', label='Context Relevance', linewidth=2, alpha=0.7)
+     plt.title('Weaviate + Mistral: Context Relevance vs RMSE')
+     plt.xlabel('Data Points')
+     plt.ylabel('Score')
+     plt.grid(True, linestyle='--', alpha=0.7)
+     plt.legend()
+
+     # Utilization comparison
+     plt.subplot(2, 1, 2)
+     plt.plot(range(len(weaviate_mistral)), weaviate_mistral['RMSE_Context_Util'], 'o--',
+              color='blue', label='RMSE Context Utilization', linewidth=2, alpha=0.7)
+     plt.plot(range(len(weaviate_mistral)), weaviate_mistral['Util'], 'o-',
+              color='darkblue', label='Context Utilization', linewidth=2, alpha=0.7)
+     plt.title('Weaviate + Mistral: Context Utilization vs RMSE')
+     plt.xlabel('Data Points')
+     plt.ylabel('Score')
+     plt.grid(True, linestyle='--', alpha=0.7)
+     plt.legend()
+
+     plt.tight_layout()
+     plt.savefig('report/visualizations/weaviate_mistral_plots.png', bbox_inches='tight', dpi=300)
+     plt.close()
+
+     # 3. Milvus + Mistral
+     plt.figure(figsize=(15, 10))
+
+     # Relevance comparison
+     plt.subplot(2, 1, 1)
+     plt.plot(range(len(milvus_mistral)), milvus_mistral['RMSE_Context_Rel'], 'o--',
+              color='red', label='RMSE Context Relevance', linewidth=2, alpha=0.7)
+     plt.plot(range(len(milvus_mistral)), milvus_mistral['Context_Rel'], 'o-',
+              color='darkred', label='Context Relevance', linewidth=2, alpha=0.7)
+     plt.title('Milvus + Mistral: Context Relevance vs RMSE')
+     plt.xlabel('Data Points')
+     plt.ylabel('Score')
+     plt.grid(True, linestyle='--', alpha=0.7)
+     plt.legend()
+
+     # Utilization comparison
+     plt.subplot(2, 1, 2)
+     plt.plot(range(len(milvus_mistral)), milvus_mistral['RMSE_Context_Util'], 'o--',
+              color='blue', label='RMSE Context Utilization', linewidth=2, alpha=0.7)
+     plt.plot(range(len(milvus_mistral)), milvus_mistral['Util'], 'o-',
+              color='darkblue', label='Context Utilization', linewidth=2, alpha=0.7)
+     plt.title('Milvus + Mistral: Context Utilization vs RMSE')
+     plt.xlabel('Data Points')
+     plt.ylabel('Score')
+     plt.grid(True, linestyle='--', alpha=0.7)
+     plt.legend()

    plt.tight_layout()
-     plt.savefig('report/visualizations/correlation_heatmaps.png', dpi=300, bbox_inches='tight')
+     plt.savefig('report/visualizations/milvus_mistral_plots.png', bbox_inches='tight', dpi=300)
    plt.close()

+     # Print statistical analysis for each model
+     print("\nStatistical Analysis:")
+
+     models = {
+         'Milvus + LLaMA': (milvus_llama['RMSE_Context_Rel'], milvus_llama['Context_Relevance'],
+                            milvus_llama['RMSE_Context_Util'], milvus_llama['Context_Utilization']),
+         'Weaviate + Mistral': (weaviate_mistral['RMSE_Context_Rel'], weaviate_mistral['Context_Rel'],
+                                weaviate_mistral['RMSE_Context_Util'], weaviate_mistral['Util']),
+         'Milvus + Mistral': (milvus_mistral['RMSE_Context_Rel'], milvus_mistral['Context_Rel'],
+                              milvus_mistral['RMSE_Context_Util'], milvus_mistral['Util'])
+     }
+
+     for model, (rmse_rel, rel, rmse_util, util) in models.items():
+         print(f"\n{model}:")
+         print(f"Context Relevance - Mean: {rel.mean():.3f}, Std: {rel.std():.3f}")
+         print(f"RMSE Context Rel - Mean: {rmse_rel.mean():.3f}, Std: {rmse_rel.std():.3f}")
+         print(f"Context Utilization - Mean: {util.mean():.3f}, Std: {util.std():.3f}")
+         print(f"RMSE Context Util - Mean: {rmse_util.mean():.3f}, Std: {rmse_util.std():.3f}")
+
def create_violin_plots(milvus_llama, weaviate_mistral, milvus_mistral):
    metrics = ['RMSE_Context_Rel', 'RMSE_Context_Util', 'AUCROC']

@@ -137,7 +232,7 @@ def main():

    # Create visualizations
    create_performance_comparison(milvus_llama, weaviate_mistral, milvus_mistral)
-     create_correlation_heatmaps(milvus_llama, weaviate_mistral, milvus_mistral)
+     create_correlation_plots(milvus_llama, weaviate_mistral, milvus_mistral)
    create_violin_plots(milvus_llama, weaviate_mistral, milvus_mistral)

    # Print statistics
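
The new `create_correlation_plots` entry point can also be exercised on its own. Here is a minimal driver sketch, assuming the three configurations live on separate sheets of `report/Scores for RAGBenchCapstone.xlsx`; the sheet names and import path are placeholders, and the script's own `main()` does the real loading:

```python
# Hypothetical standalone driver; sheet names and the import path are assumptions.
import pandas as pd

from analyze_scores import create_correlation_plots  # assumed import path

xlsx = "report/Scores for RAGBenchCapstone.xlsx"
milvus_llama = pd.read_excel(xlsx, sheet_name="Milvus_LLaMA")          # assumed sheet name
weaviate_mistral = pd.read_excel(xlsx, sheet_name="Weaviate_Mistral")  # assumed sheet name
milvus_mistral = pd.read_excel(xlsx, sheet_name="Milvus_Mistral")      # assumed sheet name

# Writes one line-plot figure per configuration under report/visualizations/
# and prints mean/std statistics for each metric.
create_correlation_plots(milvus_llama, weaviate_mistral, milvus_mistral)
```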
report/architecture.md ADDED
@@ -0,0 +1,154 @@
+ # RAG Benchmark Evaluation System Architecture
+
+ ## High-Level Architecture Overview
+
+ The system follows a modular architecture with the following key components:
+
+ ### 1. Data Layer
+
+ - **Dataset Loading** (loaddataset.py)
+
+   - Handles RAGBench dataset loading from HuggingFace
+   - Processes multiple dataset configurations
+   - Extracts and normalizes data
+
+ - **Vector Database** (Milvus)
+   - Stores document embeddings
+   - Enables efficient similarity search
+   - Manages metadata and scores
+
+ ### 2. Processing Layer
+
+ - **Document Processing**
+
+   - Chunking (insertmilvushelper.py)
+   - Sliding window implementation
+   - Overlap management
+
+ - **Embedding Generation**
+   - SentenceTransformer models
+   - Vector representation creation
+   - Dimension reduction
+
+ ### 3. Search & Retrieval Layer
+
+ - **Vector Search** (searchmilvushelper.py)
+
+   - Cosine similarity computation
+   - Top-K retrieval
+   - Result ranking
+
+ - **Reranking System** (finetuneresults.py)
+   - Multiple reranker options (MS MARCO, MonoT5)
+   - Context relevance scoring
+   - Result refinement
+
+ ### 4. Generation Layer
+
+ - **LLM Integration** (generationhelper.py)
+   - Multiple model support (LLaMA, Mistral)
+   - Context-aware response generation
+   - Prompt engineering
+
+ ### 5. Evaluation Layer
+
+ - **Metrics Calculation** (calculatescores.py)
+   - RMSE computation
+   - AUCROC calculation
+   - Context relevance/utilization scoring
+
+ ### 6. Presentation Layer
+
+ - **Web Interface** (app.py)
+   - Gradio-based UI
+   - Interactive model selection
+   - Real-time result display
+
+ ## Data Flow
+
+ 1. User submits query through Gradio interface
+ 2. Query is embedded and searched in Milvus
+ 3. Retrieved documents are reranked
+ 4. LLM generates response using context
+ 5. Response is evaluated and scored
+ 6. Results are displayed to user
+
+ ## Architecture Diagram
+
+ ```mermaid
+ graph TB
+     %% User Interface Layer
+     UI[Web Interface - Gradio]
+
+     %% Data Layer
+     subgraph Data Layer
+         DS[RAGBench Dataset]
+         VDB[(Milvus Vector DB)]
+     end
+
+     %% Processing Layer
+     subgraph Processing Layer
+         DP[Document Processing]
+         EG[Embedding Generation]
+         style DP fill:#f9f,stroke:#333
+         style EG fill:#f9f,stroke:#333
+     end
+
+     %% Search & Retrieval Layer
+     subgraph Search & Retrieval
+         VS[Vector Search]
+         RR[Reranking System]
+         style VS fill:#bbf,stroke:#333
+         style RR fill:#bbf,stroke:#333
+     end
+
+     %% Generation Layer
+     subgraph Generation Layer
+         LLM[LLM Models]
+         PR[Prompt Engineering]
+         style LLM fill:#bfb,stroke:#333
+         style PR fill:#bfb,stroke:#333
+     end
+
+     %% Evaluation Layer
+     subgraph Evaluation Layer
+         ME[Metrics Evaluation]
+         SC[Score Calculation]
+         style ME fill:#ffb,stroke:#333
+         style SC fill:#ffb,stroke:#333
+     end
+
+     %% Flow Connections
+     UI --> DP
+     DS --> DP
+     DP --> EG
+     EG --> VDB
+     UI --> VS
+     VS --> VDB
+     VS --> RR
+     RR --> LLM
+     LLM --> PR
+     PR --> ME
+     ME --> SC
+     SC --> UI
+
+     %% Model Components
+     subgraph Models
+         ST[SentenceTransformers]
+         RM[Reranking Models]
+         GM[Generation Models]
+         style ST fill:#dfd,stroke:#333
+         style RM fill:#dfd,stroke:#333
+         style GM fill:#dfd,stroke:#333
+     end
+
+     %% Model Connections
+     EG --> ST
+     RR --> RM
+     LLM --> GM
+
+     %% Styling
+     classDef default fill:#fff,stroke:#333,stroke-width:2px;
+     classDef interface fill:#f96,stroke:#333,stroke-width:2px;
+     class UI interface;
+ ```
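
The retrieval steps in the data flow above (embed the query, search Milvus, rerank) can be sketched compactly. The collection, field, and model names below are placeholders rather than the repository's actual configuration; the real logic lives in searchmilvushelper.py, finetuneresults.py, and generationhelper.py:

```python
# Simplified sketch of the query path; identifiers marked "assumed" are placeholders.
from pymilvus import Collection, connections
from sentence_transformers import CrossEncoder, SentenceTransformer

embedder = SentenceTransformer("all-MiniLM-L6-v2")               # assumed embedding model
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")  # one MS MARCO reranker option

connections.connect(host="localhost", port="19530")              # assumed Milvus endpoint
collection = Collection("ragbench_documents")                    # assumed collection name

def retrieve_context(query: str, top_k: int = 5) -> str:
    # 1) Embed the query and run a cosine-similarity search in Milvus
    query_vec = embedder.encode(query).tolist()
    hits = collection.search(
        data=[query_vec],
        anns_field="embedding",                                   # assumed vector field name
        param={"metric_type": "COSINE", "params": {"nprobe": 10}},
        limit=top_k,
        output_fields=["text"],                                   # assumed payload field name
    )[0]
    docs = [hit.entity.get("text") for hit in hits]

    # 2) Rerank the retrieved chunks by query-document relevance
    scores = reranker.predict([(query, doc) for doc in docs])
    ranked = [doc for _, doc in sorted(zip(scores, docs), key=lambda p: p[0], reverse=True)]

    # 3) The top chunks become the context passed to the LLM prompt
    #    (response generation and scoring are handled elsewhere in the pipeline)
    return "\n\n".join(ranked[:3])
```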
report/finalreport.md CHANGED
@@ -1,46 +1,51 @@
- Performance Analysis Report
- =========================
+ # Performance Analysis Report

 1. **Retrieval Time**:
+
   - Milvus + LLaMA: 0.132s
   - Weaviate + Mistral: 0.157s
   - Milvus + Mistral: NaN

 2. **Context Relevance** (higher is better):
+
   - Milvus + LLaMA: 0.640
   - Weaviate + Mistral: 0.591
   - Milvus + Mistral: 0.518

 3. **Context Utilization** (higher is better):
+
   - Milvus + LLaMA: 0.673
   - Weaviate + Mistral: 0.619
   - Milvus + Mistral: 0.614

 4. **AUCROC** (Area Under ROC Curve):
+
   - Milvus + LLaMA: 0.912
   - Weaviate + Mistral: 0.750
   - Milvus + Mistral: 0.844

 5. **RMSE** (Root Mean Square Error):
-   - Milvus + LLaMA:
-     * Context Relevance RMSE: 0.179
-     * Context Utilization RMSE: 0.302
+   - Milvus + LLaMA:
+     - Context Relevance RMSE: 0.179
+     - Context Utilization RMSE: 0.302
   - Weaviate + Mistral:
-     * Context Relevance RMSE: 0.414
-     * Context Utilization RMSE: 0.482
+     - Context Relevance RMSE: 0.414
+     - Context Utilization RMSE: 0.482
   - Milvus + Mistral:
-     * Context Relevance RMSE: 0.167
-     * Context Utilization RMSE: 0.258
+     - Context Relevance RMSE: 0.167
+     - Context Utilization RMSE: 0.258
+
- Analysis
- --------
+ ## Analysis

 1. **Best Overall Performance: Milvus + LLaMA**
+
   - Highest AUCROC score (0.912)
   - Best context relevance (0.640) and utilization (0.673)
   - Fast retrieval time (0.132s)
   - Moderate RMSE scores

 2. **Runner-up: Milvus + Mistral**
+
   - Second-best AUCROC (0.844)
   - Lowest RMSE scores overall
   - Lower context relevance and utilization

@@ -52,12 +57,13 @@ Analysis
   - Slowest retrieval time (0.157s)
   - Moderate context metrics

- Recommendation
- -------------
+ ## Recommendation
+
 Based on the comprehensive analysis of all metrics, Milvus + LLaMA emerges as the optimal choice for overall performance. It demonstrates:
+
 - Superior accuracy (highest AUCROC)
 - Better context handling capabilities
 - Efficient retrieval speed
 - Reasonable error rates

- However, if minimizing error (RMSE) is the primary objective, Milvus + Mistral could be a viable alternative due to its lower error rates in both context relevance and utilization metrics.
+ However, if minimizing error (RMSE) is the primary objective, Milvus + Mistral could be a viable alternative due to its lower error rates in both context relevance and utilization metrics.
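
For reference, metrics like the RMSE and AUCROC values reported above are typically computed from per-sample predicted and ground-truth scores (the project's actual numbers come from calculatescores.py); a generic sketch with toy values:

```python
# Generic metric sketch with toy values -- not the report's data.
import numpy as np
from sklearn.metrics import mean_squared_error, roc_auc_score

def rmse(y_true, y_pred):
    """Root Mean Square Error, e.g. predicted vs. ground-truth context relevance."""
    return float(np.sqrt(mean_squared_error(y_true, y_pred)))

true_relevance = np.array([0.9, 0.4, 0.7, 0.2])
pred_relevance = np.array([0.8, 0.5, 0.6, 0.3])
print(f"Context Relevance RMSE: {rmse(true_relevance, pred_relevance):.3f}")

# AUCROC scores how well a predicted score separates a binary label
# (e.g. relevant vs. not relevant documents).
true_labels = np.array([1, 0, 1, 0])
pred_scores = np.array([0.85, 0.30, 0.65, 0.40])
print(f"AUCROC: {roc_auc_score(true_labels, pred_scores):.3f}")
```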
report/visualizations/correlation_heatmaps.png DELETED
Binary file (561 kB)
 
report/visualizations/milvus_llama_plots.png ADDED
report/visualizations/milvus_mistral_plots.png ADDED
report/visualizations/performance_comparison.png CHANGED
report/visualizations/weaviate_mistral_plots.png ADDED