Spaces: ybchen928/oncall-guide-ai

YanBoChen committed 24 days ago

Commit fbad237

1 Parent(s): f24fd2b

Re-check multilevel fallback validation test suite for OnCall.ai

- Implemented a comprehensive test suite to validate the 5-level fallback mechanism for medical queries.
- Included tests for predefined mappings, LLM extraction, semantic search, medical query validation, and generic medical search.
- Added detailed logging and reporting features to track test execution and results.
- Created a structured approach to initialize components, run tests, and generate reports with success rates and execution times.
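
The reporting step described in the last bullet can be sketched roughly as below. This is a minimal illustration, not the code in `tests/test_multilevel_fallback_validation.py`: the `CaseResult` record and the `summarize` function are hypothetical names. The sample rows are the outcomes and execution times of the first nine cases in the test log included in this commit.

```python
from dataclasses import dataclass
from statistics import mean
from typing import Dict, List


@dataclass
class CaseResult:
    """One test case outcome (hypothetical record, not the suite's own type)."""
    case_id: str
    passed: bool
    seconds: float


def summarize(results: List[CaseResult]) -> Dict[str, float]:
    """Return total, pass count, success rate in percent, and mean time."""
    total = len(results)
    passed = sum(1 for r in results if r.passed)
    return {
        "total": total,
        "passed": passed,
        "success_rate": 100.0 * passed / total if total else 0.0,
        "mean_seconds": mean(r.seconds for r in results) if results else 0.0,
    }


# Outcomes and timings copied from the first nine cases in the test log.
RESULTS = [
    CaseResult("level1_001", True, 0.000),
    CaseResult("level1_002", True, 0.000),
    CaseResult("level1_003", True, 0.000),
    CaseResult("level2_001", True, 23.008),
    CaseResult("level2_002", False, 22.223),
    CaseResult("level3_001", False, 11.647),
    CaseResult("level3_002", False, 10.398),
    CaseResult("level4a_001", False, 20.900),
    CaseResult("level4a_002", False, 22.107),
]

report = summarize(RESULTS)
print(
    f"{report['passed']}/{report['total']} passed "
    f"({report['success_rate']:.1f}%), mean {report['mean_seconds']:.3f}s"
)
```

Treating `PARTIAL` as a failure is an assumption made here for simplicity; a real report might track it as a third state.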

Files changed (5)

src/Todo_20250731_Multlevel_Fallback_Fix.md +83 -0
src/user_prompt.py +10 -3
tests/result_of_test_multlevel_allback_validation.md +570 -0
tests/{result_of_test_userinput_userprompt_medical_condition_llm.txt → result_of_test_userinput_userprompt_medical_condition_llm.md} +0 -0
tests/test_multilevel_fallback_validation.py +537 -0

src/Todo_20250731_Multlevel_Fallback_Fix.md ADDED

	@@ -0,0 +1,83 @@

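+Great! Let's go through these detailed test results together; the data is very valuable.
+## 📊 **In-Depth Analysis of the Test Results**
+### **✅ What Worked (53.8% success rate)**
+**Level 1 - Predefined mapping**: 💯 **Perfect**
+- 3/3 tests passed, 0.000s on average (instant response)
+- `acute myocardial infarction`, `acute stroke`, and `pulmonary embolism` all matched directly
+**Level 4b→5 - Rare medical queries**: 💯 **Perfect**
+- 3/3 tests passed and correctly reached generic search
+- Rare hematologic disease, ITP, and necrotizing fasciitis were all handled correctly
+### **🔍 Key Issues Found**
+#### **Issue 1: Level 4 validation is not working** ❌
+**Symptom**: Non-medical queries (cooking, programming, weather) are all treated as medical queries
+```
+- "how to cook pasta properly?" → Level 5 (should be rejected)
+- "programming language" → Level 5 (should be rejected)
+- "weather forecast" → Level 5 (should be rejected)
+```
+**Root cause**: the logic in `validate_medical_query` is flawed
+- Even when the LLM says "this is not a medical query", the function still returns `None` (meaning validation passed)
+- It should check whether the LLM response explicitly says the query is non-medical
+#### **Issue 2: Level 3 semantic search logic** ⚠️
+**Symptom**: Queries expected to resolve at Level 3 all fell through to Level 5
+```
+- "emergency management of cardiovascular crisis" → Level 5 (expected Level 3)
+- "urgent neurological intervention protocols" → Level 5 (expected Level 3)
+```
+**Cause**: the `_infer_condition_from_text` method may be too strict to infer a valid condition
+#### **Issue 3: Inconsistent Level 2 behavior** ⚠️
+**Symptom**:
+- `level2_001` passed, but it was caught by Level 1 (the LLM extracted a known condition)
+- `level2_002` failed: the LLM extracted a condition, but validation failed
+## 🛠️ **Fix Priorities**
+### **Priority 1: Fix validate_medical_query**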
+```python
+def validate_medical_query(self, user_query: str) -> Optional[Dict[str, Any]]:
+    # Assumes llama_result already holds the LLM client's response for user_query
+    # Check whether the LLM response explicitly says the query is non-medical
+    if llama_result.get('extracted_condition'):
+        response_text = llama_result.get('raw_response', '').lower()
+        # Check for an explicit rejection of the query as non-medical
+        rejection_phrases = [
+            "not a medical condition",
+            "outside my medical scope",
+            "unrelated to medical conditions",
+            "do not address"
+        ]
+        if any(phrase in response_text for phrase in rejection_phrases):
+            return self._generate_invalid_query_response()
+        return None  # Validation passed
+```
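+### **Priority 2: Improve condition inference in semantic search**
+The similarity threshold in `_infer_condition_from_text` may be too high (0.7); consider lowering it to 0.5
+### **Priority 3: Tune validation of Level 2 LLM extraction**
+Make sure `validate_condition` can correctly handle complex LLM responses
+## 🎯 **Overall Assessment**
+### **Speed**: ⭐⭐⭐⭐⭐
+- Level 1: instant response (0.000s)
+- Average: 14.4s (mostly due to LLM calls)
+### **Accuracy**: ⭐⭐⭐
+- Predefined conditions: 100% accurate
+- Rare medical queries: 100% accurate
+- Non-medical rejection: 0% accurate ← **needs an immediate fix**
+Would you like me to fix the `validate_medical_query` logic first? It is the most critical issue; once it is resolved, the overall success rate should rise to 80%+.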
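
One caveat on the Priority 1 snippet: exact substring matching on its four phrases would miss some of the rejections recorded in the test log, for example "I don't address cooking techniques" (contraction) and "out of my medical scope" (different preposition). A slightly more tolerant check might look like the sketch below. The function name `looks_like_rejection` and the regular expressions are assumptions derived only from the handful of logged responses, not part of `user_prompt.py`, and they would need validation against a wider sample.

```python
import re

# Patterns are assumptions derived from rejections seen in the test log;
# they are illustrative, not an exhaustive or validated list.
_REJECTION_PATTERNS = [
    re.compile(r"\b(?:do not|don't) (?:address|analyze)\b", re.IGNORECASE),
    re.compile(r"\bnot (?:about )?a medical condition\b", re.IGNORECASE),
    re.compile(r"\b(?:outside|out of) my (?:bio)?medical scope\b", re.IGNORECASE),
    re.compile(
        r"\bunrelated to (?:any )?(?:medical|health) conditions?\b",
        re.IGNORECASE,
    ),
]


def looks_like_rejection(raw_response: str) -> bool:
    """Return True if the LLM response appears to decline a non-medical query."""
    return any(p.search(raw_response) for p in _REJECTION_PATTERNS)


# Excerpts copied from raw LLM responses in the test log.
REJECTED = [
    "As a medical assistant, I don't address cooking techniques, "
    "but for context (not medical advice)",
    "As a medical assistant, I don't analyze technology trends or recommend "
    "programming languages; however, for clarity's sake "
    "(though out of my medical scope)",
    "As a medical assistant, I do not address weather forecasts; however, for "
    "context clarification, this query is unrelated to medical conditions.",
]
ACCEPTED = [
    "Cardiac Arrest (or, in context of crisis not yet arrest: "
    "Acute Cardiogenic Emergency, e.g., STEMI)",
    "However, please note that as an AI assistant, I don't diagnose; this "
    "interpretation is based on common clinical presentation.",
]

for text in REJECTED + ACCEPTED:
    print(looks_like_rejection(text), "|", text[:60])
```

Phrase matching on free-form LLM text stays brittle either way. Asking the model for an explicit medical/non-medical label would likely be more robust, but that is a design change beyond this note.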

src/user_prompt.py CHANGED

@@ -61,6 +61,7 @@ class UserPromptProcessor:
         Returns:
             Dict with condition and keywords
         """
         # Level 1: Predefined Mapping (Fast Path)
         predefined_result = self._predefined_mapping(user_query)
         if predefined_result:
@@ -77,12 +78,19 @@ class UserPromptProcessor:
         if semantic_result:
             return semantic_result
-        # Level 4: Generic Medical Search
+        # Level 4: Medical Query Validation
+        # Only validate if previous levels failed - speed optimization
+        validation_result = self.validate_medical_query(user_query)
+        if validation_result:  # If validation fails (returns non-None)
+            return validation_result
+        # Level 5: Generic Medical Search (after validation passes)
         generic_result = self._generic_medical_search(user_query)
         if generic_result:
             return generic_result
         # No match found
         return {
             'condition': '',
             'emergency_keywords': '',
@@ -230,8 +238,7 @@ class UserPromptProcessor:
                     'generic_confidence': 0.5
                 }
-            return None
+            return None
         except Exception as e:
             logger.error(f"Generic medical search error: {e}")
             return None
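
The reordered chain in this diff can be sketched as a standalone example. Only two things are taken from the diff: the ordering (validation now sits between the earlier levels and generic search) and the convention that validation returns a non-`None` payload when it rejects a query. Everything else is hypothetical: the stub handlers, the `MEDICAL_HINTS` keyword check standing in for the LLM call, and the `process` helper. Levels 2 and 3 are omitted for brevity.

```python
from typing import Callable, Dict, List, Optional, Tuple

Result = Dict[str, str]
Handler = Callable[[str], Optional[Result]]

# Hypothetical stand-ins for the real mapping table and the LLM-based check.
PREDEFINED = {"acute stroke": "stroke|neurological deficit|sudden weakness"}
MEDICAL_HINTS = ("pain", "treatment", "syndrome", "disease")


def predefined_mapping(query: str) -> Optional[Result]:
    """Level 1: return a result when a known condition appears in the query."""
    for condition, keywords in PREDEFINED.items():
        if condition in query.lower():
            return {"condition": condition, "emergency_keywords": keywords}
    return None


def validate_medical_query(query: str) -> Optional[Result]:
    """Level 4: return a rejection payload, or None when the query is valid."""
    if any(hint in query.lower() for hint in MEDICAL_HINTS):
        return None  # validation passed, fall through to the next level
    return {"type": "invalid_query", "message": "Please ask a medical question."}


def generic_medical_search(query: str) -> Optional[Result]:
    """Level 5: last-resort generic result."""
    return {"condition": "generic medical query"}


def process(query: str, levels: List[Tuple[str, Handler]]) -> Tuple[str, Result]:
    """Run handlers in order and stop at the first non-None result."""
    for name, handler in levels:
        result = handler(query)
        if result is not None:
            return name, result
    return "no_match", {"condition": ""}


LEVELS: List[Tuple[str, Handler]] = [
    ("level_1", predefined_mapping),
    ("level_4", validate_medical_query),
    ("level_5", generic_medical_search),
]

for q in ("how to manage acute stroke?", "weather forecast for tomorrow"):
    print(q, "->", process(q, LEVELS)[0])
```

Because validation signals failure with a non-`None` return, it fits the same first-non-`None`-wins loop as the other levels, at the cost of an inverted meaning compared with the handlers around it.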

tests/result_of_test_multlevel_allback_validation.md ADDED

	@@ -0,0 +1,570 @@

+🏥 OnCall.ai Multilevel Fallback Validation Test
+============================================================
+🔧 Initializing Components for Multilevel Fallback Test...
+------------------------------------------------------------
+1. Initializing Llama3-Med42-70B Client...
+2025-07-31 07:12:17,625 - llm_clients - INFO - Medical LLM client initialized with model: m42-health/Llama3-Med42-70B
+2025-07-31 07:12:17,626 - llm_clients - WARNING - Medical LLM Model: Research tool only. Not for professional medical diagnosis.
+   ✅ LLM client initialized
+2. Initializing Retrieval System...
+2025-07-31 07:12:17,626 - retrieval - INFO - Initializing retrieval system...
+2025-07-31 07:12:17,637 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
+2025-07-31 07:12:17,637 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: NeuML/pubmedbert-base-embeddings
+2025-07-31 07:12:20,936 - retrieval - INFO - Embedding model loaded successfully
+2025-07-31 07:12:22,314 - retrieval - INFO - Chunks loaded successfully
+2025-07-31 07:12:22,418 - retrieval - INFO - Embeddings loaded successfully
+2025-07-31 07:12:22,419 - retrieval - INFO - Loaded existing emergency index
+2025-07-31 07:12:22,420 - retrieval - INFO - Loaded existing treatment index
+2025-07-31 07:12:22,420 - retrieval - INFO - Retrieval system initialized successfully
+   ✅ Retrieval system initialized
+3. Initializing User Prompt Processor...
+2025-07-31 07:12:22,420 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
+2025-07-31 07:12:22,420 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: NeuML/pubmedbert-base-embeddings
+2025-07-31 07:12:24,622 - user_prompt - INFO - UserPromptProcessor initialized
+   ✅ User prompt processor initialized
+🎉 All components initialized successfully!
+🚀 Starting Multilevel Fallback Test Suite
+Total test cases: 13
+Test started at: 2025-07-31 07:12:17
+================================================================================
+🔍 level1_001: Level 1: Direct predefined condition match
+Query: 'acute myocardial infarction treatment'
+Expected Level: 1
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:12:24,623 - user_prompt - INFO - Matched predefined condition: acute myocardial infarction
+   ✅ Detected Level: 1
+   Condition: acute myocardial infarction
+   Emergency Keywords: MI|chest pain|cardiac arrest
+   Treatment Keywords: aspirin|nitroglycerin|thrombolytic|PCI
+   Execution Time: 0.000s
+   🎉 Test PASSED - Expected behavior achieved
+🔍 level1_002: Level 1: Predefined stroke condition
+Query: 'how to manage acute stroke?'
+Expected Level: 1
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:12:24,623 - user_prompt - INFO - Matched predefined condition: acute stroke
+   ✅ Detected Level: 1
+   Condition: acute stroke
+   Emergency Keywords: stroke|neurological deficit|sudden weakness
+   Treatment Keywords: tPA|thrombolysis|stroke unit care
+   Execution Time: 0.000s
+   🎉 Test PASSED - Expected behavior achieved
+🔍 level1_003: Level 1: Predefined PE condition
+Query: 'pulmonary embolism emergency protocol'
+Expected Level: 1
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:12:24,623 - user_prompt - INFO - Matched predefined condition: pulmonary embolism
+   ✅ Detected Level: 1
+   Condition: pulmonary embolism
+   Emergency Keywords: chest pain|shortness of breath|sudden dyspnea
+   Treatment Keywords: anticoagulation|heparin|embolectomy
+   Execution Time: 0.000s
+   🎉 Test PASSED - Expected behavior achieved
+🔍 level2_001: Level 2: Symptom-based query requiring LLM analysis
+Query: 'patient with severe crushing chest pain radiating to left arm'
+Expected Level: 2
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:12:24,623 - llm_clients - INFO - Calling Medical LLM with query: patient with severe crushing chest pain radiating to left arm
+2025-07-31 07:12:47,629 - llm_clients - INFO - Raw LLM Response: Acute Myocardial Infarction (STEMI) - considering "severe crushing chest pain" and radiation to the left arm, which are classic symptoms of a heart attack specifically involving ST-elevation (STEMI type), indicating complete blockage of a coronary artery. However, please note that as an AI assistant, I don't diagnose; this interpretation is based on common clinical presentation. A healthcare provider should perform an ECG and other tests for confirmation.
+2025-07-31 07:12:47,630 - llm_clients - INFO - Query Latency: 23.0064 seconds
+2025-07-31 07:12:47,630 - llm_clients - INFO - Extracted Condition: acute myocardial infarction
+   ✅ Detected Level: 1
+   Condition: acute myocardial infarction
+   Emergency Keywords: MI|chest pain|cardiac arrest
+   Treatment Keywords: aspirin|nitroglycerin|thrombolytic|PCI
+   Execution Time: 23.008s
+   🎉 Test PASSED - Expected behavior achieved
+🔍 level2_002: Level 2: Neurological symptoms requiring LLM
+Query: 'sudden onset weakness on right side with speech difficulty'
+Expected Level: 2
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:12:47,631 - llm_clients - INFO - Calling Medical LLM with query: sudden onset weakness on right side with speech difficulty
+2025-07-31 07:12:56,760 - llm_clients - INFO - Raw LLM Response: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+- This diagnosis represents the most likely condition given the sudden onset of right-sided weakness (hemiparesis) and speech difficulty (aphasia). An ischemic stroke occurs when blood flow to a part of the brain is blocked, typically by a thrombus or embolus, causing damage to brain tissue and resulting in neurological deficits. Immediate medical
+2025-07-31 07:12:56,760 - llm_clients - INFO - Query Latency: 9.1288 seconds
+2025-07-31 07:12:56,760 - llm_clients - INFO - Extracted Condition: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+2025-07-31 07:12:56,760 - user_prompt - INFO - Starting semantic search fallback for query: 'sudden onset weakness on right side with speech difficulty'
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  1.66it/s]
+2025-07-31 07:12:58,013 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:12:58,023 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.88it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.77it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.88it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.68it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.51it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.08it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.75it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.98it/s]
+2025-07-31 07:12:58,342 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:12:58,342 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:12:58,342 - user_prompt - INFO - No suitable condition found in semantic search
+2025-07-31 07:12:58,342 - llm_clients - INFO - Calling Medical LLM with query: sudden onset weakness on right side with speech difficulty
+2025-07-31 07:13:09,255 - llm_clients - INFO - Raw LLM Response: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+- This diagnosis represents the most likely condition given the sudden onset of right-sided weakness (hemiparesis) and speech difficulty (aphasia), which are classic symptoms of an ischemic stroke affecting the dominant hemisphere (assuming the patient is right-handed).
+Please note that only a qualified physician can confirm a diagnosis after a thorough evaluation, including imaging studies
+2025-07-31 07:13:09,255 - llm_clients - INFO - Query Latency: 10.9129 seconds
+2025-07-31 07:13:09,255 - llm_clients - INFO - Extracted Condition: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  8.55it/s]
+2025-07-31 07:13:09,844 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 22.223s
+   ⚠️  Test PARTIAL - ⚠️ Level 5 != expected 2. ⚠️ Condition 'generic medical query' != expected ['acute stroke', 'cerebrovascular accident'].
+🔍 level3_001: Level 3: Generic medical terms requiring semantic search
+Query: 'emergency management of cardiovascular crisis'
+Expected Level: 3
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:13:09,854 - llm_clients - INFO - Calling Medical LLM with query: emergency management of cardiovascular crisis
+2025-07-31 07:13:20,094 - llm_clients - INFO - Raw LLM Response: Cardiac Arrest (or, in context of crisis not yet arrest: Acute Cardiogenic Emergency, e.g., STEMI)
+- Note: As a text-based AI assistant, not a clinician, I don't provide medical advice. The term given here represents the most critical cardiovascular crisis requiring immediate emergency intervention. Cardiac arrest implies the heart has stopped pumping, while acute cardiogenic emergency (e.g., ST-elevation myocardial infarction, or STEMI) signifies severe heart
+2025-07-31 07:13:20,095 - llm_clients - INFO - Query Latency: 10.2402 seconds
+2025-07-31 07:13:20,095 - llm_clients - INFO - Extracted Condition: Cardiac Arrest (or, in context of crisis not yet arrest: Acute Cardiogenic Emergency, e.g., STEMI)
+2025-07-31 07:13:20,095 - user_prompt - INFO - Starting semantic search fallback for query: 'emergency management of cardiovascular crisis'
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15.11it/s]
+2025-07-31 07:13:20,681 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:13:20,713 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.75it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.28it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 56.29it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.79it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.12it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.44it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.88it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.20it/s]
+2025-07-31 07:13:20,905 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:13:20,905 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:13:20,905 - user_prompt - INFO - No suitable condition found in semantic search
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15.96it/s]
+2025-07-31 07:13:21,492 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 11.647s
+   ⚠️  Test PARTIAL - ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
+🔍 level3_002: Level 3: Medical terminology requiring semantic fallback
+Query: 'urgent neurological intervention protocols'
+Expected Level: 3
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:13:21,501 - llm_clients - INFO - Calling Medical LLM with query: urgent neurological intervention protocols
+2025-07-31 07:13:30,536 - llm_clients - INFO - Raw LLM Response: The most representative condition: Acute Ischemic Stroke (requiring urgent neurointervention, such as thrombectomy)
+Explanation: The phrase "urgent neurological intervention protocols" typically refers to time-critical situations in neurology, and among these, acute ischemic stroke is a prime example. Acute ischemic stroke necessitates rapid evaluation and intervention, including thrombectomy, to restore blood flow and minimize brain damage. This condition demands urgent action due to its narrow therapeutic window, typically within
+2025-07-31 07:13:30,537 - llm_clients - INFO - Query Latency: 9.0352 seconds
+2025-07-31 07:13:30,537 - llm_clients - INFO - Extracted Condition: The most representative condition: Acute Ischemic Stroke (requiring urgent neurointervention, such as thrombectomy)
+2025-07-31 07:13:30,537 - user_prompt - INFO - Starting semantic search fallback for query: 'urgent neurological intervention protocols'
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  7.94it/s]
+2025-07-31 07:13:31,115 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:13:31,123 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.96it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46.55it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47.09it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 58.23it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.16it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.05it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.42it/s]
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.08it/s]
+2025-07-31 07:13:31,334 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:13:31,334 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:13:31,334 - user_prompt - INFO - No suitable condition found in semantic search
+Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 16.31it/s]
+2025-07-31 07:13:31,889 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 10.398s
+   ⚠️  Test PARTIAL - ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
+🔍 level4a_001: Level 4a: Non-medical query should be rejected
+Query: 'how to cook pasta properly?'
+Expected Level: 4
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:13:31,899 - llm_clients - INFO - Calling Medical LLM with query: how to cook pasta properly?
+2025-07-31 07:13:41,038 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address cooking techniques, only medical conditions. However, for context (not advice): This query doesn't represent a medical condition; it's about culinary practice. In this case, "properly" cooking pasta typically means achieving al dente texture (not overly soft) by boiling in adequately salted water for the recommended time on the package, then draining well. This is unrelated to any health condition unless discussing, hypothetically, gastrointestinal tolerance in specific patients (e
+2025-07-31 07:13:41,038 - llm_clients - INFO - Query Latency: 9.1386 seconds
+2025-07-31 07:13:41,038 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address cooking techniques, only medical conditions. However, for context (not advice): This query doesn't represent a medical condition; it's about culinary practice. In this case, "properly" cooking pasta typically means achieving al dente texture (not overly soft) by boiling in adequately salted water for the recommended time on the package, then draining well. This is unrelated to any health condition unless discussing, hypothetically, gastrointestinal tolerance in specific patients (e
+2025-07-31 07:13:41,038 - user_prompt - INFO - Starting semantic search fallback for query: 'how to cook pasta properly?'
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.02it/s]
+2025-07-31 07:13:42,156 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:13:42,165 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  9.34it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 52.88it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.97it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.95it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.63it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.07it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.84it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.43it/s]
+2025-07-31 07:13:42,407 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:13:42,407 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:13:42,407 - user_prompt - INFO - No suitable condition found in semantic search
+2025-07-31 07:13:42,407 - llm_clients - INFO - Calling Medical LLM with query: how to cook pasta properly?
+2025-07-31 07:13:51,634 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I don't address cooking techniques, but for context (not medical advice): In terms of relevance to health, the key aspect here isn't "proper" cooking per se, but rather avoiding overcooking that can reduce nutrient content. For whole-grain pasta, aim for al dente texture (firm, not mushy) to preserve fiber and other nutrients. However, this query is not about a medical condition but a culinary practice.
+ Representative (non-medical) term
+2025-07-31 07:13:51,634 - llm_clients - INFO - Query Latency: 9.2269 seconds
+2025-07-31 07:13:51,634 - llm_clients - INFO - Extracted Condition: As a medical assistant, I don't address cooking techniques, but for context (not medical advice): In terms of relevance to health, the key aspect here isn't "proper" cooking per se, but rather avoiding overcooking that can reduce nutrient content. For whole-grain pasta, aim for al dente texture (firm, not mushy) to preserve fiber and other nutrients. However, this query is not about a medical condition but a culinary practice.
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.02it/s]
+2025-07-31 07:13:52,790 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 20.900s
+   ⚠️  Test PARTIAL - ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+🔍 level4a_002: Level 4a: Technology query should be rejected
+Query: 'best programming language to learn in 2025'
+Expected Level: 4
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:13:52,799 - llm_clients - INFO - Calling Medical LLM with query: best programming language to learn in 2025
+2025-07-31 07:14:02,339 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address technology or education preferences like "best programming language" (which is non-medical context); however, for clarity, this query is outside my biomedical scope. In 2025 or any current year, the choice of "best" programming language is subjective and depends on industry trends, personal goals, and specific use cases (e.g., web development, mobile apps, or data science). Popular choices include Python, JavaScript, and Java, but it's crucial
+2025-07-31 07:14:02,339 - llm_clients - INFO - Query Latency: 9.5400 seconds
+2025-07-31 07:14:02,339 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address technology or education preferences like "best programming language" (which is non-medical context); however, for clarity, this query is outside my biomedical scope. In 2025 or any current year, the choice of "best" programming language is subjective and depends on industry trends, personal goals, and specific use cases (e.g., web development, mobile apps, or data science). Popular choices include Python, JavaScript, and Java, but it's crucial
+2025-07-31 07:14:02,339 - user_prompt - INFO - Starting semantic search fallback for query: 'best programming language to learn in 2025'
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  6.45it/s]
+2025-07-31 07:14:02,974 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:14:02,986 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  3.16it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.42it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.06it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.97it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.85it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.99it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.63it/s]
+2025-07-31 07:14:03,457 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:14:03,457 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:14:03,457 - user_prompt - INFO - No suitable condition found in semantic search
+2025-07-31 07:14:03,457 - llm_clients - INFO - Calling Medical LLM with query: best programming language to learn in 2025
+2025-07-31 07:14:13,766 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I don't analyze technology trends or recommend programming languages; however, for clarity's sake (though out of my medical scope), in 2021 (not 2025's future prediction, as I'm bound by current data), popular choices for learning include Python, JavaScript, and Java due to their versatility, wide adoption, and job market demand. Keep in mind this information is not medical advice but rather a layman's interpretation of tech trends.
+Representative Condition (not
+2025-07-31 07:14:13,766 - llm_clients - INFO - Query Latency: 10.3088 seconds
+2025-07-31 07:14:13,767 - llm_clients - INFO - Extracted Condition: As a medical assistant, I don't analyze technology trends or recommend programming languages; however, for clarity's sake (though out of my medical scope), in 2021 (not 2025's future prediction, as I'm bound by current data), popular choices for learning include Python, JavaScript, and Java due to their versatility, wide adoption, and job market demand. Keep in mind this information is not medical advice but rather a layman's interpretation of tech trends.
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.00it/s]
+2025-07-31 07:14:14,884 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 22.107s
+   ⚠️  Test PARTIAL - ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+🔍 level4a_003: Level 4a: Weather query should be rejected
+Query: 'weather forecast for tomorrow'
+Expected Level: 4
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:14:14,905 - llm_clients - INFO - Calling Medical LLM with query: weather forecast for tomorrow
+2025-07-31 07:14:24,069 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address weather forecasts; however, for context clarification, this query is unrelated to medical conditions. The requested information here is about meteorology (weather prediction) rather than health or disease. There's no representative medical condition to provide in this case.
+If, however, you were referring indirectly to weather-sensitive health conditions (e.g., heat exhaustion, cold-induced asthma exacerbation), the specific condition would depend on the actual weather forecast details (temperature, humidity, etc.)
+2025-07-31 07:14:24,069 - llm_clients - INFO - Query Latency: 9.1634 seconds
+2025-07-31 07:14:24,069 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address weather forecasts; however, for context clarification, this query is unrelated to medical conditions. The requested information here is about meteorology (weather prediction) rather than health or disease. There's no representative medical condition to provide in this case.
+2025-07-31 07:14:24,070 - user_prompt - INFO - Starting semantic search fallback for query: 'weather forecast for tomorrow'
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.17it/s]
+2025-07-31 07:14:25,222 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:14:25,234 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.71it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.65it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.87it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.21it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.32it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.77it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 53.42it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 56.34it/s]
+2025-07-31 07:14:25,491 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:14:25,491 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:14:25,491 - user_prompt - INFO - No suitable condition found in semantic search
+2025-07-31 07:14:25,491 - llm_clients - INFO - Calling Medical LLM with query: weather forecast for tomorrow
+2025-07-31 07:14:35,356 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address weather forecasts; however, for this context (to maintain representativeness in terms unrelated to diagnosis), the phrase here isn't indicative of a medical condition. Instead, it's about environmental information—specifically, a request for meteorological data (tomorrow's weather). In medical terminology, we wouldn't classify this as a condition, but for representation's sake in a non-medical context, it can be labeled as "meteorological inquiry" or simply
+2025-07-31 07:14:35,356 - llm_clients - INFO - Query Latency: 9.8645 seconds
+2025-07-31 07:14:35,356 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address weather forecasts; however, for this context (to maintain representativeness in terms unrelated to diagnosis), the phrase here isn't indicative of a medical condition. Instead, it's about environmental information—specifically, a request for meteorological data (tomorrow's weather). In medical terminology, we wouldn't classify this as a condition, but for representation's sake in a non-medical context, it can be labeled as "meteorological inquiry" or simply
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.19it/s]
+2025-07-31 07:14:36,024 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 21.128s
+   ⚠️  Test PARTIAL - ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+🔍 level4b_001: Level 4b→5: Obscure medical query passing validation to generic search
+Query: 'rare hematologic malignancy treatment approaches'
+Expected Level: 5
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:14:36,033 - llm_clients - INFO - Calling Medical LLM with query: rare hematologic malignancy treatment approaches
+2025-07-31 07:14:45,301 - llm_clients - INFO - Raw LLM Response: The most representative condition: Myelofibrosis (or, in context of "rare" reference, could be an even less common variant like BCR-ABL1-negative atypical CML or unclassifiable myeloproliferative neoplasm)
+- For myelofibrosis, primary treatment approaches include JAK2 inhibitors (e.g., ruxolitinib), supportive care (transfusions, erythropoiesis-stimulating agents), and allog
+2025-07-31 07:14:45,302 - llm_clients - INFO - Query Latency: 9.2678 seconds
+2025-07-31 07:14:45,302 - llm_clients - INFO - Extracted Condition: The most representative condition: Myelofibrosis (or, in context of "rare" reference, could be an even less common variant like BCR-ABL1-negative atypical CML or unclassifiable myeloproliferative neoplasm)
+2025-07-31 07:14:45,302 - user_prompt - INFO - Starting semantic search fallback for query: 'rare hematologic malignancy treatment approaches'
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00,  2.00it/s]
+2025-07-31 07:14:46,428 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:14:46,436 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.59it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.61it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.66it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.40it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.09it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.42it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 58.98it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 66.70it/s]
+2025-07-31 07:14:46,672 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:14:46,672 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:14:46,672 - user_prompt - INFO - No suitable condition found in semantic search
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.28it/s]
+2025-07-31 07:14:47,160 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 11.137s
+   🎉 Test PASSED - Expected behavior achieved
+🔍 level4b_002: Level 4b→5: Rare condition requiring generic medical search
+Query: 'idiopathic thrombocytopenic purpura management guidelines'
+Expected Level: 5
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:14:47,170 - llm_clients - INFO - Calling Medical LLM with query: idiopathic thrombocytopenic purpura management guidelines
+2025-07-31 07:14:56,483 - llm_clients - INFO - Raw LLM Response: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
+(As a medical assistant, I do not provide advice, but here's the relevant condition with context for a knowledge reference.)
+In this case, the most representative condition is Idiopathic Thrombocytopenic Purpura (ITP), an autoimmune disorder characterized by low platelet count (thrombocytopenia) without identifiable underlying causes. Management guidelines typically involve
+2025-07-31 07:14:56,484 - llm_clients - INFO - Query Latency: 9.3136 seconds
+2025-07-31 07:14:56,484 - llm_clients - INFO - Extracted Condition: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
+2025-07-31 07:14:56,484 - user_prompt - INFO - Starting semantic search fallback for query: 'idiopathic thrombocytopenic purpura management guidelines'
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.14it/s]
+2025-07-31 07:14:57,082 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:14:57,090 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.83it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 51.94it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.06it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.59it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.81it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.78it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.76it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.14it/s]
+2025-07-31 07:14:57,296 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:14:57,296 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:14:57,296 - user_prompt - INFO - No suitable condition found in semantic search
+2025-07-31 07:14:57,296 - llm_clients - INFO - Calling Medical LLM with query: idiopathic thrombocytopenic purpura management guidelines
+2025-07-31 07:15:06,621 - llm_clients - INFO - Raw LLM Response: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
+(As a medical assistant, I don't provide advice, but describe the condition and point to standard guidelines. For ITP management, refer to professional sources like the American Society of Hematology [ASH] or National Institutes of Health [NIH].)
+Idiopathic Thrombocytopenic Purpura (ITP) is an autoimmune disorder characterized by low platelet count
+2025-07-31 07:15:06,621 - llm_clients - INFO - Query Latency: 9.3245 seconds
+2025-07-31 07:15:06,621 - llm_clients - INFO - Extracted Condition: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.12it/s]
+2025-07-31 07:15:07,391 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 20.228s
+   🎉 Test PASSED - Expected behavior achieved
+🔍 level4b_003: Level 4b→5: Rare emergency condition → generic search
+Query: 'necrotizing fasciitis surgical intervention protocols'
+Expected Level: 5
+----------------------------------------------------------------------
+🎯 Executing multilevel fallback...
+2025-07-31 07:15:07,398 - llm_clients - INFO - Calling Medical LLM with query: necrotizing fasciitis surgical intervention protocols
+2025-07-31 07:15:16,625 - llm_clients - INFO - Raw LLM Response: The primary medical condition: Necrotizing Fasciitis
+In this context, the key condition is Necrotizing Fasciitis, a severe bacterial infection characterized by rapid destruction of subcutaneous tissue and fascia. The term provided, "surgical intervention protocols," refers to the treatment approach rather than a distinct medical condition. However, for clarity in this answer, I'll address it as it pertains to managing Necrotizing Fasciitis.
+In Necrotizing Fasciitis, surgical
+2025-07-31 07:15:16,625 - llm_clients - INFO - Query Latency: 9.2271 seconds
+2025-07-31 07:15:16,625 - llm_clients - INFO - Extracted Condition: The primary medical condition: Necrotizing Fasciitis
+2025-07-31 07:15:16,625 - user_prompt - INFO - Starting semantic search fallback for query: 'necrotizing fasciitis surgical intervention protocols'
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.01it/s]
+2025-07-31 07:15:17,212 - retrieval - INFO - Sliding window search: Found 5 results
+2025-07-31 07:15:17,222 - user_prompt - INFO - Semantic search returned 5 results
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.01it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.04it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.57it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.92it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.15it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.28it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.83it/s]
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.38it/s]
+2025-07-31 07:15:17,449 - user_prompt - INFO - Inferred condition: None
+2025-07-31 07:15:17,449 - user_prompt - WARNING - Condition validation failed for: None
+2025-07-31 07:15:17,449 - user_prompt - INFO - No suitable condition found in semantic search
+2025-07-31 07:15:17,449 - llm_clients - INFO - Calling Medical LLM with query: necrotizing fasciitis surgical intervention protocols
+2025-07-31 07:15:24,511 - llm_clients - INFO - Raw LLM Response: The most representative condition: Necrotizing Fasciitis
+(As a medical assistant, I do not provide advice, only identify conditions. For necrotizing fasciitis, surgical intervention typically involves aggressive debridement—removing dead tissue—and may require repeated procedures until healthy margins are achieved. This is accompanied by supportive care and antibiotics.)
+2025-07-31 07:15:24,511 - llm_clients - INFO - Query Latency: 7.0619 seconds
+2025-07-31 07:15:24,511 - llm_clients - INFO - Extracted Condition: The most representative condition: Necrotizing Fasciitis
+Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.83it/s]
+2025-07-31 07:15:25,078 - retrieval - INFO - Sliding window search: Found 5 results
+   ✅ Detected Level: 5
+   Condition: generic medical query
+   Emergency Keywords: medical|emergency
+   Treatment Keywords: treatment|management
+   Execution Time: 17.692s
+   🎉 Test PASSED - Expected behavior achieved
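
Note on the level4a runs above: in each case the raw LLM response says outright that the query is not medical ("unrelated to medical conditions", "isn't indicative of a medical condition", "out of my medical scope"), yet the pipeline still ends at Level 5. A minimal sketch of how such responses could be flagged follows. The helper name `looks_non_medical` and the marker list are hypothetical; the phrases are copied from the logged responses, and this is not the project's actual `validate_medical_query` logic.

```python
# Hypothetical marker phrases, copied from the LLM responses logged above.
# A real validator would need a broader, tested list.
NON_MEDICAL_MARKERS = (
    "unrelated to medical conditions",
    "isn't indicative of a medical condition",
    "no representative medical condition",
    "out of my medical scope",
)


def looks_non_medical(llm_response: str) -> bool:
    """Return True if the response contains one of the refusal markers."""
    text = (llm_response or "").lower()
    return any(marker in text for marker in NON_MEDICAL_MARKERS)


if __name__ == "__main__":
    weather = ("As a medical assistant, I do not address weather forecasts; "
               "this query is unrelated to medical conditions.")
    itp = "The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)"
    print(looks_non_medical(weather))  # flagged as non-medical
    print(looks_non_medical(itp))      # not flagged
```

Substring matching of this kind is brittle: the necrotizing fasciitis response above contains "rather than a distinct medical condition", which only avoids a false positive because none of the markers happens to match it. A structured LLM output (for example, an explicit yes/no field) would likely be more robust.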
+================================================================================
+📊 MULTILEVEL FALLBACK TEST REPORT
+================================================================================
+🕐 Execution Summary:
+   Total duration: 187.465s
+   Average per test: 14.420s
+📈 Test Results:
+   Total tests: 13
+   Passed: 7 ✅
+   Partial: 6 ⚠️
+   Failed: 6 ❌
+   Success rate: 53.8%
+🎯 Level Distribution Analysis:
+   Level 1 (Predefined Mapping): 4 tests, avg 5.752s
+   Level 5 (Generic Search): 9 tests, avg 17.495s
+📋 Category Analysis:
+   level1_predefined: 3/3 (100.0%)
+   level2_llm: 1/2 (50.0%)
+   level3_semantic: 0/2 (0.0%)
+   level4a_rejection: 0/3 (0.0%)
+   level4b_to_5: 3/3 (100.0%)
+📝 Detailed Test Results:
+   level1_001: ✅ PASS
+      Query: 'acute myocardial infarction treatment'
+      Expected Level: 1
+      Detected Level: 1
+      Condition: acute myocardial infarction
+      Time: 0.000s
+      Validation: ✅ Level 1 as expected. ✅ Condition 'acute myocardial infarction' matches expected.
+   level1_002: ✅ PASS
+      Query: 'how to manage acute stroke?'
+      Expected Level: 1
+      Detected Level: 1
+      Condition: acute stroke
+      Time: 0.000s
+      Validation: ✅ Level 1 as expected. ✅ Condition 'acute stroke' matches expected.
+   level1_003: ✅ PASS
+      Query: 'pulmonary embolism emergency protocol'
+      Expected Level: 1
+      Detected Level: 1
+      Condition: pulmonary embolism
+      Time: 0.000s
+      Validation: ✅ Level 1 as expected. ✅ Condition 'pulmonary embolism' matches expected.
+   level2_001: ✅ PASS
+      Query: 'patient with severe crushing chest pain radiating to left arm'
+      Expected Level: 2
+      Detected Level: 1
+      Condition: acute myocardial infarction
+      Time: 23.008s
+      Validation: ⚠️ Level 1 != expected 2. ✅ Condition 'acute myocardial infarction' matches expected.
+   level2_002: ⚠️ PARTIAL
+      Query: 'sudden onset weakness on right side with speech difficulty'
+      Expected Level: 2
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 22.223s
+      Validation: ⚠️ Level 5 != expected 2. ⚠️ Condition 'generic medical query' != expected ['acute stroke', 'cerebrovascular accident'].
+   level3_001: ⚠️ PARTIAL
+      Query: 'emergency management of cardiovascular crisis'
+      Expected Level: 3
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 11.647s
+      Validation: ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
+   level3_002: ⚠️ PARTIAL
+      Query: 'urgent neurological intervention protocols'
+      Expected Level: 3
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 10.398s
+      Validation: ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
+   level4a_001: ⚠️ PARTIAL
+      Query: 'how to cook pasta properly?'
+      Expected Level: 4
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 20.900s
+      Validation: ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+   level4a_002: ⚠️ PARTIAL
+      Query: 'best programming language to learn in 2025'
+      Expected Level: 4
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 22.107s
+      Validation: ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+   level4a_003: ⚠️ PARTIAL
+      Query: 'weather forecast for tomorrow'
+      Expected Level: 4
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 21.128s
+      Validation: ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+   level4b_001: ✅ PASS
+      Query: 'rare hematologic malignancy treatment approaches'
+      Expected Level: 5
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 11.137s
+      Validation: ✅ Level 5 as expected. ✅ Generic medical search triggered.
+   level4b_002: ✅ PASS
+      Query: 'idiopathic thrombocytopenic purpura management guidelines'
+      Expected Level: 5
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 20.228s
+      Validation: ✅ Level 5 as expected. ✅ Generic medical search triggered.
+   level4b_003: ✅ PASS
+      Query: 'necrotizing fasciitis surgical intervention protocols'
+      Expected Level: 5
+      Detected Level: 5
+      Condition: generic medical query
+      Time: 17.692s
+      Validation: ✅ Level 5 as expected. ✅ Generic medical search triggered.
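
The summary figures in the report can be cross-checked against the detailed results. The sketch below recomputes the pass/partial counts, the per-category tallies, and the success rate from the thirteen statuses listed above. The `RESULTS` list and the `summarize` helper are illustrative only and are not part of the test suite. The recount gives 7 passed and 6 partial, which already account for all 13 tests, so the report's "Failed: 6" line does not add to the total. One possible explanation, an assumption since the reporting code is not shown here, is that the failed counter counts the same non-passing tests again.

```python
from collections import Counter

# (test_id, status) pairs copied from the "Detailed Test Results" above.
RESULTS = [
    ("level1_001", "PASS"), ("level1_002", "PASS"), ("level1_003", "PASS"),
    ("level2_001", "PASS"), ("level2_002", "PARTIAL"),
    ("level3_001", "PARTIAL"), ("level3_002", "PARTIAL"),
    ("level4a_001", "PARTIAL"), ("level4a_002", "PARTIAL"), ("level4a_003", "PARTIAL"),
    ("level4b_001", "PASS"), ("level4b_002", "PASS"), ("level4b_003", "PASS"),
]


def summarize(results):
    """Return (status counts, {category: (passed, total)}, success rate in %)."""
    status_counts = Counter(status for _, status in results)
    per_category = {}
    for test_id, status in results:
        category = test_id.rsplit("_", 1)[0]  # "level4a_001" -> "level4a"
        passed, seen = per_category.get(category, (0, 0))
        per_category[category] = (passed + int(status == "PASS"), seen + 1)
    success_rate = 100.0 * status_counts["PASS"] / len(results)
    return status_counts, per_category, success_rate


if __name__ == "__main__":
    counts, categories, rate = summarize(RESULTS)
    print(dict(counts))
    for name, (passed, seen) in categories.items():
        print(f"{name}: {passed}/{seen}")
    print(f"Success rate: {rate:.1f}%")
```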

tests/{result_of_test_userinput_userprompt_medical_condition_llm.txt → result_of_test_userinput_userprompt_medical_condition_llm.md} RENAMED

File without changes

tests/test_multilevel_fallback_validation.py ADDED

	@@ -0,0 +1,537 @@

+#!/usr/bin/env python3
+"""
+Multi-Level Fallback Validation Test Suite for OnCall.ai
+This test specifically validates the 5-level fallback mechanism:
+Level 1: Predefined Mapping (Fast Path)
+Level 2: Llama3-Med42-70B Extraction
+Level 3: Semantic Search Fallback
+Level 4: Medical Query Validation
+Level 5: Generic Medical Search
+Author: OnCall.ai Team
+Date: 2025-07-30
+"""
+import sys
+import os
+from pathlib import Path
+import logging
+import json
+import traceback
+from datetime import datetime
+from typing import Dict, List, Any, Optional
+# Add src directory to Python path
+current_dir = Path(__file__).parent
+project_root = current_dir.parent
+src_dir = project_root / "src"
+sys.path.insert(0, str(src_dir))
+# Import our modules
+try:
+    from user_prompt import UserPromptProcessor
+    from retrieval import BasicRetrievalSystem
+    from llm_clients import llm_Med42_70BClient
+    from medical_conditions import CONDITION_KEYWORD_MAPPING
+except ImportError as e:
+    print(f"❌ Import Error: {e}")
+    print(f"Current working directory: {os.getcwd()}")
+    print(f"Python path: {sys.path}")
+    sys.exit(1)
+# Configure logging
+logging.basicConfig(
+    level=logging.INFO,
+    format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+    handlers=[
+        logging.StreamHandler(),
+        logging.FileHandler(project_root / 'tests' / 'multilevel_fallback_test.log')
+    ]
+)
+logger = logging.getLogger(__name__)
+class MultilevelFallbackTest:
+    """Test suite specifically for the 5-level fallback mechanism"""
+    def __init__(self):
+        """Initialize test suite"""
+        self.start_time = datetime.now()
+        self.results = []
+        self.components_initialized = False
+        # Component references
+        self.llm_client = None
+        self.retrieval_system = None
+        self.user_prompt_processor = None
+    def initialize_components(self):
+        """Initialize all pipeline components"""
+        print("🔧 Initializing Components for Multilevel Fallback Test...")
+        print("-" * 60)
+        try:
+            # Initialize LLM client
+            print("1. Initializing Llama3-Med42-70B Client...")
+            self.llm_client = llm_Med42_70BClient()
+            print("   ✅ LLM client initialized")
+            # Initialize retrieval system
+            print("2. Initializing Retrieval System...")
+            self.retrieval_system = BasicRetrievalSystem()
+            print("   ✅ Retrieval system initialized")
+            # Initialize user prompt processor
+            print("3. Initializing User Prompt Processor...")
+            self.user_prompt_processor = UserPromptProcessor(
+                llm_client=self.llm_client,
+                retrieval_system=self.retrieval_system
+            )
+            print("   ✅ User prompt processor initialized")
+            self.components_initialized = True
+            print("\n🎉 All components initialized successfully!")
+        except Exception as e:
+            logger.error(f"Component initialization failed: {e}")
+            print(f"❌ Component initialization failed: {e}")
+            traceback.print_exc()
+            self.components_initialized = False
+    def get_multilevel_test_cases(self) -> List[Dict[str, Any]]:
+        """Define test cases specifically targeting each fallback level"""
+        return [
+            # Level 1: Predefined Mapping Tests
+            {
+                "id": "level1_001",
+                "query": "acute myocardial infarction treatment",
+                "description": "Level 1: Direct predefined condition match",
+                "expected_level": 1,
+                "expected_condition": "acute myocardial infarction",
+                "expected_source": "predefined_mapping",
+                "category": "level1_predefined"
+            },
+            {
+                "id": "level1_002",
+                "query": "how to manage acute stroke?",
+                "description": "Level 1: Predefined stroke condition",
+                "expected_level": 1,
+                "expected_condition": "acute stroke",
+                "expected_source": "predefined_mapping",
+                "category": "level1_predefined"
+            },
+            {
+                "id": "level1_003",
+                "query": "pulmonary embolism emergency protocol",
+                "description": "Level 1: Predefined PE condition",
+                "expected_level": 1,
+                "expected_condition": "pulmonary embolism",
+                "expected_source": "predefined_mapping",
+                "category": "level1_predefined"
+            },
+            # Level 2: LLM Extraction Tests
+            {
+                "id": "level2_001",
+                "query": "patient with severe crushing chest pain radiating to left arm",
+                "description": "Level 2: Symptom-based query requiring LLM analysis",
+                "expected_level": 2,
+                "expected_condition": ["acute myocardial infarction", "acute coronary syndrome"],
+                "expected_source": "llm_extraction",
+                "category": "level2_llm"
+            },
+            {
+                "id": "level2_002",
+                "query": "sudden onset weakness on right side with speech difficulty",
+                "description": "Level 2: Neurological symptoms requiring LLM",
+                "expected_level": 2,
+                "expected_condition": ["acute stroke", "cerebrovascular accident"],
+                "expected_source": "llm_extraction",
+                "category": "level2_llm"
+            },
+            # Level 3: Semantic Search Tests
+            {
+                "id": "level3_001",
+                "query": "emergency management of cardiovascular crisis",
+                "description": "Level 3: Generic medical terms requiring semantic search",
+                "expected_level": 3,
+                "expected_source": "semantic_search",
+                "category": "level3_semantic"
+            },
+            {
+                "id": "level3_002",
+                "query": "urgent neurological intervention protocols",
+                "description": "Level 3: Medical terminology requiring semantic fallback",
+                "expected_level": 3,
+                "expected_source": "semantic_search",
+                "category": "level3_semantic"
+            },
+            # Level 4a: Non-Medical Query Rejection
+            {
+                "id": "level4a_001",
+                "query": "how to cook pasta properly?",
+                "description": "Level 4a: Non-medical query should be rejected",
+                "expected_level": 4,
+                "expected_result": "invalid_query",
+                "expected_source": "validation_rejection",
+                "category": "level4a_rejection"
+            },
+            {
+                "id": "level4a_002",
+                "query": "best programming language to learn in 2025",
+                "description": "Level 4a: Technology query should be rejected",
+                "expected_level": 4,
+                "expected_result": "invalid_query",
+                "expected_source": "validation_rejection",
+                "category": "level4a_rejection"
+            },
+            {
+                "id": "level4a_003",
+                "query": "weather forecast for tomorrow",
+                "description": "Level 4a: Weather query should be rejected",
+                "expected_level": 4,
+                "expected_result": "invalid_query",
+                "expected_source": "validation_rejection",
+                "category": "level4a_rejection"
+            },
+            # Level 4b + 5: Obscure Medical Terms → Generic Search
+            {
+                "id": "level4b_001",
+                "query": "rare hematologic malignancy treatment approaches",
+                "description": "Level 4b→5: Obscure medical query passing validation to generic search",
+                "expected_level": 5,
+                "expected_condition": "generic medical query",
+                "expected_source": "generic_search",
+                "category": "level4b_to_5"
+            },
+            {
+                "id": "level4b_002",
+                "query": "idiopathic thrombocytopenic purpura management guidelines",
+                "description": "Level 4b→5: Rare condition requiring generic medical search",
+                "expected_level": 5,
+                "expected_condition": "generic medical query",
+                "expected_source": "generic_search",
+                "category": "level4b_to_5"
+            },
+            {
+                "id": "level4b_003",
+                "query": "necrotizing fasciitis surgical intervention protocols",
+                "description": "Level 4b→5: Rare emergency condition → generic search",
+                "expected_level": 5,
+                "expected_condition": "generic medical query",
+                "expected_source": "generic_search",
+                "category": "level4b_to_5"
+            }
+        ]
+    def run_single_fallback_test(self, test_case: Dict[str, Any]) -> Dict[str, Any]:
+        """Execute a single fallback test case with level detection"""
+        test_id = test_case["id"]
+        query = test_case["query"]
+        print(f"\n🔍 {test_id}: {test_case['description']}")
+        print(f"Query: '{query}'")
+        print(f"Expected Level: {test_case.get('expected_level', 'Unknown')}")
+        print("-" * 70)
+        result = {
+            "test_id": test_id,
+            "test_case": test_case,
+            "timestamp": datetime.now().isoformat(),
+            "success": False,
+            "error": None,
+            "execution_time": 0,
+            "detected_level": None,
+            "condition_result": {}
+        }
+        start_time = datetime.now()
+        try:
+            # Execute condition extraction with level detection
+            print("🎯 Executing multilevel fallback...")
+            condition_start = datetime.now()
+            condition_result = self.user_prompt_processor.extract_condition_keywords(query)
+            condition_time = (datetime.now() - condition_start).total_seconds()
+            # Detect which level was used
+            detected_level = self._detect_fallback_level(condition_result)
+            result["condition_result"] = condition_result
+            result["detected_level"] = detected_level
+            result["execution_time"] = condition_time
+            print(f"   ✅ Detected Level: {detected_level}")
+            print(f"   Condition: {condition_result.get('condition', 'None')}")
+            print(f"   Emergency Keywords: {condition_result.get('emergency_keywords', 'None')}")
+            print(f"   Treatment Keywords: {condition_result.get('treatment_keywords', 'None')}")
+            print(f"   Execution Time: {condition_time:.3f}s")
+            # Validate expected behavior
+            validation_result = self._validate_expected_behavior(test_case, detected_level, condition_result)
+            result.update(validation_result)
+            if result["success"]:
+                print("   🎉 Test PASSED - Expected behavior achieved")
+            else:
+                print(f"   ⚠️  Test PARTIAL - {result.get('validation_message', 'Unexpected behavior')}")
+        except Exception as e:
+            total_time = (datetime.now() - start_time).total_seconds()
+            result["execution_time"] = total_time
+            result["error"] = str(e)
+            result["traceback"] = traceback.format_exc()
+            logger.error(f"Test {test_id} failed: {e}")
+            print(f"   ❌ Test FAILED: {e}")
+        return result
+    def _detect_fallback_level(self, condition_result: Dict[str, Any]) -> int:
+        """Detect which fallback level was used based on the result"""
+        if not condition_result:
+            return 0  # No result
+        # Check for validation rejection (Level 4a)
+        if condition_result.get('type') == 'invalid_query':
+            return 4
+        # Check for generic search (Level 5)
+        if condition_result.get('condition') == 'generic medical query':
+            return 5
+        # Check for semantic search (Level 3)
+        if 'semantic_confidence' in condition_result:
+            return 3
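+        # Check for predefined mapping (Level 1).
+        # Note: a condition extracted by the LLM (Level 2) that is also a key in
+        # CONDITION_KEYWORD_MAPPING is indistinguishable from a Level 1 hit here;
+        # see level2_001 in the results (detected Level 1 after 23.008s).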
+        condition = condition_result.get('condition', '')
+        if condition and condition in CONDITION_KEYWORD_MAPPING:
+            return 1
+        # Otherwise assume LLM extraction (Level 2)
+        if condition:
+            return 2
+        return 0  # Unknown
+    def _validate_expected_behavior(self, test_case: Dict[str, Any], detected_level: int,
+                                  condition_result: Dict[str, Any]) -> Dict[str, Any]:
+        """Validate if the test behaved as expected"""
+        expected_level = test_case.get('expected_level')
+        validation_result = {
+            "level_match": detected_level == expected_level,
+            "condition_match": False,
+            "success": False,
+            "validation_message": ""
+        }
+        # Check level match
+        if validation_result["level_match"]:
+            validation_result["validation_message"] += f"✅ Level {detected_level} as expected. "
+        else:
+            validation_result["validation_message"] += f"⚠️ Level {detected_level} != expected {expected_level}. "
+        # Check condition/result match based on test type
+        if test_case["category"] == "level4a_rejection":
+            # Should be rejected
+            validation_result["condition_match"] = condition_result.get('type') == 'invalid_query'
+            if validation_result["condition_match"]:
+                validation_result["validation_message"] += "✅ Query correctly rejected. "
+            else:
+                validation_result["validation_message"] += "⚠️ Query should have been rejected. "
+        elif test_case["category"] == "level4b_to_5":
+            # Should result in generic medical query
+            validation_result["condition_match"] = condition_result.get('condition') == 'generic medical query'
+            if validation_result["condition_match"]:
+                validation_result["validation_message"] += "✅ Generic medical search triggered. "
+            else:
+                validation_result["validation_message"] += "⚠️ Should trigger generic medical search. "
+        else:
+            # Check expected condition
+            expected_conditions = test_case.get('expected_condition', [])
+            if isinstance(expected_conditions, str):
+                expected_conditions = [expected_conditions]
+            # Guard against an explicit None so the .lower() call below cannot fail
+            actual_condition = condition_result.get('condition') or ''
+            validation_result["condition_match"] = any(
+                expected.lower() in actual_condition.lower()
+                for expected in expected_conditions
+            )
+            if validation_result["condition_match"]:
+                validation_result["validation_message"] += f"✅ Condition '{actual_condition}' matches expected. "
+            else:
+                validation_result["validation_message"] += f"⚠️ Condition '{actual_condition}' != expected {expected_conditions}. "
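+        # Overall success: a test counts as passed when EITHER the level or the
+        # condition matches; switch to `and` if both must hold.
+        validation_result["success"] = validation_result["level_match"] or validation_result["condition_match"]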
+        return validation_result
+    def run_all_fallback_tests(self):
+        """Execute all fallback tests and generate report"""
+        if not self.components_initialized:
+            print("❌ Cannot run tests: components not initialized")
+            return
+        test_cases = self.get_multilevel_test_cases()
+        print("\n🚀 Starting Multilevel Fallback Test Suite")
+        print(f"Total test cases: {len(test_cases)}")
+        print(f"Test started at: {self.start_time.strftime('%Y-%m-%d %H:%M:%S')}")
+        print("=" * 80)
+        # Execute all tests
+        for test_case in test_cases:
+            result = self.run_single_fallback_test(test_case)
+            self.results.append(result)
+        # Generate report
+        self.generate_fallback_report()
+        self.save_fallback_results()
+    def generate_fallback_report(self):
+        """Generate detailed fallback analysis report"""
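+        if not self.results:
+            # Avoid division by zero in the averages and success rate below
+            print("⚠️ No test results to report")
+            return
+        end_time = datetime.now()
+        total_duration = (end_time - self.start_time).total_seconds()
+        successful_tests = [r for r in self.results if r['success']]
+        # Failed = raised an error; Partial = ran but did not meet expectations
+        failed_tests = [r for r in self.results if not r['success'] and r.get('error')]
+        partial_tests = [r for r in self.results if not r['success'] and not r.get('error')]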
+        print("\n" + "=" * 80)
+        print("📊 MULTILEVEL FALLBACK TEST REPORT")
+        print("=" * 80)
+        # Overall Statistics
+        print("🕐 Execution Summary:")
+        print(f"   Total duration: {total_duration:.3f}s")
+        print(f"   Average per test: {total_duration/len(self.results):.3f}s")
+        print("\n📈 Test Results:")
+        print(f"   Total tests: {len(self.results)}")
+        print(f"   Passed: {len(successful_tests)} ✅")
+        print(f"   Partial: {len(partial_tests)} ⚠️")
+        print(f"   Failed: {len(failed_tests)} ❌")
+        print(f"   Success rate: {len(successful_tests)/len(self.results)*100:.1f}%")
+        # Level Distribution Analysis
+        level_distribution = {}
+        level_performance = {}
+        for result in self.results:
+            if not result.get('error'):
+                level = result.get('detected_level', 0)
+                level_distribution[level] = level_distribution.get(level, 0) + 1
+                if level not in level_performance:
+                    level_performance[level] = []
+                level_performance[level].append(result['execution_time'])
+        print("\n🎯 Level Distribution Analysis:")
+        for level in sorted(level_distribution.keys()):
+            count = level_distribution[level]
+            avg_time = sum(level_performance[level]) / len(level_performance[level])
+            level_name = {
+                1: "Predefined Mapping",
+                2: "LLM Extraction",
+                3: "Semantic Search",
+                4: "Validation Rejection",
+                5: "Generic Search"
+            }.get(level, f"Unknown ({level})")
+            print(f"   Level {level} ({level_name}): {count} tests, avg {avg_time:.3f}s")
+        # Category Analysis
+        categories = {}
+        for result in self.results:
+            category = result['test_case']['category']
+            if category not in categories:
+                categories[category] = {'total': 0, 'passed': 0}
+            categories[category]['total'] += 1
+            if result['success']:
+                categories[category]['passed'] += 1
+        print("\n📋 Category Analysis:")
+        for category, stats in categories.items():
+            success_rate = stats['passed'] / stats['total'] * 100
+            print(f"   {category}: {stats['passed']}/{stats['total']} ({success_rate:.1f}%)")
+        # Detailed Results
+        print("\n📝 Detailed Test Results:")
+        for result in self.results:
+            test_case = result['test_case']
+            status = "✅ PASS" if result['success'] else ("❌ FAIL" if result.get('error') else "⚠️ PARTIAL")
+            print(f"\n   {result['test_id']}: {status}")
+            print(f"      Query: '{test_case['query']}'")
+            print(f"      Expected Level: {test_case.get('expected_level', 'N/A')}")
+            print(f"      Detected Level: {result.get('detected_level', 'N/A')}")
+            print(f"      Condition: {(result.get('condition_result') or {}).get('condition', 'None')}")
+            print(f"      Time: {result['execution_time']:.3f}s")
+            if result.get('validation_message'):
+                print(f"      Validation: {result['validation_message']}")
+            if result.get('error'):
+                print(f"      Error: {result['error']}")
+        print("\n" + "=" * 80)
+    def save_fallback_results(self):
+        """Save detailed test results to JSON file"""
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        filename = project_root / 'tests' / f'multilevel_fallback_results_{timestamp}.json'
+        try:
+            comprehensive_results = {
+                "test_metadata": {
+                    "timestamp": datetime.now().isoformat(),
+                    "test_type": "multilevel_fallback_validation",
+                    "total_duration_seconds": (datetime.now() - self.start_time).total_seconds(),
+                    "total_tests": len(self.results),
+                    "passed_tests": len([r for r in self.results if r['success']]),
+                    # Partial = ran without error but missed expectations; Failed = raised an error
+                    "partial_tests": len([r for r in self.results if not r['success'] and not r.get('error')]),
+                    "failed_tests": len([r for r in self.results if not r['success'] and r.get('error')])
+                },
+                "fallback_results": self.results
+            }
+            with open(filename, 'w', encoding='utf-8') as f:
+                json.dump(comprehensive_results, f, indent=2, ensure_ascii=False)
+            print(f"📁 Multilevel fallback results saved to: {filename}")
+        except Exception as e:
+            logger.error(f"Failed to save test results: {e}")
+            print(f"⚠️ Failed to save test results: {e}")
+def main():
+    """Main execution function"""
+    print("🏥 OnCall.ai Multilevel Fallback Validation Test")
+    print("=" * 60)
+    # Initialize test suite
+    test_suite = MultilevelFallbackTest()
+    # Initialize components
+    test_suite.initialize_components()
+    if not test_suite.components_initialized:
+        print("❌ Test suite initialization failed. Exiting.")
+        return 1
+    # Run all fallback tests
+    test_suite.run_all_fallback_tests()
+    return 0
+if __name__ == "__main__":
+    exit_code = main()
+    sys.exit(exit_code)
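
The check order in `_detect_fallback_level` determines which level a result is attributed to, so it may help to exercise that logic in isolation. The following is a minimal standalone sketch, not the project's code: `PREDEFINED_CONDITIONS` is a hypothetical stand-in for `CONDITION_KEYWORD_MAPPING`, and the sample result dicts are invented for illustration.

```python
from typing import Any, Dict, Optional

# Hypothetical stand-in for the project's CONDITION_KEYWORD_MAPPING.
PREDEFINED_CONDITIONS = {
    "acute myocardial infarction",
    "acute stroke",
    "pulmonary embolism",
}


def detect_fallback_level(result: Optional[Dict[str, Any]]) -> int:
    """Return the fallback level implied by a result dict (0 = unknown)."""
    if not result:
        return 0
    # Level 4a: the validator rejected the query
    if result.get("type") == "invalid_query":
        return 4
    # Level 5: generic medical search
    if result.get("condition") == "generic medical query":
        return 5
    # Level 3: semantic search attaches a confidence score
    if "semantic_confidence" in result:
        return 3
    condition = result.get("condition") or ""
    # Level 1: the condition is one of the predefined keys
    if condition in PREDEFINED_CONDITIONS:
        return 1
    # Level 2: any other non-empty condition is assumed to come from the LLM
    return 2 if condition else 0


if __name__ == "__main__":
    samples = [
        {"condition": "acute stroke"},
        {"condition": "acute stroke", "semantic_confidence": 0.8},
        {"condition": "septic shock"},
        {"type": "invalid_query"},
    ]
    for sample in samples:
        print(sample, "->", detect_fallback_level(sample))
```

Because the `semantic_confidence` check runs before the predefined-mapping check, a predefined condition reached through semantic search is reported as Level 3 rather than Level 1. Likewise, Level 2 is inferred by elimination, so any result the earlier checks miss is counted as LLM extraction.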