YanBoChen committed
Commit fbad237 · 1 Parent(s): f24fd2b

Re-check multilevel fallback validation test suite for OnCall.ai

- Implemented a comprehensive test suite to validate the 5-level fallback mechanism for medical queries.
- Included tests for predefined mappings, LLM extraction, semantic search, medical query validation, and generic medical search.
- Added detailed logging and reporting features to track test execution and results.
- Created a structured approach to initialize components, run tests, and generate reports with success rates and execution times.

src/Todo_20250731_Multlevel_Fallback_Fix.md ADDED
@@ -0,0 +1,83 @@
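+ Great! Let's go through these detailed test results together; the data is very valuable!
+
+ ## 📊 **In-Depth Analysis of the Test Results**
+
+ ### **✅ What Worked (53.8% success rate)**
+
+ **Level 1 - Predefined mapping**: 💯 **Perfect**
+ - 3/3 tests passed, averaging 0.000s (instant response)
+ - `acute myocardial infarction`, `acute stroke`, and `pulmonary embolism` were all matched directly
+
+ **Level 4b→5 - Uncommon medical queries**: 💯 **Perfect**
+ - 3/3 tests passed, correctly falling through to generic search
+ - Rare blood disorders, ITP, and necrotizing fasciitis were all handled correctly
+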
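The 53.8% figure is consistent with 7 of the 13 test cases passing, assuming a PARTIAL result is counted as a failure. A minimal sketch of that arithmetic follows; the `results` dictionary is a hypothetical tally compiled from this report, not a data structure from the test suite itself.

```python
# Hypothetical per-category tally compiled from the report: (passed, total).
results = {
    "level1": (3, 3),
    "level2": (1, 2),
    "level3": (0, 2),
    "level4a": (0, 3),
    "level4b": (3, 3),
}


def success_rate(tally):
    """Return (passed, total, percentage) over all categories."""
    passed = sum(p for p, _ in tally.values())
    total = sum(t for _, t in tally.values())
    return passed, total, 100.0 * passed / total


passed, total, rate = success_rate(results)
print(f"{passed}/{total} passed ({rate:.1f}%)")  # 7/13 passed (53.8%)
```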
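+ ### **🔍 Key Problems Found**
+
+ #### **Problem 1: Level 4 validation is not working** ❌
+ **Symptom**: Non-medical queries (cooking, programming, weather) are all handled as medical queries
+ ```
+ - "how to cook pasta properly?" → Level 5 (should be rejected)
+ - "programming language" → Level 5 (should be rejected)
+ - "weather forecast" → Level 5 (should be rejected)
+ ```
+
+ **Root cause**: The logic in `validate_medical_query` is faulty
+ - Even when the LLM says "this is not a medical query", the function still returns `None` (meaning validation passed)
+ - It should check whether the LLM response explicitly states that the query is non-medical
+
+ #### **Problem 2: Level 3 semantic search logic** ⚠️
+ **Symptom**: Queries expected to resolve at Level 3 all fell through to Level 5
+ ```
+ - "emergency management of cardiovascular crisis" → Level 5 (expected Level 3)
+ - "urgent neurological intervention protocols" → Level 5 (expected Level 3)
+ ```
+
+ **Cause**: The `_infer_condition_from_text` method may be too strict to infer a valid condition
+
+ #### **Problem 3: Inconsistent Level 2 behavior** ⚠️
+ **Symptom**:
+ - `level2_001` passed, but was intercepted by Level 1 (the LLM extracted a known condition)
+ - `level2_002` failed: the LLM extracted a condition, but validation failed
+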
+ ## 🛠️ **Fix Priorities**
+
+ ### **Priority 1: Fix validate_medical_query**
+ ```python
+ from typing import Any, Dict, Optional
+
+ def validate_medical_query(self, user_query: str) -> Optional[Dict[str, Any]]:
+     # llama_result holds the LLM's analysis of user_query (the call is omitted from this snippet)
+     # Check whether the LLM response explicitly says the query is non-medical
+     if llama_result.get('extracted_condition'):
+         response_text = llama_result.get('raw_response', '').lower()
+
+         # Check for an explicit rejection of the query as non-medical
+         rejection_phrases = [
+             "not a medical condition",
+             "outside my medical scope",
+             "unrelated to medical conditions",
+             "do not address"
+         ]
+
+         if any(phrase in response_text for phrase in rejection_phrases):
+             return self._generate_invalid_query_response()
+
+     return None  # Validation passed
+ ```
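One caveat about the literal phrase list: the test log contains rejections worded as "I don't address ...", "I don't analyze ...", and "outside my biomedical scope", none of which contain any of the four phrases verbatim. A minimal sketch of a more tolerant check is shown here. The function name `looks_non_medical` and the patterns are assumptions derived only from the responses visible in the log, so they would need tuning against real traffic.

```python
import re

# Assumed patterns, derived from the non-medical responses in the test log.
_REJECTION_PATTERNS = [
    r"\b(?:do not|don['’]t) (?:address|analyze)\b",
    r"\b(?:not (?:about )?|doesn['’]t represent )a medical condition\b",
    r"\b(?:outside|out of) my (?:bio)?medical scope\b",
    r"\bunrelated to (?:any )?(?:medical|health) conditions?\b",
]
_REJECTION_RE = re.compile("|".join(_REJECTION_PATTERNS), re.IGNORECASE)


def looks_non_medical(raw_response: str) -> bool:
    """Return True if the LLM response explicitly disclaims a medical topic."""
    return _REJECTION_RE.search(raw_response) is not None


# Excerpts taken from the logged responses.
print(looks_non_medical("As a medical assistant, I don't address cooking techniques"))
print(looks_non_medical("as an AI assistant, I don't diagnose; this is based on common presentation"))
```

The second excerpt comes from a genuine medical answer in the log; it is not flagged because "don't diagnose" is deliberately left out of the patterns.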
+
+ ### **Priority 2: Improve condition inference in semantic search**
+ The similarity threshold in `_infer_condition_from_text` may be too high (0.7); consider lowering it to 0.5
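To illustrate why the threshold matters, here is a minimal sketch of threshold-gated inference using cosine similarity. It is not the project's `_infer_condition_from_text`, whose internals are not shown in this commit; `infer_condition` and the toy 2-D vectors are hypothetical stand-ins for real sentence embeddings.

```python
import math
from typing import Dict, List, Optional


def cosine(a: List[float], b: List[float]) -> float:
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def infer_condition(query_vec: List[float],
                    condition_vecs: Dict[str, List[float]],
                    threshold: float) -> Optional[str]:
    """Return the most similar condition, or None if it falls below threshold."""
    best_name, best_sim = None, -1.0
    for name, vec in condition_vecs.items():
        sim = cosine(query_vec, vec)
        if sim > best_sim:
            best_name, best_sim = name, sim
    return best_name if best_sim >= threshold else None


# Toy vectors: the best match has similarity 0.6.
conditions = {"acute stroke": [1.0, 0.0], "pulmonary embolism": [-1.0, 0.0]}
query = [3.0, 4.0]
print(infer_condition(query, conditions, threshold=0.7))  # None
print(infer_condition(query, conditions, threshold=0.5))  # acute stroke
```

Lowering the threshold trades precision for recall, so the non-medical queries should be re-run after any change to confirm they still fail to match.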
+
+ ### **Priority 3: Improve validation of Level 2 LLM extraction**
+ Make sure `validate_condition` correctly handles complex LLM responses
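In the log, `level2_002` yields the extracted condition "Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (...)", which is correct in substance but does not equal any predefined key. One possible approach, sketched here under assumptions, is to scan the verbose text for known aliases and map the earliest hit to a canonical condition. The alias table and `normalize_condition` are hypothetical; the project's real mapping is not shown in this commit.

```python
from typing import Dict, List, Optional

# Hypothetical alias table; canonical names follow the predefined conditions in the log.
CONDITION_ALIASES: Dict[str, List[str]] = {
    "acute stroke": ["acute stroke", "acute ischemic stroke", "cerebrovascular accident"],
    "acute myocardial infarction": ["acute myocardial infarction"],
    "pulmonary embolism": ["pulmonary embolism"],
}


def normalize_condition(llm_text: str) -> Optional[str]:
    """Map verbose LLM output to a canonical condition via the earliest alias hit."""
    text = llm_text.lower()
    best = None  # (position of match, canonical name)
    for canonical, aliases in CONDITION_ALIASES.items():
        for alias in aliases:
            pos = text.find(alias)
            if pos != -1 and (best is None or pos < best[0]):
                best = (pos, canonical)
    return best[1] if best else None


logged = ("Cerebrovascular Accident (CVA), or Acute Ischemic Stroke "
          "(specifically, with right hemiparesis and aphasia)")
print(normalize_condition(logged))  # acute stroke
```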
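+
+ ## 🎯 **Overall Assessment**
+
+ ### **Speed**: ⭐⭐⭐⭐⭐
+ - Level 1: instant response (0.000s)
+ - Average: 14.4s (mostly due to LLM calls)
+
+ ### **Accuracy**: ⭐⭐⭐
+ - Predefined conditions: 100% accurate
+ - Uncommon medical queries: 100% accurate
+ - Non-medical rejection: 0% accurate ← **needs immediate fix**
+
+ Would you like me to fix the `validate_medical_query` logic first? It is the most critical problem; once it is resolved, the overall success rate should rise to 80%+.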
src/user_prompt.py CHANGED
@@ -61,6 +61,7 @@ class UserPromptProcessor:
         Returns:
             Dict with condition and keywords
         """
+
         # Level 1: Predefined Mapping (Fast Path)
         predefined_result = self._predefined_mapping(user_query)
         if predefined_result:
@@ -77,12 +78,19 @@ class UserPromptProcessor:
         if semantic_result:
             return semantic_result
 
-        # Level 4: Generic Medical Search
+        # Level 4: Medical Query Validation
+        # Only validate if previous levels failed - speed optimization
+        validation_result = self.validate_medical_query(user_query)
+        if validation_result:  # If validation fails (returns non-None)
+            return validation_result
+
+        # Level 5: Generic Medical Search (after validation passes)
         generic_result = self._generic_medical_search(user_query)
         if generic_result:
             return generic_result
 
         # No match found
+
         return {
             'condition': '',
             'emergency_keywords': '',
@@ -230,8 +238,7 @@ class UserPromptProcessor:
                     'generic_confidence': 0.5
                 }
 
-            return None
-
+            return None
         except Exception as e:
             logger.error(f"Generic medical search error: {e}")
             return None
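The control flow introduced by this change can be summarized in a standalone sketch: extraction levels run first, validation short-circuits with a rejection, and generic search runs only if validation returns `None`. The `run_fallback` function and the stub callables are hypothetical and exist only to illustrate the ordering; they are not part of `UserPromptProcessor`.

```python
from typing import Any, Callable, Dict, List, Optional

Result = Optional[Dict[str, Any]]


def run_fallback(query: str,
                 extractors: List[Callable[[str], Result]],
                 validate: Callable[[str], Result],
                 generic_search: Callable[[str], Result]) -> Dict[str, Any]:
    """Try each extraction level in order, then validate, then search generically."""
    for extract in extractors:  # Levels 1-3
        result = extract(query)
        if result:
            return result
    rejection = validate(query)  # Level 4: None means the query passed
    if rejection:
        return rejection
    result = generic_search(query)  # Level 5
    if result:
        return result
    # No match found (only the keys visible in the diff are reproduced here).
    return {'condition': '', 'emergency_keywords': ''}


# Stub levels for illustration only.
def predefined(q: str) -> Result:
    return {'condition': 'acute stroke'} if 'stroke' in q else None


def validate(q: str) -> Result:
    return {'type': 'invalid_query'} if 'pasta' in q else None


def generic(q: str) -> Result:
    return {'condition': 'generic medical query'}


print(run_fallback('how to cook pasta properly?', [predefined], validate, generic))
```

Because validation sits after the extraction levels, a query matched at Level 1 never pays for the extra LLM call, which matches the "speed optimization" comment in the diff.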
tests/result_of_test_multlevel_allback_validation.md ADDED
@@ -0,0 +1,570 @@
+ 🏥 OnCall.ai Multilevel Fallback Validation Test
+ ============================================================
+ 🔧 Initializing Components for Multilevel Fallback Test...
+ ------------------------------------------------------------
+ 1. Initializing Llama3-Med42-70B Client...
+ 2025-07-31 07:12:17,625 - llm_clients - INFO - Medical LLM client initialized with model: m42-health/Llama3-Med42-70B
+ 2025-07-31 07:12:17,626 - llm_clients - WARNING - Medical LLM Model: Research tool only. Not for professional medical diagnosis.
+ ✅ LLM client initialized
+ 2. Initializing Retrieval System...
+ 2025-07-31 07:12:17,626 - retrieval - INFO - Initializing retrieval system...
+ 2025-07-31 07:12:17,637 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
+ 2025-07-31 07:12:17,637 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: NeuML/pubmedbert-base-embeddings
+ 2025-07-31 07:12:20,936 - retrieval - INFO - Embedding model loaded successfully
+ 2025-07-31 07:12:22,314 - retrieval - INFO - Chunks loaded successfully
+ 2025-07-31 07:12:22,418 - retrieval - INFO - Embeddings loaded successfully
+ 2025-07-31 07:12:22,419 - retrieval - INFO - Loaded existing emergency index
+ 2025-07-31 07:12:22,420 - retrieval - INFO - Loaded existing treatment index
+ 2025-07-31 07:12:22,420 - retrieval - INFO - Retrieval system initialized successfully
+ ✅ Retrieval system initialized
+ 3. Initializing User Prompt Processor...
+ 2025-07-31 07:12:22,420 - sentence_transformers.SentenceTransformer - INFO - Use pytorch device_name: mps
+ 2025-07-31 07:12:22,420 - sentence_transformers.SentenceTransformer - INFO - Load pretrained SentenceTransformer: NeuML/pubmedbert-base-embeddings
+ 2025-07-31 07:12:24,622 - user_prompt - INFO - UserPromptProcessor initialized
+ ✅ User prompt processor initialized
+
+ 🎉 All components initialized successfully!
+
+ 🚀 Starting Multilevel Fallback Test Suite
+ Total test cases: 13
+ Test started at: 2025-07-31 07:12:17
+ ================================================================================
+
+ 🔍 level1_001: Level 1: Direct predefined condition match
+ Query: 'acute myocardial infarction treatment'
+ Expected Level: 1
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:12:24,623 - user_prompt - INFO - Matched predefined condition: acute myocardial infarction
+ ✅ Detected Level: 1
+ Condition: acute myocardial infarction
+ Emergency Keywords: MI|chest pain|cardiac arrest
+ Treatment Keywords: aspirin|nitroglycerin|thrombolytic|PCI
+ Execution Time: 0.000s
+ 🎉 Test PASSED - Expected behavior achieved
+
+ 🔍 level1_002: Level 1: Predefined stroke condition
+ Query: 'how to manage acute stroke?'
+ Expected Level: 1
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:12:24,623 - user_prompt - INFO - Matched predefined condition: acute stroke
+ ✅ Detected Level: 1
+ Condition: acute stroke
+ Emergency Keywords: stroke|neurological deficit|sudden weakness
+ Treatment Keywords: tPA|thrombolysis|stroke unit care
+ Execution Time: 0.000s
+ 🎉 Test PASSED - Expected behavior achieved
+
+ 🔍 level1_003: Level 1: Predefined PE condition
+ Query: 'pulmonary embolism emergency protocol'
+ Expected Level: 1
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:12:24,623 - user_prompt - INFO - Matched predefined condition: pulmonary embolism
+ ✅ Detected Level: 1
+ Condition: pulmonary embolism
+ Emergency Keywords: chest pain|shortness of breath|sudden dyspnea
+ Treatment Keywords: anticoagulation|heparin|embolectomy
+ Execution Time: 0.000s
+ 🎉 Test PASSED - Expected behavior achieved
+
+ 🔍 level2_001: Level 2: Symptom-based query requiring LLM analysis
+ Query: 'patient with severe crushing chest pain radiating to left arm'
+ Expected Level: 2
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:12:24,623 - llm_clients - INFO - Calling Medical LLM with query: patient with severe crushing chest pain radiating to left arm
+ 2025-07-31 07:12:47,629 - llm_clients - INFO - Raw LLM Response: Acute Myocardial Infarction (STEMI) - considering "severe crushing chest pain" and radiation to the left arm, which are classic symptoms of a heart attack specifically involving ST-elevation (STEMI type), indicating complete blockage of a coronary artery. However, please note that as an AI assistant, I don't diagnose; this interpretation is based on common clinical presentation. A healthcare provider should perform an ECG and other tests for confirmation.
+ 2025-07-31 07:12:47,630 - llm_clients - INFO - Query Latency: 23.0064 seconds
+ 2025-07-31 07:12:47,630 - llm_clients - INFO - Extracted Condition: acute myocardial infarction
+ ✅ Detected Level: 1
+ Condition: acute myocardial infarction
+ Emergency Keywords: MI|chest pain|cardiac arrest
+ Treatment Keywords: aspirin|nitroglycerin|thrombolytic|PCI
+ Execution Time: 23.008s
+ 🎉 Test PASSED - Expected behavior achieved
+
+ 🔍 level2_002: Level 2: Neurological symptoms requiring LLM
+ Query: 'sudden onset weakness on right side with speech difficulty'
+ Expected Level: 2
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:12:47,631 - llm_clients - INFO - Calling Medical LLM with query: sudden onset weakness on right side with speech difficulty
+ 2025-07-31 07:12:56,760 - llm_clients - INFO - Raw LLM Response: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+
+ - This diagnosis represents the most likely condition given the sudden onset of right-sided weakness (hemiparesis) and speech difficulty (aphasia). An ischemic stroke occurs when blood flow to a part of the brain is blocked, typically by a thrombus or embolus, causing damage to brain tissue and resulting in neurological deficits. Immediate medical
+ 2025-07-31 07:12:56,760 - llm_clients - INFO - Query Latency: 9.1288 seconds
+ 2025-07-31 07:12:56,760 - llm_clients - INFO - Extracted Condition: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+ 2025-07-31 07:12:56,760 - user_prompt - INFO - Starting semantic search fallback for query: 'sudden onset weakness on right side with speech difficulty'
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 1.66it/s]
+ 2025-07-31 07:12:58,013 - retrieval - INFO - Sliding window search: Found 5 results
+ 2025-07-31 07:12:58,023 - user_prompt - INFO - Semantic search returned 5 results
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.88it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.77it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.88it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.68it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 17.51it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.08it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.75it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.98it/s]
+ 2025-07-31 07:12:58,342 - user_prompt - INFO - Inferred condition: None
+ 2025-07-31 07:12:58,342 - user_prompt - WARNING - Condition validation failed for: None
+ 2025-07-31 07:12:58,342 - user_prompt - INFO - No suitable condition found in semantic search
+ 2025-07-31 07:12:58,342 - llm_clients - INFO - Calling Medical LLM with query: sudden onset weakness on right side with speech difficulty
+ 2025-07-31 07:13:09,255 - llm_clients - INFO - Raw LLM Response: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+
+ - This diagnosis represents the most likely condition given the sudden onset of right-sided weakness (hemiparesis) and speech difficulty (aphasia), which are classic symptoms of an ischemic stroke affecting the dominant hemisphere (assuming the patient is right-handed).
+
+ Please note that only a qualified physician can confirm a diagnosis after a thorough evaluation, including imaging studies
+ 2025-07-31 07:13:09,255 - llm_clients - INFO - Query Latency: 10.9129 seconds
+ 2025-07-31 07:13:09,255 - llm_clients - INFO - Extracted Condition: Cerebrovascular Accident (CVA), or Acute Ischemic Stroke (specifically, with right hemiparesis and aphasia)
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 8.55it/s]
+ 2025-07-31 07:13:09,844 - retrieval - INFO - Sliding window search: Found 5 results
+ ✅ Detected Level: 5
+ Condition: generic medical query
+ Emergency Keywords: medical|emergency
+ Treatment Keywords: treatment|management
+ Execution Time: 22.223s
+ ⚠️ Test PARTIAL - ⚠️ Level 5 != expected 2. ⚠️ Condition 'generic medical query' != expected ['acute stroke', 'cerebrovascular accident'].
+
+ 🔍 level3_001: Level 3: Generic medical terms requiring semantic search
+ Query: 'emergency management of cardiovascular crisis'
+ Expected Level: 3
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:13:09,854 - llm_clients - INFO - Calling Medical LLM with query: emergency management of cardiovascular crisis
+ 2025-07-31 07:13:20,094 - llm_clients - INFO - Raw LLM Response: Cardiac Arrest (or, in context of crisis not yet arrest: Acute Cardiogenic Emergency, e.g., STEMI)
+
+ - Note: As a text-based AI assistant, not a clinician, I don't provide medical advice. The term given here represents the most critical cardiovascular crisis requiring immediate emergency intervention. Cardiac arrest implies the heart has stopped pumping, while acute cardiogenic emergency (e.g., ST-elevation myocardial infarction, or STEMI) signifies severe heart
+ 2025-07-31 07:13:20,095 - llm_clients - INFO - Query Latency: 10.2402 seconds
+ 2025-07-31 07:13:20,095 - llm_clients - INFO - Extracted Condition: Cardiac Arrest (or, in context of crisis not yet arrest: Acute Cardiogenic Emergency, e.g., STEMI)
+ 2025-07-31 07:13:20,095 - user_prompt - INFO - Starting semantic search fallback for query: 'emergency management of cardiovascular crisis'
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15.11it/s]
+ 2025-07-31 07:13:20,681 - retrieval - INFO - Sliding window search: Found 5 results
+ 2025-07-31 07:13:20,713 - user_prompt - INFO - Semantic search returned 5 results
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 14.75it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.28it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 56.29it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.79it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.12it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.44it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.88it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 62.20it/s]
+ 2025-07-31 07:13:20,905 - user_prompt - INFO - Inferred condition: None
+ 2025-07-31 07:13:20,905 - user_prompt - WARNING - Condition validation failed for: None
+ 2025-07-31 07:13:20,905 - user_prompt - INFO - No suitable condition found in semantic search
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 15.96it/s]
+ 2025-07-31 07:13:21,492 - retrieval - INFO - Sliding window search: Found 5 results
+ ✅ Detected Level: 5
+ Condition: generic medical query
+ Emergency Keywords: medical|emergency
+ Treatment Keywords: treatment|management
+ Execution Time: 11.647s
+ ⚠️ Test PARTIAL - ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
+
+ 🔍 level3_002: Level 3: Medical terminology requiring semantic fallback
+ Query: 'urgent neurological intervention protocols'
+ Expected Level: 3
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:13:21,501 - llm_clients - INFO - Calling Medical LLM with query: urgent neurological intervention protocols
+ 2025-07-31 07:13:30,536 - llm_clients - INFO - Raw LLM Response: The most representative condition: Acute Ischemic Stroke (requiring urgent neurointervention, such as thrombectomy)
+
+ Explanation: The phrase "urgent neurological intervention protocols" typically refers to time-critical situations in neurology, and among these, acute ischemic stroke is a prime example. Acute ischemic stroke necessitates rapid evaluation and intervention, including thrombectomy, to restore blood flow and minimize brain damage. This condition demands urgent action due to its narrow therapeutic window, typically within
+ 2025-07-31 07:13:30,537 - llm_clients - INFO - Query Latency: 9.0352 seconds
+ 2025-07-31 07:13:30,537 - llm_clients - INFO - Extracted Condition: The most representative condition: Acute Ischemic Stroke (requiring urgent neurointervention, such as thrombectomy)
+ 2025-07-31 07:13:30,537 - user_prompt - INFO - Starting semantic search fallback for query: 'urgent neurological intervention protocols'
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 7.94it/s]
+ 2025-07-31 07:13:31,115 - retrieval - INFO - Sliding window search: Found 5 results
+ 2025-07-31 07:13:31,123 - user_prompt - INFO - Semantic search returned 5 results
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.96it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 46.55it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 47.09it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 58.23it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.16it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.05it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.42it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 63.08it/s]
+ 2025-07-31 07:13:31,334 - user_prompt - INFO - Inferred condition: None
+ 2025-07-31 07:13:31,334 - user_prompt - WARNING - Condition validation failed for: None
+ 2025-07-31 07:13:31,334 - user_prompt - INFO - No suitable condition found in semantic search
+ Batches: 100%|██████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 16.31it/s]
+ 2025-07-31 07:13:31,889 - retrieval - INFO - Sliding window search: Found 5 results
+ ✅ Detected Level: 5
+ Condition: generic medical query
+ Emergency Keywords: medical|emergency
+ Treatment Keywords: treatment|management
+ Execution Time: 10.398s
+ ⚠️ Test PARTIAL - ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
+
+ 🔍 level4a_001: Level 4a: Non-medical query should be rejected
+ Query: 'how to cook pasta properly?'
+ Expected Level: 4
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:13:31,899 - llm_clients - INFO - Calling Medical LLM with query: how to cook pasta properly?
+ 2025-07-31 07:13:41,038 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address cooking techniques, only medical conditions. However, for context (not advice): This query doesn't represent a medical condition; it's about culinary practice. In this case, "properly" cooking pasta typically means achieving al dente texture (not overly soft) by boiling in adequately salted water for the recommended time on the package, then draining well. This is unrelated to any health condition unless discussing, hypothetically, gastrointestinal tolerance in specific patients (e
+ 2025-07-31 07:13:41,038 - llm_clients - INFO - Query Latency: 9.1386 seconds
+ 2025-07-31 07:13:41,038 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address cooking techniques, only medical conditions. However, for context (not advice): This query doesn't represent a medical condition; it's about culinary practice. In this case, "properly" cooking pasta typically means achieving al dente texture (not overly soft) by boiling in adequately salted water for the recommended time on the package, then draining well. This is unrelated to any health condition unless discussing, hypothetically, gastrointestinal tolerance in specific patients (e
+ 2025-07-31 07:13:41,038 - user_prompt - INFO - Starting semantic search fallback for query: 'how to cook pasta properly?'
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.02it/s]
+ 2025-07-31 07:13:42,156 - retrieval - INFO - Sliding window search: Found 5 results
+ 2025-07-31 07:13:42,165 - user_prompt - INFO - Semantic search returned 5 results
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 9.34it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 52.88it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.97it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.95it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.63it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 59.07it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.84it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.43it/s]
+ 2025-07-31 07:13:42,407 - user_prompt - INFO - Inferred condition: None
+ 2025-07-31 07:13:42,407 - user_prompt - WARNING - Condition validation failed for: None
+ 2025-07-31 07:13:42,407 - user_prompt - INFO - No suitable condition found in semantic search
+ 2025-07-31 07:13:42,407 - llm_clients - INFO - Calling Medical LLM with query: how to cook pasta properly?
+ 2025-07-31 07:13:51,634 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I don't address cooking techniques, but for context (not medical advice): In terms of relevance to health, the key aspect here isn't "proper" cooking per se, but rather avoiding overcooking that can reduce nutrient content. For whole-grain pasta, aim for al dente texture (firm, not mushy) to preserve fiber and other nutrients. However, this query is not about a medical condition but a culinary practice.
+ Representative (non-medical) term
+ 2025-07-31 07:13:51,634 - llm_clients - INFO - Query Latency: 9.2269 seconds
+ 2025-07-31 07:13:51,634 - llm_clients - INFO - Extracted Condition: As a medical assistant, I don't address cooking techniques, but for context (not medical advice): In terms of relevance to health, the key aspect here isn't "proper" cooking per se, but rather avoiding overcooking that can reduce nutrient content. For whole-grain pasta, aim for al dente texture (firm, not mushy) to preserve fiber and other nutrients. However, this query is not about a medical condition but a culinary practice.
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.02it/s]
+ 2025-07-31 07:13:52,790 - retrieval - INFO - Sliding window search: Found 5 results
+ ✅ Detected Level: 5
+ Condition: generic medical query
+ Emergency Keywords: medical|emergency
+ Treatment Keywords: treatment|management
+ Execution Time: 20.900s
+ ⚠️ Test PARTIAL - ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
+
+ 🔍 level4a_002: Level 4a: Technology query should be rejected
+ Query: 'best programming language to learn in 2025'
+ Expected Level: 4
+ ----------------------------------------------------------------------
+ 🎯 Executing multilevel fallback...
+ 2025-07-31 07:13:52,799 - llm_clients - INFO - Calling Medical LLM with query: best programming language to learn in 2025
+ 2025-07-31 07:14:02,339 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address technology or education preferences like "best programming language" (which is non-medical context); however, for clarity, this query is outside my biomedical scope. In 2025 or any current year, the choice of "best" programming language is subjective and depends on industry trends, personal goals, and specific use cases (e.g., web development, mobile apps, or data science). Popular choices include Python, JavaScript, and Java, but it's crucial
+ 2025-07-31 07:14:02,339 - llm_clients - INFO - Query Latency: 9.5400 seconds
+ 2025-07-31 07:14:02,339 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address technology or education preferences like "best programming language" (which is non-medical context); however, for clarity, this query is outside my biomedical scope. In 2025 or any current year, the choice of "best" programming language is subjective and depends on industry trends, personal goals, and specific use cases (e.g., web development, mobile apps, or data science). Popular choices include Python, JavaScript, and Java, but it's crucial
+ 2025-07-31 07:14:02,339 - user_prompt - INFO - Starting semantic search fallback for query: 'best programming language to learn in 2025'
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 6.45it/s]
+ 2025-07-31 07:14:02,974 - retrieval - INFO - Sliding window search: Found 5 results
+ 2025-07-31 07:14:02,986 - user_prompt - INFO - Semantic search returned 5 results
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 3.16it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.42it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.06it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.97it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 40.41it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.85it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.99it/s]
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.63it/s]
+ 2025-07-31 07:14:03,457 - user_prompt - INFO - Inferred condition: None
+ 2025-07-31 07:14:03,457 - user_prompt - WARNING - Condition validation failed for: None
+ 2025-07-31 07:14:03,457 - user_prompt - INFO - No suitable condition found in semantic search
+ 2025-07-31 07:14:03,457 - llm_clients - INFO - Calling Medical LLM with query: best programming language to learn in 2025
+ 2025-07-31 07:14:13,766 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I don't analyze technology trends or recommend programming languages; however, for clarity's sake (though out of my medical scope), in 2021 (not 2025's future prediction, as I'm bound by current data), popular choices for learning include Python, JavaScript, and Java due to their versatility, wide adoption, and job market demand. Keep in mind this information is not medical advice but rather a layman's interpretation of tech trends.
+
+ Representative Condition (not
+ 2025-07-31 07:14:13,766 - llm_clients - INFO - Query Latency: 10.3088 seconds
+ 2025-07-31 07:14:13,767 - llm_clients - INFO - Extracted Condition: As a medical assistant, I don't analyze technology trends or recommend programming languages; however, for clarity's sake (though out of my medical scope), in 2021 (not 2025's future prediction, as I'm bound by current data), popular choices for learning include Python, JavaScript, and Java due to their versatility, wide adoption, and job market demand. Keep in mind this information is not medical advice but rather a layman's interpretation of tech trends.
269
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.00it/s]
270
+ 2025-07-31 07:14:14,884 - retrieval - INFO - Sliding window search: Found 5 results
271
+ ✅ Detected Level: 5
272
+ Condition: generic medical query
273
+ Emergency Keywords: medical|emergency
274
+ Treatment Keywords: treatment|management
275
+ Execution Time: 22.107s
276
+ ⚠️ Test PARTIAL - ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
277
+
278
+ 🔍 level4a_003: Level 4a: Weather query should be rejected
279
+ Query: 'weather forecast for tomorrow'
280
+ Expected Level: 4
281
+ ----------------------------------------------------------------------
282
+ 🎯 Executing multilevel fallback...
283
+ 2025-07-31 07:14:14,905 - llm_clients - INFO - Calling Medical LLM with query: weather forecast for tomorrow
284
+ 2025-07-31 07:14:24,069 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address weather forecasts; however, for context clarification, this query is unrelated to medical conditions. The requested information here is about meteorology (weather prediction) rather than health or disease. There's no representative medical condition to provide in this case.
285
+
286
+ If, however, you were referring indirectly to weather-sensitive health conditions (e.g., heat exhaustion, cold-induced asthma exacerbation), the specific condition would depend on the actual weather forecast details (temperature, humidity, etc.)
287
+ 2025-07-31 07:14:24,069 - llm_clients - INFO - Query Latency: 9.1634 seconds
288
+ 2025-07-31 07:14:24,069 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address weather forecasts; however, for context clarification, this query is unrelated to medical conditions. The requested information here is about meteorology (weather prediction) rather than health or disease. There's no representative medical condition to provide in this case.
289
+ 2025-07-31 07:14:24,070 - user_prompt - INFO - Starting semantic search fallback for query: 'weather forecast for tomorrow'
290
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.17it/s]
291
+ 2025-07-31 07:14:25,222 - retrieval - INFO - Sliding window search: Found 5 results
292
+ 2025-07-31 07:14:25,234 - user_prompt - INFO - Semantic search returned 5 results
293
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.71it/s]
294
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.65it/s]
295
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 55.87it/s]
296
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 50.21it/s]
297
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 21.32it/s]
298
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.77it/s]
299
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 53.42it/s]
300
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 56.34it/s]
301
+ 2025-07-31 07:14:25,491 - user_prompt - INFO - Inferred condition: None
302
+ 2025-07-31 07:14:25,491 - user_prompt - WARNING - Condition validation failed for: None
303
+ 2025-07-31 07:14:25,491 - user_prompt - INFO - No suitable condition found in semantic search
304
+ 2025-07-31 07:14:25,491 - llm_clients - INFO - Calling Medical LLM with query: weather forecast for tomorrow
305
+ 2025-07-31 07:14:35,356 - llm_clients - INFO - Raw LLM Response: As a medical assistant, I do not address weather forecasts; however, for this context (to maintain representativeness in terms unrelated to diagnosis), the phrase here isn't indicative of a medical condition. Instead, it's about environmental information—specifically, a request for meteorological data (tomorrow's weather). In medical terminology, we wouldn't classify this as a condition, but for representation's sake in a non-medical context, it can be labeled as "meteorological inquiry" or simply
306
+ 2025-07-31 07:14:35,356 - llm_clients - INFO - Query Latency: 9.8645 seconds
307
+ 2025-07-31 07:14:35,356 - llm_clients - INFO - Extracted Condition: As a medical assistant, I do not address weather forecasts; however, for this context (to maintain representativeness in terms unrelated to diagnosis), the phrase here isn't indicative of a medical condition. Instead, it's about environmental information—specifically, a request for meteorological data (tomorrow's weather). In medical terminology, we wouldn't classify this as a condition, but for representation's sake in a non-medical context, it can be labeled as "meteorological inquiry" or simply
308
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.19it/s]
309
+ 2025-07-31 07:14:36,024 - retrieval - INFO - Sliding window search: Found 5 results
310
+ ✅ Detected Level: 5
311
+ Condition: generic medical query
312
+ Emergency Keywords: medical|emergency
313
+ Treatment Keywords: treatment|management
314
+ Execution Time: 21.128s
315
+ ⚠️ Test PARTIAL - ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
316
+
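The three Level 4a cases above all end at Level 5, even though the logged LLM responses say outright that the query is not medical. One way to catch this is to scan the raw response for an explicit out-of-scope statement before accepting it. The following is a minimal sketch under stated assumptions: `looks_non_medical` and `NON_MEDICAL_PATTERNS` are hypothetical names, not part of the OnCall.ai code, and the patterns are simply phrases copied from the responses logged in this run.

```python
import re

# Hypothetical phrase list, taken from the LLM responses logged in this run.
# It is an assumption for illustration, not the project's validation logic.
NON_MEDICAL_PATTERNS = [
    r"unrelated to medical",
    r"no representative medical condition",
    r"(?:isn't|is not) indicative of a medical condition",
    r"out of my medical scope",
]
_NON_MEDICAL_RE = re.compile("|".join(NON_MEDICAL_PATTERNS), re.IGNORECASE)


def looks_non_medical(llm_response: str) -> bool:
    """Return True when the response explicitly says the query is not medical."""
    return bool(_NON_MEDICAL_RE.search(llm_response or ""))
```

Phrase matching of this kind is brittle, since the model can word a refusal in many ways. Asking the model for a fixed one-word verdict and parsing that would likely be sturdier, but that design is outside what this log shows.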
317
+ 🔍 level4b_001: Level 4b→5: Obscure medical query passing validation to generic search
318
+ Query: 'rare hematologic malignancy treatment approaches'
319
+ Expected Level: 5
320
+ ----------------------------------------------------------------------
321
+ 🎯 Executing multilevel fallback...
322
+ 2025-07-31 07:14:36,033 - llm_clients - INFO - Calling Medical LLM with query: rare hematologic malignancy treatment approaches
323
+ 2025-07-31 07:14:45,301 - llm_clients - INFO - Raw LLM Response: The most representative condition: Myelofibrosis (or, in context of "rare" reference, could be an even less common variant like BCR-ABL1-negative atypical CML or unclassifiable myeloproliferative neoplasm)
324
+
325
+ - For myelofibrosis, primary treatment approaches include JAK2 inhibitors (e.g., ruxolitinib), supportive care (transfusions, erythropoiesis-stimulating agents), and allog
326
+ 2025-07-31 07:14:45,302 - llm_clients - INFO - Query Latency: 9.2678 seconds
327
+ 2025-07-31 07:14:45,302 - llm_clients - INFO - Extracted Condition: The most representative condition: Myelofibrosis (or, in context of "rare" reference, could be an even less common variant like BCR-ABL1-negative atypical CML or unclassifiable myeloproliferative neoplasm)
328
+ 2025-07-31 07:14:45,302 - user_prompt - INFO - Starting semantic search fallback for query: 'rare hematologic malignancy treatment approaches'
329
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 2.00it/s]
330
+ 2025-07-31 07:14:46,428 - retrieval - INFO - Sliding window search: Found 5 results
331
+ 2025-07-31 07:14:46,436 - user_prompt - INFO - Semantic search returned 5 results
332
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.59it/s]
333
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 38.61it/s]
334
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.66it/s]
335
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.40it/s]
336
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 41.09it/s]
337
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.42it/s]
338
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 58.98it/s]
339
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 66.70it/s]
340
+ 2025-07-31 07:14:46,672 - user_prompt - INFO - Inferred condition: None
341
+ 2025-07-31 07:14:46,672 - user_prompt - WARNING - Condition validation failed for: None
342
+ 2025-07-31 07:14:46,672 - user_prompt - INFO - No suitable condition found in semantic search
343
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 54.28it/s]
344
+ 2025-07-31 07:14:47,160 - retrieval - INFO - Sliding window search: Found 5 results
345
+ ✅ Detected Level: 5
346
+ Condition: generic medical query
347
+ Emergency Keywords: medical|emergency
348
+ Treatment Keywords: treatment|management
349
+ Execution Time: 11.137s
350
+ 🎉 Test PASSED - Expected behavior achieved
351
+
352
+ 🔍 level4b_002: Level 4b→5: Rare condition requiring generic medical search
353
+ Query: 'idiopathic thrombocytopenic purpura management guidelines'
354
+ Expected Level: 5
355
+ ----------------------------------------------------------------------
356
+ 🎯 Executing multilevel fallback...
357
+ 2025-07-31 07:14:47,170 - llm_clients - INFO - Calling Medical LLM with query: idiopathic thrombocytopenic purpura management guidelines
358
+ 2025-07-31 07:14:56,483 - llm_clients - INFO - Raw LLM Response: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
359
+
360
+ (As a medical assistant, I do not provide advice, but here's the relevant condition with context for a knowledge reference.)
361
+ In this case, the most representative condition is Idiopathic Thrombocytopenic Purpura (ITP), an autoimmune disorder characterized by low platelet count (thrombocytopenia) without identifiable underlying causes. Management guidelines typically involve
362
+ 2025-07-31 07:14:56,484 - llm_clients - INFO - Query Latency: 9.3136 seconds
363
+ 2025-07-31 07:14:56,484 - llm_clients - INFO - Extracted Condition: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
364
+ 2025-07-31 07:14:56,484 - user_prompt - INFO - Starting semantic search fallback for query: 'idiopathic thrombocytopenic purpura management guidelines'
365
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.14it/s]
366
+ 2025-07-31 07:14:57,082 - retrieval - INFO - Sliding window search: Found 5 results
367
+ 2025-07-31 07:14:57,090 - user_prompt - INFO - Semantic search returned 5 results
368
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.83it/s]
369
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 51.94it/s]
370
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.06it/s]
371
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 65.59it/s]
372
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.81it/s]
373
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.78it/s]
374
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.76it/s]
375
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.14it/s]
376
+ 2025-07-31 07:14:57,296 - user_prompt - INFO - Inferred condition: None
377
+ 2025-07-31 07:14:57,296 - user_prompt - WARNING - Condition validation failed for: None
378
+ 2025-07-31 07:14:57,296 - user_prompt - INFO - No suitable condition found in semantic search
379
+ 2025-07-31 07:14:57,296 - llm_clients - INFO - Calling Medical LLM with query: idiopathic thrombocytopenic purpura management guidelines
380
+ 2025-07-31 07:15:06,621 - llm_clients - INFO - Raw LLM Response: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
381
+
382
+ (As a medical assistant, I don't provide advice, but describe the condition and point to standard guidelines. For ITP management, refer to professional sources like the American Society of Hematology [ASH] or National Institutes of Health [NIH].)
383
+
384
+ Idiopathic Thrombocytopenic Purpura (ITP) is an autoimmune disorder characterized by low platelet count
385
+ 2025-07-31 07:15:06,621 - llm_clients - INFO - Query Latency: 9.3245 seconds
386
+ 2025-07-31 07:15:06,621 - llm_clients - INFO - Extracted Condition: The primary medical condition: Idiopathic Thrombocytopenic Purpura (ITP)
387
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 11.12it/s]
388
+ 2025-07-31 07:15:07,391 - retrieval - INFO - Sliding window search: Found 5 results
389
+ ✅ Detected Level: 5
390
+ Condition: generic medical query
391
+ Emergency Keywords: medical|emergency
392
+ Treatment Keywords: treatment|management
393
+ Execution Time: 20.228s
394
+ 🎉 Test PASSED - Expected behavior achieved
395
+
396
+ 🔍 level4b_003: Level 4b→5: Rare emergency condition → generic search
397
+ Query: 'necrotizing fasciitis surgical intervention protocols'
398
+ Expected Level: 5
399
+ ----------------------------------------------------------------------
400
+ 🎯 Executing multilevel fallback...
401
+ 2025-07-31 07:15:07,398 - llm_clients - INFO - Calling Medical LLM with query: necrotizing fasciitis surgical intervention protocols
402
+ 2025-07-31 07:15:16,625 - llm_clients - INFO - Raw LLM Response: The primary medical condition: Necrotizing Fasciitis
403
+
404
+ In this context, the key condition is Necrotizing Fasciitis, a severe bacterial infection characterized by rapid destruction of subcutaneous tissue and fascia. The term provided, "surgical intervention protocols," refers to the treatment approach rather than a distinct medical condition. However, for clarity in this answer, I'll address it as it pertains to managing Necrotizing Fasciitis.
405
+
406
+ In Necrotizing Fasciitis, surgical
407
+ 2025-07-31 07:15:16,625 - llm_clients - INFO - Query Latency: 9.2271 seconds
408
+ 2025-07-31 07:15:16,625 - llm_clients - INFO - Extracted Condition: The primary medical condition: Necrotizing Fasciitis
409
+ 2025-07-31 07:15:16,625 - user_prompt - INFO - Starting semantic search fallback for query: 'necrotizing fasciitis surgical intervention protocols'
410
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 10.01it/s]
411
+ 2025-07-31 07:15:17,212 - retrieval - INFO - Sliding window search: Found 5 results
412
+ 2025-07-31 07:15:17,222 - user_prompt - INFO - Semantic search returned 5 results
413
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 12.01it/s]
414
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 45.04it/s]
415
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 44.57it/s]
416
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 57.92it/s]
417
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 48.15it/s]
418
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.28it/s]
419
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 60.83it/s]
420
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 61.38it/s]
421
+ 2025-07-31 07:15:17,449 - user_prompt - INFO - Inferred condition: None
422
+ 2025-07-31 07:15:17,449 - user_prompt - WARNING - Condition validation failed for: None
423
+ 2025-07-31 07:15:17,449 - user_prompt - INFO - No suitable condition found in semantic search
424
+ 2025-07-31 07:15:17,449 - llm_clients - INFO - Calling Medical LLM with query: necrotizing fasciitis surgical intervention protocols
425
+ 2025-07-31 07:15:24,511 - llm_clients - INFO - Raw LLM Response: The most representative condition: Necrotizing Fasciitis
426
+
427
+ (As a medical assistant, I do not provide advice, only identify conditions. For necrotizing fasciitis, surgical intervention typically involves aggressive debridement—removing dead tissue—and may require repeated procedures until healthy margins are achieved. This is accompanied by supportive care and antibiotics.)
428
+
429
+
430
+ 2025-07-31 07:15:24,511 - llm_clients - INFO - Query Latency: 7.0619 seconds
431
+ 2025-07-31 07:15:24,511 - llm_clients - INFO - Extracted Condition: The most representative condition: Necrotizing Fasciitis
432
+ Batches: 100%|██████████████████████████████████████████████████████████████████| 1/1 [00:00<00:00, 13.83it/s]
433
+ 2025-07-31 07:15:25,078 - retrieval - INFO - Sliding window search: Found 5 results
434
+ ✅ Detected Level: 5
435
+ Condition: generic medical query
436
+ Emergency Keywords: medical|emergency
437
+ Treatment Keywords: treatment|management
438
+ Execution Time: 17.692s
439
+ 🎉 Test PASSED - Expected behavior achieved
440
+
441
+ ================================================================================
442
+ 📊 MULTILEVEL FALLBACK TEST REPORT
443
+ ================================================================================
444
+ 🕐 Execution Summary:
445
+ Total duration: 187.465s
446
+ Average per test: 14.420s
447
+
448
+ 📈 Test Results:
449
+ Total tests: 13
450
+ Passed: 7 ✅
451
+ Partial: 6 ⚠️
452
+ Failed: 6 ❌
453
+ Success rate: 53.8%
454
+
455
+ 🎯 Level Distribution Analysis:
456
+ Level 1 (Predefined Mapping): 4 tests, avg 5.752s
457
+ Level 5 (Generic Search): 9 tests, avg 17.495s
458
+
459
+ 📋 Category Analysis:
460
+ level1_predefined: 3/3 (100.0%)
461
+ level2_llm: 1/2 (50.0%)
462
+ level3_semantic: 0/2 (0.0%)
463
+ level4a_rejection: 0/3 (0.0%)
464
+ level4b_to_5: 3/3 (100.0%)
465
+
466
+ 📝 Detailed Test Results:
467
+
468
+ level1_001: ✅ PASS
469
+ Query: 'acute myocardial infarction treatment'
470
+ Expected Level: 1
471
+ Detected Level: 1
472
+ Condition: acute myocardial infarction
473
+ Time: 0.000s
474
+ Validation: ✅ Level 1 as expected. ✅ Condition 'acute myocardial infarction' matches expected.
475
+
476
+ level1_002: ✅ PASS
477
+ Query: 'how to manage acute stroke?'
478
+ Expected Level: 1
479
+ Detected Level: 1
480
+ Condition: acute stroke
481
+ Time: 0.000s
482
+ Validation: ✅ Level 1 as expected. ✅ Condition 'acute stroke' matches expected.
483
+
484
+ level1_003: ✅ PASS
485
+ Query: 'pulmonary embolism emergency protocol'
486
+ Expected Level: 1
487
+ Detected Level: 1
488
+ Condition: pulmonary embolism
489
+ Time: 0.000s
490
+ Validation: ✅ Level 1 as expected. ✅ Condition 'pulmonary embolism' matches expected.
491
+
492
+ level2_001: ✅ PASS
493
+ Query: 'patient with severe crushing chest pain radiating to left arm'
494
+ Expected Level: 2
495
+ Detected Level: 1
496
+ Condition: acute myocardial infarction
497
+ Time: 23.008s
498
+ Validation: ⚠️ Level 1 != expected 2. ✅ Condition 'acute myocardial infarction' matches expected.
499
+
500
+ level2_002: ⚠️ PARTIAL
501
+ Query: 'sudden onset weakness on right side with speech difficulty'
502
+ Expected Level: 2
503
+ Detected Level: 5
504
+ Condition: generic medical query
505
+ Time: 22.223s
506
+ Validation: ⚠️ Level 5 != expected 2. ⚠️ Condition 'generic medical query' != expected ['acute stroke', 'cerebrovascular accident'].
507
+
508
+ level3_001: ⚠️ PARTIAL
509
+ Query: 'emergency management of cardiovascular crisis'
510
+ Expected Level: 3
511
+ Detected Level: 5
512
+ Condition: generic medical query
513
+ Time: 11.647s
514
+ Validation: ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
515
+
516
+ level3_002: ⚠️ PARTIAL
517
+ Query: 'urgent neurological intervention protocols'
518
+ Expected Level: 3
519
+ Detected Level: 5
520
+ Condition: generic medical query
521
+ Time: 10.398s
522
+ Validation: ⚠️ Level 5 != expected 3. ⚠️ Condition 'generic medical query' != expected [].
523
+
524
+ level4a_001: ⚠️ PARTIAL
525
+ Query: 'how to cook pasta properly?'
526
+ Expected Level: 4
527
+ Detected Level: 5
528
+ Condition: generic medical query
529
+ Time: 20.900s
530
+ Validation: ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
531
+
532
+ level4a_002: ⚠️ PARTIAL
533
+ Query: 'best programming language to learn in 2025'
534
+ Expected Level: 4
535
+ Detected Level: 5
536
+ Condition: generic medical query
537
+ Time: 22.107s
538
+ Validation: ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
539
+
540
+ level4a_003: ⚠️ PARTIAL
541
+ Query: 'weather forecast for tomorrow'
542
+ Expected Level: 4
543
+ Detected Level: 5
544
+ Condition: generic medical query
545
+ Time: 21.128s
546
+ Validation: ⚠️ Level 5 != expected 4. ⚠️ Query should have been rejected.
547
+
548
+ level4b_001: ✅ PASS
549
+ Query: 'rare hematologic malignancy treatment approaches'
550
+ Expected Level: 5
551
+ Detected Level: 5
552
+ Condition: generic medical query
553
+ Time: 11.137s
554
+ Validation: ✅ Level 5 as expected. ✅ Generic medical search triggered.
555
+
556
+ level4b_002: ✅ PASS
557
+ Query: 'idiopathic thrombocytopenic purpura management guidelines'
558
+ Expected Level: 5
559
+ Detected Level: 5
560
+ Condition: generic medical query
561
+ Time: 20.228s
562
+ Validation: ✅ Level 5 as expected. ✅ Generic medical search triggered.
563
+
564
+ level4b_003: ✅ PASS
565
+ Query: 'necrotizing fasciitis surgical intervention protocols'
566
+ Expected Level: 5
567
+ Detected Level: 5
568
+ Condition: generic medical query
569
+ Time: 17.692s
570
+ Validation: ✅ Level 5 as expected. ✅ Generic medical search triggered.
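The summary above reports 7 passed, 6 partial and 6 failed out of 13 tests, so the partial and failed counts appear to overlap. A minimal sketch of a tally that places each result in exactly one bucket is shown below. It assumes the `success` and `error` keys that the test suite stores on each result dict; `classify_result` and `summarize` are hypothetical helper names, not functions from the suite.

```python
from collections import Counter
from typing import Any, Dict, List


def classify_result(result: Dict[str, Any]) -> str:
    """Place one test result in exactly one bucket."""
    if result.get("error") is not None:
        return "failed"   # the test raised an exception
    if result.get("success"):
        return "passed"   # expected behavior achieved
    return "partial"      # ran to completion, but not as expected


def summarize(results: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Count buckets so that passed + partial + failed == total."""
    counts = Counter(classify_result(r) for r in results)
    total = len(results)
    rate = round(100.0 * counts["passed"] / total, 1) if total else 0.0
    return {
        "total": total,
        "passed": counts["passed"],
        "partial": counts["partial"],
        "failed": counts["failed"],
        "success_rate": rate,
    }
```

With the 13 results listed above (7 successes, 6 non-successes without an exception), this tally would give 7 passed, 6 partial and 0 failed, which matches the per-test lines in the detailed results.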
tests/{result_of_test_userinput_userprompt_medical_condition_llm.txt → result_of_test_userinput_userprompt_medical_condition_llm.md} RENAMED
File without changes
tests/test_multilevel_fallback_validation.py ADDED
@@ -0,0 +1,537 @@
+ #!/usr/bin/env python3
+ """
+ Multi-Level Fallback Validation Test Suite for OnCall.ai
+
+ This test specifically validates the 5-level fallback mechanism:
+ Level 1: Predefined Mapping (Fast Path)
+ Level 2: Llama3-Med42-70B Extraction
+ Level 3: Semantic Search Fallback
+ Level 4: Medical Query Validation
+ Level 5: Generic Medical Search
+
+ Author: OnCall.ai Team
+ Date: 2025-07-30
+ """
+
+ import sys
+ import os
+ from pathlib import Path
+ import logging
+ import json
+ import traceback
+ from datetime import datetime
+ from typing import Dict, List, Any, Optional
+
+ # Add src directory to Python path
+ current_dir = Path(__file__).parent
+ project_root = current_dir.parent
+ src_dir = project_root / "src"
+ sys.path.insert(0, str(src_dir))
+
+ # Import our modules
+ try:
+     from user_prompt import UserPromptProcessor
+     from retrieval import BasicRetrievalSystem
+     from llm_clients import llm_Med42_70BClient
+     from medical_conditions import CONDITION_KEYWORD_MAPPING
+ except ImportError as e:
+     print(f"❌ Import Error: {e}")
+     print(f"Current working directory: {os.getcwd()}")
+     print(f"Python path: {sys.path}")
+     sys.exit(1)
+
+ # Configure logging
+ logging.basicConfig(
+     level=logging.INFO,
+     format='%(asctime)s - %(name)s - %(levelname)s - %(message)s',
+     handlers=[
+         logging.StreamHandler(),
+         logging.FileHandler(project_root / 'tests' / 'multilevel_fallback_test.log')
+     ]
+ )
+ logger = logging.getLogger(__name__)
+
+ class MultilevelFallbackTest:
+     """Test suite specifically for the 5-level fallback mechanism"""
+
+     def __init__(self):
+         """Initialize test suite"""
+         self.start_time = datetime.now()
+         self.results = []
+         self.components_initialized = False
+
+         # Component references
+         self.llm_client = None
+         self.retrieval_system = None
+         self.user_prompt_processor = None
+
+     def initialize_components(self):
+         """Initialize all pipeline components"""
+         print("🔧 Initializing Components for Multilevel Fallback Test...")
+         print("-" * 60)
+
+         try:
+             # Initialize LLM client
+             print("1. Initializing Llama3-Med42-70B Client...")
+             self.llm_client = llm_Med42_70BClient()
+             print(" ✅ LLM client initialized")
+
+             # Initialize retrieval system
+             print("2. Initializing Retrieval System...")
+             self.retrieval_system = BasicRetrievalSystem()
+             print(" ✅ Retrieval system initialized")
+
+             # Initialize user prompt processor
+             print("3. Initializing User Prompt Processor...")
+             self.user_prompt_processor = UserPromptProcessor(
+                 llm_client=self.llm_client,
+                 retrieval_system=self.retrieval_system
+             )
+             print(" ✅ User prompt processor initialized")
+
+             self.components_initialized = True
+             print("\n🎉 All components initialized successfully!")
+
+         except Exception as e:
+             logger.error(f"Component initialization failed: {e}")
+             print(f"❌ Component initialization failed: {e}")
+             traceback.print_exc()
+             self.components_initialized = False
+
+     def get_multilevel_test_cases(self) -> List[Dict[str, Any]]:
+         """Define test cases specifically targeting each fallback level"""
+         return [
+             # Level 1: Predefined Mapping Tests
+             {
+                 "id": "level1_001",
+                 "query": "acute myocardial infarction treatment",
+                 "description": "Level 1: Direct predefined condition match",
+                 "expected_level": 1,
+                 "expected_condition": "acute myocardial infarction",
+                 "expected_source": "predefined_mapping",
+                 "category": "level1_predefined"
+             },
+             {
+                 "id": "level1_002",
+                 "query": "how to manage acute stroke?",
+                 "description": "Level 1: Predefined stroke condition",
+                 "expected_level": 1,
+                 "expected_condition": "acute stroke",
+                 "expected_source": "predefined_mapping",
+                 "category": "level1_predefined"
+             },
+             {
+                 "id": "level1_003",
+                 "query": "pulmonary embolism emergency protocol",
+                 "description": "Level 1: Predefined PE condition",
+                 "expected_level": 1,
+                 "expected_condition": "pulmonary embolism",
+                 "expected_source": "predefined_mapping",
+                 "category": "level1_predefined"
+             },
+
+             # Level 2: LLM Extraction Tests
+             {
+                 "id": "level2_001",
+                 "query": "patient with severe crushing chest pain radiating to left arm",
+                 "description": "Level 2: Symptom-based query requiring LLM analysis",
+                 "expected_level": 2,
+                 "expected_condition": ["acute myocardial infarction", "acute coronary syndrome"],
+                 "expected_source": "llm_extraction",
+                 "category": "level2_llm"
+             },
+             {
+                 "id": "level2_002",
+                 "query": "sudden onset weakness on right side with speech difficulty",
+                 "description": "Level 2: Neurological symptoms requiring LLM",
+                 "expected_level": 2,
+                 "expected_condition": ["acute stroke", "cerebrovascular accident"],
+                 "expected_source": "llm_extraction",
+                 "category": "level2_llm"
+             },
+
+             # Level 3: Semantic Search Tests
+             {
+                 "id": "level3_001",
+                 "query": "emergency management of cardiovascular crisis",
+                 "description": "Level 3: Generic medical terms requiring semantic search",
+                 "expected_level": 3,
+                 "expected_source": "semantic_search",
+                 "category": "level3_semantic"
+             },
+             {
+                 "id": "level3_002",
+                 "query": "urgent neurological intervention protocols",
+                 "description": "Level 3: Medical terminology requiring semantic fallback",
+                 "expected_level": 3,
+                 "expected_source": "semantic_search",
+                 "category": "level3_semantic"
+             },
+
+             # Level 4a: Non-Medical Query Rejection
+             {
+                 "id": "level4a_001",
+                 "query": "how to cook pasta properly?",
+                 "description": "Level 4a: Non-medical query should be rejected",
+                 "expected_level": 4,
+                 "expected_result": "invalid_query",
+                 "expected_source": "validation_rejection",
+                 "category": "level4a_rejection"
+             },
+             {
+                 "id": "level4a_002",
+                 "query": "best programming language to learn in 2025",
+                 "description": "Level 4a: Technology query should be rejected",
+                 "expected_level": 4,
+                 "expected_result": "invalid_query",
+                 "expected_source": "validation_rejection",
+                 "category": "level4a_rejection"
+             },
+             {
+                 "id": "level4a_003",
+                 "query": "weather forecast for tomorrow",
+                 "description": "Level 4a: Weather query should be rejected",
+                 "expected_level": 4,
+                 "expected_result": "invalid_query",
+                 "expected_source": "validation_rejection",
+                 "category": "level4a_rejection"
+             },
+
+             # Level 4b + 5: Obscure Medical Terms → Generic Search
+             {
+                 "id": "level4b_001",
+                 "query": "rare hematologic malignancy treatment approaches",
+                 "description": "Level 4b→5: Obscure medical query passing validation to generic search",
+                 "expected_level": 5,
+                 "expected_condition": "generic medical query",
+                 "expected_source": "generic_search",
+                 "category": "level4b_to_5"
+             },
+             {
+                 "id": "level4b_002",
+                 "query": "idiopathic thrombocytopenic purpura management guidelines",
+                 "description": "Level 4b→5: Rare condition requiring generic medical search",
+                 "expected_level": 5,
+                 "expected_condition": "generic medical query",
+                 "expected_source": "generic_search",
+                 "category": "level4b_to_5"
+             },
+             {
+                 "id": "level4b_003",
+                 "query": "necrotizing fasciitis surgical intervention protocols",
+                 "description": "Level 4b→5: Rare emergency condition → generic search",
+                 "expected_level": 5,
+                 "expected_condition": "generic medical query",
+                 "expected_source": "generic_search",
+                 "category": "level4b_to_5"
+             }
+         ]
+
+     def run_single_fallback_test(self, test_case: Dict[str, Any]) -> Dict[str, Any]:
+         """Execute a single fallback test case with level detection"""
+         test_id = test_case["id"]
+         query = test_case["query"]
+
+         print(f"\n🔍 {test_id}: {test_case['description']}")
+         print(f"Query: '{query}'")
+         print(f"Expected Level: {test_case.get('expected_level', 'Unknown')}")
+         print("-" * 70)
+
+         result = {
+             "test_id": test_id,
+             "test_case": test_case,
+             "timestamp": datetime.now().isoformat(),
+             "success": False,
+             "error": None,
+             "execution_time": 0,
+             "detected_level": None,
+             "condition_result": {}
+         }
+
+         start_time = datetime.now()
+
+         try:
+             # Execute condition extraction with level detection
+             print("🎯 Executing multilevel fallback...")
+             condition_start = datetime.now()
+
+             condition_result = self.user_prompt_processor.extract_condition_keywords(query)
+             condition_time = (datetime.now() - condition_start).total_seconds()
+
+             # Detect which level was used
+             detected_level = self._detect_fallback_level(condition_result)
+
+             result["condition_result"] = condition_result
+             result["detected_level"] = detected_level
+             result["execution_time"] = condition_time
+
+             print(f" ✅ Detected Level: {detected_level}")
+             print(f" Condition: {condition_result.get('condition', 'None')}")
+             print(f" Emergency Keywords: {condition_result.get('emergency_keywords', 'None')}")
+             print(f" Treatment Keywords: {condition_result.get('treatment_keywords', 'None')}")
+             print(f" Execution Time: {condition_time:.3f}s")
+
+             # Validate expected behavior
+             validation_result = self._validate_expected_behavior(test_case, detected_level, condition_result)
+             result.update(validation_result)
+
+             if result["success"]:
+                 print(" 🎉 Test PASSED - Expected behavior achieved")
+             else:
+                 print(f" ⚠️ Test PARTIAL - {result.get('validation_message', 'Unexpected behavior')}")
+
+         except Exception as e:
+             total_time = (datetime.now() - start_time).total_seconds()
+             result["execution_time"] = total_time
+             result["error"] = str(e)
+             result["traceback"] = traceback.format_exc()
+
+             logger.error(f"Test {test_id} failed: {e}")
+             print(f" ❌ Test FAILED: {e}")
+
+         return result
+
+    def _detect_fallback_level(self, condition_result: Dict[str, Any]) -> int:
+        """Detect which fallback level was used based on the result"""
+        if not condition_result:
+            return 0  # No result
+
+        # Check for validation rejection (Level 4a)
+        if condition_result.get('type') == 'invalid_query':
+            return 4
+
+        # Check for generic search (Level 5)
+        if condition_result.get('condition') == 'generic medical query':
+            return 5
+
+        # Check for semantic search (Level 3)
+        if 'semantic_confidence' in condition_result:
+            return 3
+
+        # Check for predefined mapping (Level 1)
+        condition = condition_result.get('condition', '')
+        if condition and condition in CONDITION_KEYWORD_MAPPING:
+            return 1
+
+        # Otherwise assume LLM extraction (Level 2)
+        if condition:
+            return 2
+
+        return 0  # Unknown
+
+    def _validate_expected_behavior(self, test_case: Dict[str, Any], detected_level: int,
+                                    condition_result: Dict[str, Any]) -> Dict[str, Any]:
+        """Validate if the test behaved as expected"""
+        expected_level = test_case.get('expected_level')
+        validation_result = {
+            "level_match": detected_level == expected_level,
+            "condition_match": False,
+            "success": False,
+            "validation_message": ""
+        }
+
+        # Check level match
+        if validation_result["level_match"]:
+            validation_result["validation_message"] += f"✅ Level {detected_level} as expected. "
+        else:
+            validation_result["validation_message"] += f"⚠️ Level {detected_level} != expected {expected_level}. "
+
+        # Check condition/result match based on test type
+        if test_case["category"] == "level4a_rejection":
+            # Should be rejected
+            validation_result["condition_match"] = condition_result.get('type') == 'invalid_query'
+            if validation_result["condition_match"]:
+                validation_result["validation_message"] += "✅ Query correctly rejected. "
+            else:
+                validation_result["validation_message"] += "⚠️ Query should have been rejected. "
+
+        elif test_case["category"] == "level4b_to_5":
+            # Should result in generic medical query
+            validation_result["condition_match"] = condition_result.get('condition') == 'generic medical query'
+            if validation_result["condition_match"]:
+                validation_result["validation_message"] += "✅ Generic medical search triggered. "
+            else:
+                validation_result["validation_message"] += "⚠️ Should trigger generic medical search. "
+
+        else:
+            # Check expected condition (a single string is treated as a one-item list)
+            expected_conditions = test_case.get('expected_condition', [])
+            if isinstance(expected_conditions, str):
+                expected_conditions = [expected_conditions]
+
+            # Fall back to '' so a missing or None condition cannot break .lower()
+            actual_condition = condition_result.get('condition') or ''
+            validation_result["condition_match"] = any(
+                expected.lower() in actual_condition.lower()
+                for expected in expected_conditions
+            )
+
+            if validation_result["condition_match"]:
+                validation_result["validation_message"] += f"✅ Condition '{actual_condition}' matches expected. "
+            else:
+                validation_result["validation_message"] += f"⚠️ Condition '{actual_condition}' != expected {expected_conditions}. "
+
+        # Overall success (lenient: either the level or the condition matching counts as a pass)
+        validation_result["success"] = validation_result["level_match"] or validation_result["condition_match"]
+
+        return validation_result
+
+    def run_all_fallback_tests(self):
+        """Execute all fallback tests and generate report"""
+        if not self.components_initialized:
+            print("❌ Cannot run tests: components not initialized")
+            return
+
+        test_cases = self.get_multilevel_test_cases()
+
+        print("\n🚀 Starting Multilevel Fallback Test Suite")
+        print(f"Total test cases: {len(test_cases)}")
+        print(f"Test started at: {self.start_time.strftime('%Y-%m-%d %H:%M:%S')}")
+        print("=" * 80)
+
+        # Execute all tests
+        for test_case in test_cases:
+            result = self.run_single_fallback_test(test_case)
+            self.results.append(result)
+
+        # Generate report
+        self.generate_fallback_report()
+        self.save_fallback_results()
+
+    def generate_fallback_report(self):
+        """Generate detailed fallback analysis report"""
+        end_time = datetime.now()
+        total_duration = (end_time - self.start_time).total_seconds()
+
+        # Nothing to summarize; also avoids dividing by zero below
+        if not self.results:
+            print("⚠️ No test results to report")
+            return
+
+        # Passed: validation succeeded. Failed: an exception was raised.
+        # Partial: ran without error but did not match expectations.
+        successful_tests = [r for r in self.results if r['success']]
+        failed_tests = [r for r in self.results if r.get('error')]
+        partial_tests = [r for r in self.results if not r['success'] and not r.get('error')]
+
+        print("\n" + "=" * 80)
+        print("📊 MULTILEVEL FALLBACK TEST REPORT")
+        print("=" * 80)
+
+        # Overall Statistics
+        print("🕐 Execution Summary:")
+        print(f" Total duration: {total_duration:.3f}s")
+        print(f" Average per test: {total_duration/len(self.results):.3f}s")
+
+        print("\n📈 Test Results:")
+        print(f" Total tests: {len(self.results)}")
+        print(f" Passed: {len(successful_tests)} ✅")
+        print(f" Partial: {len(partial_tests)} ⚠️")
+        print(f" Failed: {len(failed_tests)} ❌")
+        print(f" Success rate: {len(successful_tests)/len(self.results)*100:.1f}%")
+
+        # Level Distribution Analysis (tests that raised an error are excluded)
+        level_distribution = {}
+        level_performance = {}
+
+        for result in self.results:
+            if not result.get('error'):
+                level = result.get('detected_level', 0)
+                level_distribution[level] = level_distribution.get(level, 0) + 1
+
+                if level not in level_performance:
+                    level_performance[level] = []
+                level_performance[level].append(result['execution_time'])
+
+        print("\n🎯 Level Distribution Analysis:")
+        for level in sorted(level_distribution.keys()):
+            count = level_distribution[level]
+            avg_time = sum(level_performance[level]) / len(level_performance[level])
+            level_name = {
+                1: "Predefined Mapping",
+                2: "LLM Extraction",
+                3: "Semantic Search",
+                4: "Validation Rejection",
+                5: "Generic Search"
+            }.get(level, f"Unknown ({level})")
+
+            print(f" Level {level} ({level_name}): {count} tests, avg {avg_time:.3f}s")
+
+        # Category Analysis
+        categories = {}
+        for result in self.results:
+            category = result['test_case']['category']
+            if category not in categories:
+                categories[category] = {'total': 0, 'passed': 0}
+            categories[category]['total'] += 1
+            if result['success']:
+                categories[category]['passed'] += 1
+
+        print("\n📋 Category Analysis:")
+        for category, stats in categories.items():
+            success_rate = stats['passed'] / stats['total'] * 100
+            print(f" {category}: {stats['passed']}/{stats['total']} ({success_rate:.1f}%)")
+
+        # Detailed Results
+        print("\n📝 Detailed Test Results:")
+        for result in self.results:
+            test_case = result['test_case']
+            status = "✅ PASS" if result['success'] else ("❌ FAIL" if result.get('error') else "⚠️ PARTIAL")
+
+            print(f"\n {result['test_id']}: {status}")
+            print(f" Query: '{test_case['query']}'")
+            print(f" Expected Level: {test_case.get('expected_level', 'N/A')}")
+            print(f" Detected Level: {result.get('detected_level', 'N/A')}")
+            print(f" Condition: {result.get('condition_result', {}).get('condition', 'None')}")
+            print(f" Time: {result['execution_time']:.3f}s")
+
+            if result.get('validation_message'):
+                print(f" Validation: {result['validation_message']}")
+
+            if result.get('error'):
+                print(f" Error: {result['error']}")
+
+        print("\n" + "=" * 80)
+
+    def save_fallback_results(self):
+        """Save detailed test results to JSON file"""
+        timestamp = datetime.now().strftime('%Y%m%d_%H%M%S')
+        filename = project_root / 'tests' / f'multilevel_fallback_results_{timestamp}.json'
+
+        try:
+            comprehensive_results = {
+                "test_metadata": {
+                    "timestamp": datetime.now().isoformat(),
+                    "test_type": "multilevel_fallback_validation",
+                    "total_duration_seconds": (datetime.now() - self.start_time).total_seconds(),
+                    "total_tests": len(self.results),
+                    "passed_tests": len([r for r in self.results if r['success']]),
+                    "failed_tests": len([r for r in self.results if not r['success']])
+                },
+                "fallback_results": self.results
+            }
+
+            with open(filename, 'w', encoding='utf-8') as f:
+                json.dump(comprehensive_results, f, indent=2, ensure_ascii=False)
+
+            print(f"📁 Multilevel fallback results saved to: {filename}")
+
+        except Exception as e:
+            logger.error(f"Failed to save test results: {e}")
+            print(f"⚠️ Failed to save test results: {e}")
+
+def main():
+    """Main execution function"""
+    print("🏥 OnCall.ai Multilevel Fallback Validation Test")
+    print("=" * 60)
+
+    # Initialize test suite
+    test_suite = MultilevelFallbackTest()
+
+    # Initialize components
+    test_suite.initialize_components()
+
+    if not test_suite.components_initialized:
+        print("❌ Test suite initialization failed. Exiting.")
+        return 1
+
+    # Run all fallback tests
+    test_suite.run_all_fallback_tests()
+
+    return 0
+
+if __name__ == "__main__":
+    exit_code = main()
+    sys.exit(exit_code)
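
The level-detection rules in `_detect_fallback_level` can be tried without initializing the OnCall.ai components. The following is a minimal standalone sketch of the same decision order, not the project's code: `KNOWN_CONDITIONS` is a hypothetical stand-in for `CONDITION_KEYWORD_MAPPING`, filled only with the three conditions the analysis notes list as Level 1 hits, and the sample result dictionaries are invented for illustration.

```python
from typing import Any, Dict, Set

# Hypothetical stand-in for CONDITION_KEYWORD_MAPPING (assumption, not the real mapping).
KNOWN_CONDITIONS: Set[str] = {
    "acute myocardial infarction",
    "acute stroke",
    "pulmonary embolism",
}


def detect_fallback_level(condition_result: Dict[str, Any],
                          known: Set[str] = KNOWN_CONDITIONS) -> int:
    """Infer the fallback level from the shape of the extraction result."""
    if not condition_result:
        return 0  # no result at all
    if condition_result.get("type") == "invalid_query":
        return 4  # rejected by medical-query validation
    if condition_result.get("condition") == "generic medical query":
        return 5  # fell through to generic search
    if "semantic_confidence" in condition_result:
        return 3  # semantic search attaches a confidence score
    condition = condition_result.get("condition") or ""
    if condition in known:
        return 1  # predefined mapping hit
    return 2 if condition else 0  # any other non-empty condition: LLM extraction


if __name__ == "__main__":
    # Invented sample results, one per branch.
    samples = [
        {"condition": "acute stroke"},
        {"condition": "septic shock"},
        {"condition": "cardiac crisis", "semantic_confidence": 0.8},
        {"type": "invalid_query"},
        {"condition": "generic medical query"},
        {},
    ]
    for sample in samples:
        print(detect_fallback_level(sample), sample)
```

The order of the checks matters: a result that carries both a known condition and a `semantic_confidence` key is reported as Level 3, because the semantic check runs before the mapping lookup. Levels 1 and 2 are told apart only by membership in the mapping, which may explain why an LLM-extracted condition that happens to be a known one is counted as Level 1.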