Commit 9e4c1bc · YanBoChen · 1 parent: 5f9dffa
feat(evaluation): add seventh evaluation metric for multi-level fallback efficiency and early interception rate
evaluation/evaluation_instruction.md CHANGED
@@ -374,3 +374,328 @@ combine_evaluation_results()
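This is a discussion of evaluation strategy; no code changes are involved.

**Your understanding is correct. RAG-specific metrics can only be tested inside the RAG system, whereas general metrics can be compared across all models. This layered evaluation strategy is sound.**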

---

## 📊 Seventh Evaluation Metric (Specific to the YanBo System)

### 7. Multi-Level Fallback Efficiency (Early Interception Rate)

**Definition:** How efficiently the system resolves queries at the early levels of its multi-level fallback mechanism.

**Measured in:** the multi-level processing logic of `extract_condition_keywords` in `src/user_prompt.py`

**Formula:**
```
Early_Interception_Rate = (Level1_Success + Level2_Success) / Total_Queries

where:
- Level1_Success = number of queries whose condition is found directly in the predefined mapping
- Level2_Success = number of queries resolved by LLM extraction
- Total_Queries  = total number of test queries

Time savings:
Time_Savings = (Late_Avg_Time - Early_Avg_Time) / Late_Avg_Time

Early interception efficiency:
Efficiency_Score = Early_Interception_Rate × (1 + Time_Savings)
```
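
For reference, the three formulas above can be restated in LaTeX together with the range they imply. The symbols $N_1$, $N_2$, $N$, $\bar{t}_{\text{early}}$, $\bar{t}_{\text{late}}$, $R_{\text{early}}$, $S$ and $E$ are shorthand introduced here for Level1_Success, Level2_Success, Total_Queries, Early_Avg_Time, Late_Avg_Time, Early_Interception_Rate, Time_Savings and Efficiency_Score; they are not names used by the project.

```latex
\[
R_{\text{early}} = \frac{N_1 + N_2}{N}, \qquad
S = \frac{\bar{t}_{\text{late}} - \bar{t}_{\text{early}}}{\bar{t}_{\text{late}}}, \qquad
E = R_{\text{early}}\,(1 + S)
\]
\[
0 \le R_{\text{early}} \le 1 \ \text{ and } \ S \le 1
\;\Longrightarrow\; E \le 2\,R_{\text{early}} \le 2 .
\]
```

So the efficiency score is not a percentage: it can reach 2 when every query is intercepted early at negligible cost, and it drops below the early interception rate itself whenever early levels are slower than late ones ($S < 0$).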

**ASCII flow diagram:**
```
Multi-level fallback efficiency:
┌────────────────────────┐    ┌────────────────────────┐    ┌────────────────────────┐
│ User query             │───▶│ Level 1                │───▶│ Direct success         │
│ "chest pain diagnosis" │    │ Predefined mapping     │    │ 35% (fast)             │
└────────────────────────┘    └────────────────────────┘    └────────────────────────┘
                                          │
                                          ▼ (fail)
                              ┌────────────────────────┐    ┌────────────────────────┐
                              │ Level 2                │───▶│ LLM extraction success │
                              │ LLM extraction         │    │ 40% (medium)           │
                              └────────────────────────┘    └────────────────────────┘
                                          │
                                          ▼ (fail)
                              ┌────────────────────────┐    ┌────────────────────────┐
                              │ Levels 3-5             │───▶│ Fallback success       │
                              │ Later levels           │    │ 20% (slow)             │
                              └────────────────────────┘    └────────────────────────┘
                                          │
                                          ▼ (fail)
                              ┌────────────────────────┐
                              │ Complete failure       │
                              │ 5% (error)             │
                              └────────────────────────┘

Early interception rate = 35% + 40% = 75% ✅ target ≥ 70%
```
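
As a worked example, the distribution in the diagram (35 / 40 / 20 / 5 out of 100 queries) can be plugged into the formulas. This is a minimal sketch: `FallbackCounts` and the helper functions are illustrative names, and the average times of 2.0 s and 8.0 s are made-up values chosen only to show the arithmetic, not measurements.

```python
from dataclasses import dataclass


@dataclass
class FallbackCounts:
    """Number of queries resolved at each stage (illustrative container)."""
    level1: int
    level2: int
    later: int
    failures: int

    @property
    def total(self) -> int:
        return self.level1 + self.level2 + self.later + self.failures


def early_interception_rate(counts: FallbackCounts) -> float:
    """Share of all queries resolved at Level 1 or Level 2."""
    return (counts.level1 + counts.level2) / counts.total


def time_savings(early_avg: float, late_avg: float) -> float:
    """Relative time saved by early levels; 0.0 when there is no late baseline."""
    return (late_avg - early_avg) / late_avg if late_avg > 0 else 0.0


def efficiency_score(rate: float, savings: float) -> float:
    return rate * (1 + savings)


counts = FallbackCounts(level1=35, level2=40, later=20, failures=5)
rate = early_interception_rate(counts)
savings = time_savings(early_avg=2.0, late_avg=8.0)  # hypothetical timings
score = efficiency_score(rate, savings)
print(f"rate={rate:.2f} savings={savings:.2f} score={score:.4f}")
```

With these assumed timings the rate is 0.75 and the score exceeds 1, which illustrates that the score rewards both catching queries early and doing so faster.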

**Implementation framework:**
```python
# Based on the multi-level processing logic in user_prompt.py
import time
from typing import Any, Dict, List, Tuple


def evaluate_early_interception_efficiency(test_queries: List[str]) -> Dict[str, Any]:
    """Evaluate the early interception rate, a core strength of the YanBo system."""
    if not test_queries:
        raise ValueError("test_queries must not be empty")

    level1_success = 0  # Level 1: predefined mapping succeeded
    level2_success = 0  # Level 2: LLM extraction succeeded
    later_success = 0   # Levels 3-5: a later level succeeded
    total_failures = 0  # Every level failed

    early_times = []  # Processing times of early (Level 1-2) successes
    late_times = []   # Processing times of late (Level 3-5) successes

    for query in test_queries:
        # Track the level at which each query succeeds and how long it takes
        success_level, processing_time = track_query_success_level(query)

        if success_level == 1:
            level1_success += 1
            early_times.append(processing_time)
        elif success_level == 2:
            level2_success += 1
            early_times.append(processing_time)
        elif success_level in (3, 4, 5):
            later_success += 1
            late_times.append(processing_time)
        else:
            total_failures += 1

    total_queries = len(test_queries)
    early_success_count = level1_success + level2_success

    # Time savings; only meaningful when both early and late successes were observed
    early_avg_time = sum(early_times) / len(early_times) if early_times else 0.0
    late_avg_time = sum(late_times) / len(late_times) if late_times else 0.0
    if early_times and late_avg_time > 0:
        time_savings = (late_avg_time - early_avg_time) / late_avg_time
    else:
        time_savings = 0.0

    # Combined efficiency score
    early_interception_rate = early_success_count / total_queries
    efficiency_score = early_interception_rate * (1 + time_savings)

    return {
        # Core metrics
        "early_interception_rate": early_interception_rate,
        "level1_success_rate": level1_success / total_queries,
        "level2_success_rate": level2_success / total_queries,

        # Time efficiency
        "early_avg_time": early_avg_time,
        "late_avg_time": late_avg_time,
        "time_savings_rate": time_savings,

        # System health
        "total_success_rate": (total_queries - total_failures) / total_queries,
        "miss_rate": total_failures / total_queries,

        # Combined efficiency
        "overall_efficiency_score": efficiency_score,

        # Detailed distribution (raw counts)
        "success_distribution": {
            "level1": level1_success,
            "level2": level2_success,
            "later_levels": later_success,
            "failures": total_failures
        }
    }


def track_query_success_level(query: str) -> Tuple[int, float]:
    """
    Track the level at which a query succeeds and record the elapsed time.

    Args:
        query: the test query

    Returns:
        Tuple of (success_level, processing_time); success_level is 0 on complete failure
    """
    start_time = time.perf_counter()  # monotonic clock, suited to measuring intervals

    # Simulates the level-by-level logic of user_prompt.py
    try:
        # Level 1: check the predefined mapping
        if check_predefined_mapping(query):
            processing_time = time.perf_counter() - start_time
            return (1, processing_time)

        # Level 2: LLM condition extraction
        llm_result = llm_client.analyze_medical_query(query)
        if llm_result.get('extracted_condition'):
            processing_time = time.perf_counter() - start_time
            return (2, processing_time)

        # Level 3: semantic search
        semantic_result = semantic_search_fallback(query)
        if semantic_result:
            processing_time = time.perf_counter() - start_time
            return (3, processing_time)

        # Level 4: medical validation
        validation_result = validate_medical_query(query)
        if not validation_result:  # a falsy result means validation passed
            processing_time = time.perf_counter() - start_time
            return (4, processing_time)

        # Level 5: generic search
        generic_result = generic_medical_search(query)
        if generic_result:
            processing_time = time.perf_counter() - start_time
            return (5, processing_time)

        # Complete failure
        processing_time = time.perf_counter() - start_time
        return (0, processing_time)

    except Exception:
        # Any error at any level is counted as a complete failure
        processing_time = time.perf_counter() - start_time
        return (0, processing_time)


def check_predefined_mapping(query: str) -> bool:
    """Check whether the query matches any keyword in the predefined mapping."""
    # Based on CONDITION_KEYWORD_MAPPING in medical_conditions.py
    from medical_conditions import CONDITION_KEYWORD_MAPPING

    query_lower = query.lower()
    for condition, keywords in CONDITION_KEYWORD_MAPPING.items():
        # Case-insensitive substring match against every keyword of the condition
        if any(keyword.lower() in query_lower for keyword in keywords):
            return True
    return False
```
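
The level-by-level pattern above can also be expressed as a generic loop over handlers, which makes it easy to exercise the tracking logic without the real LLM client or search backends. The sketch below is self-contained and uses hypothetical stand-in handlers (`keyword_lookup`, `flaky_extractor`, `catch_all`); none of them are part of the project. Note one deliberate difference from the framework above: here an exception at one level is treated as a miss and the chain keeps falling back, whereas the framework counts any exception as a complete failure.

```python
import time
from typing import Callable, List, Optional, Tuple

Handler = Callable[[str], Optional[str]]


def run_fallback_chain(query: str, handlers: List[Handler]) -> Tuple[int, float]:
    """Return (1-based level that succeeded, elapsed seconds); level 0 means all failed."""
    start = time.perf_counter()
    for level, handler in enumerate(handlers, start=1):
        try:
            result = handler(query)
        except Exception:
            result = None  # an error at this level counts as a miss; keep falling back
        if result:
            return level, time.perf_counter() - start
    return 0, time.perf_counter() - start


# Hypothetical stand-ins for the real levels
def keyword_lookup(query: str) -> Optional[str]:
    return "chest pain" if "chest pain" in query.lower() else None


def flaky_extractor(query: str) -> Optional[str]:
    raise RuntimeError("simulated LLM outage")


def catch_all(query: str) -> Optional[str]:
    return "generic" if "patient" in query.lower() else None


handlers = [keyword_lookup, flaky_extractor, catch_all]
for q in ["Chest pain workup", "Patient with fever", "hello"]:
    level, elapsed = run_fallback_chain(q, handlers)
    print(f"{q!r}: level={level}, elapsed={elapsed:.6f}s")
```

Whether an exception should abort the chain or fall through to the next level is a design choice; the second option trades a cleaner failure signal for a lower miss rate.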

**Target thresholds:**
- Early interception rate ≥ 70% (resolved within the first two levels)
- Time savings rate ≥ 60% (early levels faster than later levels)
- Total success rate ≥ 95% (miss rate ≤ 5%)
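
These thresholds can be checked mechanically against the dictionary returned by `evaluate_early_interception_efficiency`. The following is a small sketch: `THRESHOLDS` and `check_thresholds` are illustrative names, the keys mirror those returned by the framework above, and the sample values are invented for demonstration.

```python
from typing import Dict

# Minimum acceptable value per metric, taken from the target thresholds above
THRESHOLDS: Dict[str, float] = {
    "early_interception_rate": 0.70,
    "time_savings_rate": 0.60,
    "total_success_rate": 0.95,
}


def check_thresholds(metrics: Dict[str, float]) -> Dict[str, bool]:
    """Map each thresholded metric to True if it meets its minimum; missing metrics fail."""
    return {
        name: metrics.get(name, 0.0) >= minimum
        for name, minimum in THRESHOLDS.items()
    }


# Invented sample values, for illustration only
sample = {
    "early_interception_rate": 0.75,
    "time_savings_rate": 0.55,
    "total_success_rate": 0.95,
}
report = check_thresholds(sample)
for name, passed in report.items():
    print(f"{name}: {'PASS' if passed else 'FAIL'}")
```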
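
---

## 🧪 Updated Complete Evaluation Workflow

### Test Case Design
```python
# Test set based on the example queries in readme.md
MEDICAL_TEST_CASES = [
    # Expected to succeed at Level 1 (predefined mapping)
    "How should chest pain in a patient be managed?",
    "How is myocardial infarction diagnosed?",

    # Expected to succeed at Level 2 (LLM extraction)
    "60-year-old male with a history of hypertension and sudden-onset chest pain. What are the possible causes and how should he be assessed?",
    "30-year-old patient with sudden severe headache and neck stiffness. What is the differential diagnosis?",

    # Expected to succeed at Level 3+ (complex queries)
    "Patient with acute shortness of breath and leg edema. What should be considered?",
    "20-year-old female with no medical history and a sudden seizure. What are the possible causes and the complete management workflow?",

    # Edge case
    "Suspected acute hemorrhagic stroke. What are the next steps in management?"
]
```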

### Updated Evaluation Procedure
```python
from typing import Any, Dict, List


def run_complete_evaluation(model_name: str, test_cases: List[str]) -> Dict[str, Any]:
    """Run the complete seven-metric evaluation."""

    def mean(values: List[float]) -> float:
        # Guard against an empty list (for example, an empty test set)
        return sum(values) / len(values) if values else 0.0

    results = {
        "model": model_name,
        "metrics": {},
        "detailed_results": []
    }

    total_latencies = []
    extraction_successes = []
    relevance_scores = []
    coverage_scores = []
    actionability_scores = []
    evidence_scores = []

    for query in test_cases:
        # Run the model and measure all per-query metrics

        # 1. Total processing time
        latency_result = measure_total_latency(query)
        total_latencies.append(latency_result['total_latency'])

        # 2. Condition extraction success rate
        extraction_result = evaluate_condition_extraction([query])
        extraction_successes.append(extraction_result['success_rate'])

        # 3 & 4. Retrieval relevance and coverage
        retrieval_results = get_retrieval_results(query)
        relevance_result = evaluate_retrieval_relevance(retrieval_results)
        relevance_scores.append(relevance_result['average_relevance'])

        generated_advice = get_generated_advice(query, retrieval_results)
        coverage_result = evaluate_retrieval_coverage(generated_advice, retrieval_results)
        coverage_scores.append(coverage_result['coverage'])

        # 5 & 6. LLM-based evaluation
        response_data = {
            'query': query,
            'advice': generated_advice,
            'retrieval_results': retrieval_results
        }

        actionability_result = evaluate_clinical_actionability([response_data])
        actionability_scores.append(actionability_result[0]['overall_score'])

        evidence_result = evaluate_clinical_evidence([response_data])
        evidence_scores.append(evidence_result[0]['overall_score'])

        # Record detailed results...

    # 7. Multi-level fallback efficiency (new), measured only for the YanBo system.
    # It is computed once over the whole test set: the time-savings term needs both
    # early and late successes, which a single-query call can never provide.
    fallback_efficiency = 0.0
    if model_name == "Med42-70B_general_RAG" and test_cases:
        fallback_result = evaluate_early_interception_efficiency(test_cases)
        fallback_efficiency = fallback_result['overall_efficiency_score']

    # Compute average metrics
    results["metrics"] = {
        "average_latency": mean(total_latencies),
        "extraction_success_rate": mean(extraction_successes),
        "average_relevance": mean(relevance_scores),
        "average_coverage": mean(coverage_scores),
        "average_actionability": mean(actionability_scores),
        "average_evidence_score": mean(evidence_scores),
        # New metric (only meaningful for the RAG system; 0.0 otherwise)
        "average_fallback_efficiency": fallback_efficiency
    }

    return results
```

---

## 📊 Updated System Success Criteria

### System Performance Targets (Seven Metrics)
```
✅ Pass criteria:
1. Total processing time ≤ 30 seconds
2. Condition extraction success rate ≥ 80%
3. Retrieval relevance ≥ 0.25 (based on actual medical data)
4. Retrieval coverage ≥ 60%
5. Clinical actionability ≥ 7.0/10
6. Clinical evidence score ≥ 7.5/10
7. Early interception rate ≥ 70% (multi-level fallback efficiency)

🎯 Success criteria for the YanBo RAG system:
- The RAG-enhanced version outperforms the baseline Med42-70B on 5-7 metrics
- The early interception rate demonstrates the advantage of the multi-level design
- Overall improvement ≥ 15%
```
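
One way to check the RAG-versus-baseline criteria is to compare the two `metrics` dictionaries metric by metric. The sketch below rests on assumptions that the text above does not settle: "overall improvement" is taken to mean the average relative gain over the metrics both systems report, latency is treated as lower-is-better, and the sample numbers are invented. `compare_to_baseline`, `relative_improvement` and `LOWER_IS_BETTER` are illustrative names.

```python
from typing import Any, Dict

# Metrics where a smaller value is the better one (assumption: only latency)
LOWER_IS_BETTER = {"average_latency"}


def relative_improvement(name: str, rag: float, baseline: float) -> float:
    """Signed relative gain of the RAG system over the baseline (positive is better)."""
    if baseline == 0:
        return 0.0  # relative change is undefined; treat as no measurable gain
    change = (rag - baseline) / abs(baseline)
    return -change if name in LOWER_IS_BETTER else change


def compare_to_baseline(rag: Dict[str, float], baseline: Dict[str, float]) -> Dict[str, Any]:
    """Compare only metrics reported by both systems (RAG-only metrics are skipped)."""
    shared = sorted(set(rag) & set(baseline))
    gains = {m: relative_improvement(m, rag[m], baseline[m]) for m in shared}
    wins = sum(1 for g in gains.values() if g > 0)
    mean_gain = sum(gains.values()) / len(gains) if gains else 0.0
    return {"wins": wins, "mean_gain": mean_gain, "gains": gains}


# Invented numbers, for illustration only
baseline_metrics = {
    "average_latency": 20.0,
    "average_actionability": 6.0,
    "average_evidence_score": 6.0,
}
rag_metrics = {
    "average_latency": 25.0,
    "average_actionability": 7.5,
    "average_evidence_score": 7.5,
    "average_fallback_efficiency": 1.2,  # RAG-only, so excluded from the comparison
}
result = compare_to_baseline(rag_metrics, baseline_metrics)
print(result["wins"], f"{result['mean_gain']:.1%}")
```

If a different definition of overall improvement is intended (for example a weighted score), only `compare_to_baseline` needs to change.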
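
### Analysis of Advantages Specific to the YanBo System
```
Advantages of multi-level fallback:
├── Miss protection: multiple levels reduce the failure rate to < 5%
├── Time optimization: 70%+ of queries are resolved quickly in the first two levels
├── System stability: if one level fails, later levels provide a safety net
└── Intelligent routing: queries of different complexity are automatically routed to the appropriate level
```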

---

**The seventh metric has been added. It focuses on measuring the early interception efficiency and time savings of your multi-level fallback system.**