Spaces: ybchen928/oncall-guide-ai


YanBoChen committed on Jul 22

Commit c778aa6 (1 parent: 988dac9)

refactor: remove outdated emergency subset preprocessing documentation


Files changed (1)

dataset/scripts/20250722_datesetA_emergency_subset_preprocessing_commit_message.txt DELETED (+0 -52)

@@ -1,52 +0,0 @@
-feat(dataset): Implement emergency subset extraction with enhanced matching
-Implement initial data preprocessing pipeline for RAG system evaluation.
-Key Changes:
-- Enhance keyword matching with findall and non-capturing groups
-- Add matched column for tracking all keyword occurrences
-- Implement basic statistics calculation
-- Prepare for data exploration phase
-Technical Details:
-1. Keyword Matching Enhancement:
-   - Use non-capturing groups (?:...) to handle multiple matches
-   - Implement proper regex pattern with word boundaries
-   - Handle NaN values explicitly
-2. Data Flow:
-```
-Raw Data (guidelines_source_filtered.jsonl)
-     │
-     ▼
-Keyword Matching (emergency_keywords.txt)
-     │    ┌─ Pattern: \b(?:keyword1|keyword2)\b
-     │    └─ Flags: re.IGNORECASE
-     ▼
-Multiple Match Extraction
-     │    ┌─ Use str.findall
-     │    └─ Join multiple matches with |
-     ▼
-Subset Creation
-     │    ┌─ matched column: "keyword1|keyword2"
-     │    └─ has_emergency flag
-     ▼
-Output Files
-     ├─ emergency_subset.jsonl
-     └─ emergency_subset.csv
-```
-3. Next Steps:
-   - Run data_explorer.py for detailed analysis
-   - Evaluate subset quality against draft_offlineSubsetbuilding.md
-   - Consider implementing treatment subset with similar approach
-Performance Metrics:
-- Capture all keyword matches (not just first occurrence)
-- Calculate average keywords per document
-- Prepare for co-occurrence analysis
-This approach aligns with the RAG system requirements:
-1. Maintain semantic relationships (multiple keyword tracking)
-2. Enable detailed analysis (matched column)
-3. Support future enhancements (treatment subset)
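
For readers who want to see what the removed note described, the matching step might be sketched roughly as follows. This is a minimal illustration and not the repository's actual script: it uses only the standard library `re` module instead of pandas `str.findall`, and the function names and the `clean_text` field are assumptions made for the example. The non-capturing group `(?:...)` matters because `findall` returns group contents when a pattern has capturing groups; without any, it returns each whole match.

```python
import re
from typing import Iterable, Optional


def build_keyword_pattern(keywords: Iterable[str]) -> re.Pattern:
    """Compile \\b(?:kw1|kw2|...)\\b with re.IGNORECASE (hypothetical helper)."""
    escaped = [re.escape(k.strip()) for k in keywords if k.strip()]
    if not escaped:
        raise ValueError("at least one non-empty keyword is required")
    # Longer phrases first, so a phrase wins over a shorter keyword it starts with.
    escaped.sort(key=len, reverse=True)
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def match_keywords(text: Optional[object], pattern: re.Pattern) -> str:
    """Return every keyword occurrence joined with '|', or '' for missing text."""
    # NaN arrives as a float and missing values as None; treat both as no text.
    if not isinstance(text, str):
        return ""
    return "|".join(pattern.findall(text))


def annotate(records, pattern, text_field="clean_text"):
    """Add 'matched' and 'has_emergency' to each record.

    The field name 'clean_text' is an assumption, not taken from the source.
    """
    annotated = []
    for record in records:
        matched = match_keywords(record.get(text_field), pattern)
        annotated.append({**record, "matched": matched, "has_emergency": bool(matched)})
    return annotated


def average_keywords_per_document(subset) -> float:
    """Mean number of keyword occurrences across records that matched."""
    counts = [len(r["matched"].split("|")) for r in subset if r["matched"]]
    return sum(counts) / len(counts) if counts else 0.0
```

In use, the keywords would be read from `emergency_keywords.txt` (one per line), each JSONL record passed through `annotate`, and the records with `has_emergency` set written out as the emergency subset.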