YanBoChen committed on
Commit
c778aa6
·
1 Parent(s): 988dac9

refactor: remove outdated emergency subset preprocessing documentation

dataset/scripts/20250722_datesetA_emergency_subset_preprocessing_commit_message.txt DELETED
@@ -1,52 +0,0 @@
- feat(dataset): Implement emergency subset extraction with enhanced matching
-
- Implement initial data preprocessing pipeline for RAG system evaluation.
-
- Key Changes:
- - Enhance keyword matching with findall and non-capturing groups
- - Add matched column for tracking all keyword occurrences
- - Implement basic statistics calculation
- - Prepare for data exploration phase
-
- Technical Details:
- 1. Keyword Matching Enhancement:
- - Use non-capturing groups (?:...) to handle multiple matches
- - Implement proper regex pattern with word boundaries
- - Handle NaN values explicitly
-
- 2. Data Flow:
- ```
- Raw Data (guidelines_source_filtered.jsonl)
- │
- ▼
- Keyword Matching (emergency_keywords.txt)
- │ ┌─ Pattern: \b(?:keyword1|keyword2)\b
- │ └─ Flags: re.IGNORECASE
- ▼
- Multiple Match Extraction
- │ ┌─ Use str.findall
- │ └─ Join multiple matches with |
- ▼
- Subset Creation
- │ ┌─ matched column: "keyword1|keyword2"
- │ └─ has_emergency flag
- ▼
- Output Files
- ├─ emergency_subset.jsonl
- └─ emergency_subset.csv
- ```
-
- 3. Next Steps:
- - Run data_explorer.py for detailed analysis
- - Evaluate subset quality against draft_offlineSubsetbuilding.md
- - Consider implementing treatment subset with similar approach
-
- Performance Metrics:
- - Capture all keyword matches (not just first occurrence)
- - Calculate average keywords per document
- - Prepare for co-occurrence analysis
-
- This approach aligns with the RAG system requirements:
- 1. Maintain semantic relationships (multiple keyword tracking)
- 2. Enable detailed analysis (matched column)
- 3. Support future enhancements (treatment subset)
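
For readers who want to see the matching step from the deleted message in concrete form, here is a minimal sketch. It is not the repository's actual script. The message's `str.findall` probably refers to the pandas string accessor, but this sketch uses the standard-library `re.findall` on plain dictionaries so it runs without extra dependencies. The field name `clean_text`, the helper names, the sample documents, and the longest-first keyword ordering are all hypothetical.

```python
import re


def build_pattern(keywords):
    """Compile a pattern of the form \\b(?:kw1|kw2)\\b, ignoring case."""
    # Assumption: try longer keywords first so a multi-word phrase is
    # preferred over a shorter keyword that is its prefix.
    ordered = sorted(keywords, key=len, reverse=True)
    escaped = [re.escape(k) for k in ordered]
    # The group is non-capturing, so findall returns the full matched text.
    return re.compile(r"\b(?:" + "|".join(escaped) + r")\b", re.IGNORECASE)


def match_keywords(text, pattern):
    """Return every match joined with '|', or '' when the text is missing."""
    if not isinstance(text, str):  # covers None and float NaN
        return ""
    return "|".join(pattern.findall(text))


# Hypothetical sample records standing in for lines of the raw JSONL file.
docs = [
    {"id": 1, "clean_text": "Sepsis can lead to shock; treat sepsis early."},
    {"id": 2, "clean_text": "Routine follow-up visit."},
    {"id": 3, "clean_text": None},
]

pattern = build_pattern(["sepsis", "shock"])

for doc in docs:
    doc["matched"] = match_keywords(doc["clean_text"], pattern)
    doc["has_emergency"] = bool(doc["matched"])

# Keep only documents with at least one keyword hit.
subset = [d for d in docs if d["has_emergency"]]

# Average number of keyword occurrences per matched document.
avg_keywords = (
    sum(len(d["matched"].split("|")) for d in subset) / len(subset)
    if subset
    else 0.0
)
```

Storing the joined matches rather than only a boolean flag keeps repeated occurrences, which is what the per-document average and any later co-occurrence count would need.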