YanBoChen committed
Commit · c778aa6
1 Parent(s): 988dac9
refactor: remove outdated emergency subset preprocessing documentation
dataset/scripts/20250722_datesetA_emergency_subset_preprocessing_commit_message.txt
DELETED
@@ -1,52 +0,0 @@
-feat(dataset): Implement emergency subset extraction with enhanced matching
-
-Implement initial data preprocessing pipeline for RAG system evaluation.
-
-Key Changes:
-- Enhance keyword matching with findall and non-capturing groups
-- Add matched column for tracking all keyword occurrences
-- Implement basic statistics calculation
-- Prepare for data exploration phase
-
-Technical Details:
-1. Keyword Matching Enhancement:
-   - Use non-capturing groups (?:...) to handle multiple matches
-   - Implement proper regex pattern with word boundaries
-   - Handle NaN values explicitly
-
-2. Data Flow:
-```
-Raw Data (guidelines_source_filtered.jsonl)
-    │
-    ▼
-Keyword Matching (emergency_keywords.txt)
-    │   ├── Pattern: \b(?:keyword1|keyword2)\b
-    │   └── Flags: re.IGNORECASE
-    ▼
-Multiple Match Extraction
-    │   ├── Use str.findall
-    │   └── Join multiple matches with |
-    ▼
-Subset Creation
-    │   ├── matched column: "keyword1|keyword2"
-    │   └── has_emergency flag
-    ▼
-Output Files
-    ├── emergency_subset.jsonl
-    └── emergency_subset.csv
-```
-
-3. Next Steps:
-   - Run data_explorer.py for detailed analysis
-   - Evaluate subset quality against draft_offlineSubsetbuilding.md
-   - Consider implementing treatment subset with similar approach
-
-Performance Metrics:
-- Capture all keyword matches (not just first occurrence)
-- Calculate average keywords per document
-- Prepare for co-occurrence analysis
-
-This approach aligns with the RAG system requirements:
-1. Maintain semantic relationships (multiple keyword tracking)
-2. Enable detailed analysis (matched column)
-3. Support future enhancements (treatment subset)