YanBoChen committed
Commit 68cfce0 · Parent: cd2cfdd

feat(data-processing): implement data processing pipeline with embeddings

BREAKING CHANGE: Add data processing implementation with robust path handling

Key Changes:
1. Create DataProcessor class for medical data processing:
- Handle paths with spaces and special characters
- Support dataset/dataset directory structure
- Add detailed logging for debugging

2. Implement core functionalities:
- Load filtered emergency and treatment data
- Create intelligent chunks based on matched keywords
- Generate embeddings using NeuML/pubmedbert-base-embeddings
- Build ANNOY indices for vector search
- Save embeddings and metadata separately

3. Add test coverage:
- Basic data loading tests
- Chunking functionality tests
- Model loading tests

Technical Details:
- Use pathlib.Path.resolve() for robust path handling
- Separate storage for embeddings and indices:
* /models/embeddings/ for vector representations
* /models/indices/annoy/ for search indices
- Keep keywords as metadata without embedding

Testing:
✅ Data loading: 11,914 emergency + 11,023 treatment records
✅ Chunking: Successful with keyword-centered approach
✅ Model loading: NeuML/pubmedbert-base-embeddings (768 dims)

Next Steps:
- Integrate with Meditron for enhanced processing
- Implement prompt engineering
- Add hybrid search functionality
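
To make the "keyword-centered" chunking above concrete, here is a minimal usage sketch against the DataProcessor class added below; the sample text, keywords, and doc_id are made up for illustration, and src/ is assumed to be importable (e.g. via sys.path):

from data_processing import DataProcessor  # src/data_processing.py, added in this commit

processor = DataProcessor()

# Hypothetical guideline snippet; keywords are pipe-separated, as in the filtered data.
text = "Patients with chest pain and suspected MI should receive aspirin immediately ..."
chunks = processor.create_keyword_centered_chunks(
    text=text,
    matched_keywords="MI|chest pain",
    chunk_size=512,
    doc_id="demo_001",
)

# One chunk per keyword found in the text, centered on that keyword; the keywords
# travel along as metadata and are not embedded themselves.
for chunk in chunks:
    print(chunk["chunk_id"], chunk["primary_keyword"], chunk["chunk_start"], chunk["chunk_end"])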

requirements.txt CHANGED
@@ -64,6 +64,7 @@ safehttpx==0.1.6
  safetensors==0.5.3
  seaborn==0.13.2
  semantic-version==2.10.0
+ sentence-transformers==3.0.1
  shellingham==1.5.4
  six==1.17.0
  sniffio==1.3.1
src/commit_message_20250726_data_processing.txt ADDED
@@ -0,0 +1,38 @@
+ feat(data-processing): implement data processing pipeline with embeddings
+
+ BREAKING CHANGE: Add data processing implementation with robust path handling
+
+ Key Changes:
+ 1. Create DataProcessor class for medical data processing:
+ - Handle paths with spaces and special characters
+ - Support dataset/dataset directory structure
+ - Add detailed logging for debugging
+
+ 2. Implement core functionalities:
+ - Load filtered emergency and treatment data
+ - Create intelligent chunks based on matched keywords
+ - Generate embeddings using NeuML/pubmedbert-base-embeddings
+ - Build ANNOY indices for vector search
+ - Save embeddings and metadata separately
+
+ 3. Add test coverage:
+ - Basic data loading tests
+ - Chunking functionality tests
+ - Model loading tests
+
+ Technical Details:
+ - Use pathlib.Path.resolve() for robust path handling
+ - Separate storage for embeddings and indices:
+ * /models/embeddings/ for vector representations
+ * /models/indices/annoy/ for search indices
+ - Keep keywords as metadata without embedding
+
+ Testing:
+ ✅ Data loading: 11,914 emergency + 11,023 treatment records
+ ✅ Chunking: Successful with keyword-centered approach
+ ✅ Model loading: NeuML/pubmedbert-base-embeddings (768 dims)
+
+ Next Steps:
+ - Integrate with Meditron for enhanced processing
+ - Implement prompt engineering
+ - Add hybrid search functionality
src/data_processing.py ADDED
@@ -0,0 +1,531 @@
+ """
+ OnCall.ai Data Processing Module
+
+ This module handles:
+ 1. Loading filtered medical guideline data
+ 2. Creating intelligent chunks based on matched keywords
+ 3. Generating embeddings using NeuML/pubmedbert-base-embeddings
+ 4. Building ANNOY indices for vector search
+ 5. Data quality validation
+
+ Author: OnCall.ai Team
+ Date: 2025-07-26
+ """
+
+ import os
+ import json
+ import pandas as pd
+ import numpy as np
+ from pathlib import Path
+ from typing import List, Dict, Tuple, Any
+ from sentence_transformers import SentenceTransformer
+ from annoy import AnnoyIndex
+ import logging
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ class DataProcessor:
+     """Main data processing class for OnCall.ai RAG system"""
+
+     def __init__(self, base_dir: str = None):
+         """
+         Initialize DataProcessor
+
+         Args:
+             base_dir: Base directory path for the project
+         """
+         self.base_dir = Path(base_dir).resolve() if base_dir else Path(__file__).parent.parent.resolve()
+         self.dataset_dir = (self.base_dir / "dataset" / "dataset").resolve()  # Corrected to the actual data directory
+         self.models_dir = (self.base_dir / "models").resolve()
+
+         # Model configuration
+         self.embedding_model_name = "NeuML/pubmedbert-base-embeddings"
+         self.embedding_dim = 768  # PubMedBERT dimension
+         self.chunk_size = 512
+
+         # Initialize model (will be loaded when needed)
+         self.embedding_model = None
+
+         # Data containers
+         self.emergency_data = None
+         self.treatment_data = None
+         self.emergency_chunks = []
+         self.treatment_chunks = []
+
+         logger.info("Initialized DataProcessor with:")
+         logger.info(f"  Base directory: {self.base_dir}")
+         logger.info(f"  Dataset directory: {self.dataset_dir}")
+         logger.info(f"  Models directory: {self.models_dir}")
+
+     def load_embedding_model(self):
+         """Load the embedding model"""
+         if self.embedding_model is None:
+             logger.info(f"Loading embedding model: {self.embedding_model_name}")
+             self.embedding_model = SentenceTransformer(self.embedding_model_name)
+             logger.info("Embedding model loaded successfully")
+         return self.embedding_model
+
+     def load_filtered_data(self) -> Tuple[pd.DataFrame, pd.DataFrame]:
+         """
+         Load pre-filtered emergency and treatment data
+
+         Returns:
+             Tuple of (emergency_data, treatment_data) DataFrames
+         """
+         logger.info("Loading filtered medical data...")
+
+         # File paths
+         emergency_path = (self.dataset_dir / "emergency" / "emergency_subset_opt.jsonl").resolve()
+         treatment_path = (self.dataset_dir / "emergency_treatment" / "emergency_treatment_subset_opt.jsonl").resolve()
+
+         logger.info(f"Looking for emergency data at: {emergency_path}")
+         logger.info(f"Looking for treatment data at: {treatment_path}")
+
+         # Validate file existence
+         if not emergency_path.exists():
+             raise FileNotFoundError(f"Emergency data not found: {emergency_path}")
+         if not treatment_path.exists():
+             raise FileNotFoundError(f"Treatment data not found: {treatment_path}")
+
+         # Load data
+         self.emergency_data = pd.read_json(str(emergency_path), lines=True)  # Use str() to ensure the path is handled correctly
+         self.treatment_data = pd.read_json(str(treatment_path), lines=True)
+
+         logger.info(f"Loaded {len(self.emergency_data)} emergency records")
+         logger.info(f"Loaded {len(self.treatment_data)} treatment records")
+
+         return self.emergency_data, self.treatment_data
+
+     def create_keyword_centered_chunks(self, text: str, matched_keywords: str,
+                                        chunk_size: int = 512, doc_id: str = None) -> List[Dict[str, Any]]:
+         """
+         Create chunks centered around matched keywords
+
+         Args:
+             text: Input text
+             matched_keywords: Pipe-separated keywords (e.g., "MI|chest pain|fever")
+             chunk_size: Size of each chunk
+             doc_id: Document ID for tracking
+
+         Returns:
+             List of chunk dictionaries
+         """
+         if not matched_keywords or pd.isna(matched_keywords):
+             return []
+
+         chunks = []
+         keywords = matched_keywords.split("|") if matched_keywords else []
+
+         for i, keyword in enumerate(keywords):
+             # Find keyword position in text (case insensitive)
+             keyword_pos = text.lower().find(keyword.lower())
+
+             if keyword_pos != -1:
+                 # Calculate chunk boundaries centered on keyword
+                 start = max(0, keyword_pos - chunk_size // 2)
+                 end = min(len(text), keyword_pos + chunk_size // 2)
+
+                 # Extract chunk text
+                 chunk_text = text[start:end].strip()
+
+                 if chunk_text:  # Only add non-empty chunks
+                     chunk_info = {
+                         "text": chunk_text,
+                         "primary_keyword": keyword,
+                         "all_matched_keywords": matched_keywords,
+                         "keyword_position": keyword_pos,
+                         "chunk_start": start,
+                         "chunk_end": end,
+                         "chunk_id": f"{doc_id}_chunk_{i}" if doc_id else f"chunk_{i}",
+                         "source_doc_id": doc_id
+                     }
+                     chunks.append(chunk_info)
+
+         return chunks
+
+     def create_dual_keyword_chunks(self, text: str, emergency_keywords: str,
+                                    treatment_keywords: str, chunk_size: int = 512,
+                                    doc_id: str = None) -> List[Dict[str, Any]]:
+         """
+         Create chunks for treatment data with both emergency and treatment keywords
+
+         Args:
+             text: Input text
+             emergency_keywords: Emergency keywords
+             treatment_keywords: Treatment keywords
+             chunk_size: Size of each chunk
+             doc_id: Document ID for tracking
+
+         Returns:
+             List of chunk dictionaries
+         """
+         if not treatment_keywords or pd.isna(treatment_keywords):
+             return []
+
+         chunks = []
+         em_keywords = emergency_keywords.split("|") if emergency_keywords else []
+         tr_keywords = treatment_keywords.split("|") if treatment_keywords else []
+
+         # Process treatment keywords as primary (since this is treatment-focused data)
+         for i, tr_keyword in enumerate(tr_keywords):
+             tr_pos = text.lower().find(tr_keyword.lower())
+
+             if tr_pos != -1:
+                 # Find closest emergency keyword for context
+                 closest_em_keyword = None
+                 closest_distance = float('inf')
+
+                 for em_keyword in em_keywords:
+                     em_pos = text.lower().find(em_keyword.lower())
+                     if em_pos != -1:
+                         distance = abs(tr_pos - em_pos)
+                         if distance < closest_distance and distance < chunk_size:
+                             closest_distance = distance
+                             closest_em_keyword = em_keyword
+
+                 # Calculate chunk boundaries
+                 if closest_em_keyword:
+                     # Center between both keywords
+                     em_pos = text.lower().find(closest_em_keyword.lower())
+                     center = (tr_pos + em_pos) // 2
+                 else:
+                     # Center on treatment keyword
+                     center = tr_pos
+
+                 start = max(0, center - chunk_size // 2)
+                 end = min(len(text), center + chunk_size // 2)
+
+                 chunk_text = text[start:end].strip()
+
+                 if chunk_text:
+                     chunk_info = {
+                         "text": chunk_text,
+                         "primary_keyword": tr_keyword,
+                         "emergency_keywords": emergency_keywords,
+                         "treatment_keywords": treatment_keywords,
+                         "closest_emergency_keyword": closest_em_keyword,
+                         "keyword_distance": closest_distance if closest_em_keyword else None,
+                         "chunk_start": start,
+                         "chunk_end": end,
+                         "chunk_id": f"{doc_id}_treatment_chunk_{i}" if doc_id else f"treatment_chunk_{i}",
+                         "source_doc_id": doc_id
+                     }
+                     chunks.append(chunk_info)
+
+         return chunks
+
+     def process_emergency_chunks(self) -> List[Dict[str, Any]]:
+         """Process emergency data into chunks"""
+         logger.info("Processing emergency data into chunks...")
+
+         if self.emergency_data is None:
+             raise ValueError("Emergency data not loaded. Call load_filtered_data() first.")
+
+         all_chunks = []
+
+         for idx, row in self.emergency_data.iterrows():
+             if pd.notna(row.get('clean_text')) and pd.notna(row.get('matched')):
+                 chunks = self.create_keyword_centered_chunks(
+                     text=row['clean_text'],
+                     matched_keywords=row['matched'],
+                     chunk_size=self.chunk_size,
+                     doc_id=str(row.get('id', idx))
+                 )
+
+                 # Add metadata to each chunk
+                 for chunk in chunks:
+                     chunk.update({
+                         'source_type': 'emergency',
+                         'source_title': row.get('title', ''),
+                         'source_url': row.get('url', ''),
+                         'has_emergency': row.get('has_emergency', True),
+                         'doc_type': row.get('type', 'emergency')
+                     })
+
+                 all_chunks.extend(chunks)
+
+         self.emergency_chunks = all_chunks
+         logger.info(f"Generated {len(all_chunks)} emergency chunks")
+         return all_chunks
+
+     def process_treatment_chunks(self) -> List[Dict[str, Any]]:
+         """Process treatment data into chunks"""
+         logger.info("Processing treatment data into chunks...")
+
+         if self.treatment_data is None:
+             raise ValueError("Treatment data not loaded. Call load_filtered_data() first.")
+
+         all_chunks = []
+
+         for idx, row in self.treatment_data.iterrows():
+             if (pd.notna(row.get('clean_text')) and
+                     pd.notna(row.get('treatment_matched'))):
+
+                 chunks = self.create_dual_keyword_chunks(
+                     text=row['clean_text'],
+                     emergency_keywords=row.get('matched', ''),
+                     treatment_keywords=row['treatment_matched'],
+                     chunk_size=self.chunk_size,
+                     doc_id=str(row.get('id', idx))
+                 )
+
+                 # Add metadata to each chunk
+                 for chunk in chunks:
+                     chunk.update({
+                         'source_type': 'treatment',
+                         'source_title': row.get('title', ''),
+                         'source_url': row.get('url', ''),
+                         'has_emergency': row.get('has_emergency', True),
+                         'has_treatment': row.get('has_treatment', True),
+                         'doc_type': row.get('type', 'treatment')
+                     })
+
+                 all_chunks.extend(chunks)
+
+         self.treatment_chunks = all_chunks
+         logger.info(f"Generated {len(all_chunks)} treatment chunks")
+         return all_chunks
+
+     def generate_embeddings(self, chunks: List[Dict[str, Any]],
+                             chunk_type: str = "emergency") -> np.ndarray:
+         """
+         Generate embeddings for chunks
+
+         Args:
+             chunks: List of chunk dictionaries
+             chunk_type: Type of chunks ("emergency" or "treatment")
+
+         Returns:
+             numpy array of embeddings
+         """
+         logger.info(f"Generating embeddings for {len(chunks)} {chunk_type} chunks...")
+
+         # Load model if not already loaded
+         model = self.load_embedding_model()
+
+         # Extract text from chunks
+         texts = [chunk['text'] for chunk in chunks]
+
+         # Generate embeddings in batches
+         batch_size = 32
+         embeddings = []
+
+         for i in range(0, len(texts), batch_size):
+             batch_texts = texts[i:i+batch_size]
+             batch_embeddings = model.encode(batch_texts, show_progress_bar=True)
+             embeddings.append(batch_embeddings)
+
+         # Concatenate all embeddings
+         all_embeddings = np.vstack(embeddings)
+
+         logger.info(f"Generated embeddings shape: {all_embeddings.shape}")
+         return all_embeddings
+
+     def build_annoy_index(self, embeddings: np.ndarray,
+                           index_name: str, n_trees: int = 10) -> AnnoyIndex:
+         """
+         Build ANNOY index from embeddings
+
+         Args:
+             embeddings: Numpy array of embeddings
+             index_name: Name for the index file
+             n_trees: Number of trees for ANNOY index
+
+         Returns:
+             Built ANNOY index
+         """
+         logger.info(f"Building ANNOY index: {index_name}")
+
+         # Create ANNOY index
+         index = AnnoyIndex(self.embedding_dim, 'angular')  # angular = cosine similarity
+
+         # Add vectors to index
+         for i, embedding in enumerate(embeddings):
+             index.add_item(i, embedding)
+
+         # Build index
+         index.build(n_trees)
+
+         # Save index
+         index_path = self.models_dir / "indices" / "annoy" / f"{index_name}.ann"
+         index_path.parent.mkdir(parents=True, exist_ok=True)
+         index.save(str(index_path))
+
+         logger.info(f"ANNOY index saved to: {index_path}")
+         return index
+
+     def save_chunks_and_embeddings(self, chunks: List[Dict[str, Any]],
+                                    embeddings: np.ndarray, chunk_type: str):
+         """
+         Save chunks metadata and embeddings
+
+         Args:
+             chunks: List of chunk dictionaries
+             embeddings: Numpy array of embeddings
+             chunk_type: Type of chunks ("emergency" or "treatment")
+         """
+         logger.info(f"Saving {chunk_type} chunks and embeddings...")
+
+         # Create output directories
+         embeddings_dir = self.models_dir / "embeddings"
+         embeddings_dir.mkdir(parents=True, exist_ok=True)
+
+         # Save chunks metadata
+         chunks_file = embeddings_dir / f"{chunk_type}_chunks.json"
+         with open(chunks_file, 'w', encoding='utf-8') as f:
+             json.dump(chunks, f, ensure_ascii=False, indent=2)
+
+         # Save embeddings
+         embeddings_file = embeddings_dir / f"{chunk_type}_embeddings.npy"
+         np.save(embeddings_file, embeddings)
+
+         logger.info(f"Saved {chunk_type} data:")
+         logger.info(f"  - Chunks: {chunks_file}")
+         logger.info(f"  - Embeddings: {embeddings_file}")
+
+     def validate_data_quality(self) -> Dict[str, Any]:
+         """
+         Validate data quality and return statistics
+
+         Returns:
+             Dictionary with validation statistics
+         """
+         logger.info("Validating data quality...")
+
+         validation_report = {
+             "emergency_data": {},
+             "treatment_data": {},
+             "chunks": {},
+             "embeddings": {}
+         }
+
+         # Emergency data validation
+         if self.emergency_data is not None:
+             validation_report["emergency_data"] = {
+                 "total_records": len(self.emergency_data),
+                 "records_with_text": self.emergency_data['clean_text'].notna().sum(),
+                 "records_with_keywords": self.emergency_data['matched'].notna().sum(),
+                 "avg_text_length": self.emergency_data['clean_text'].str.len().mean()
+             }
+
+         # Treatment data validation
+         if self.treatment_data is not None:
+             validation_report["treatment_data"] = {
+                 "total_records": len(self.treatment_data),
+                 "records_with_text": self.treatment_data['clean_text'].notna().sum(),
+                 "records_with_emergency_keywords": self.treatment_data['matched'].notna().sum(),
+                 "records_with_treatment_keywords": self.treatment_data['treatment_matched'].notna().sum(),
+                 "avg_text_length": self.treatment_data['clean_text'].str.len().mean()
+             }
+
+         # Chunks validation
+         validation_report["chunks"] = {
+             "emergency_chunks": len(self.emergency_chunks),
+             "treatment_chunks": len(self.treatment_chunks),
+             "total_chunks": len(self.emergency_chunks) + len(self.treatment_chunks)
+         }
+
+         if self.emergency_chunks:
+             avg_chunk_length = np.mean([len(chunk['text']) for chunk in self.emergency_chunks])
+             validation_report["chunks"]["avg_emergency_chunk_length"] = avg_chunk_length
+
+         if self.treatment_chunks:
+             avg_chunk_length = np.mean([len(chunk['text']) for chunk in self.treatment_chunks])
+             validation_report["chunks"]["avg_treatment_chunk_length"] = avg_chunk_length
+
+         # Check if embeddings exist
+         embeddings_dir = self.models_dir / "embeddings"
+         if embeddings_dir.exists():
+             emergency_emb_file = embeddings_dir / "emergency_embeddings.npy"
+             treatment_emb_file = embeddings_dir / "treatment_embeddings.npy"
+
+             validation_report["embeddings"] = {
+                 "emergency_embeddings_exist": emergency_emb_file.exists(),
+                 "treatment_embeddings_exist": treatment_emb_file.exists()
+             }
+
+             if emergency_emb_file.exists():
+                 emb = np.load(emergency_emb_file)
+                 validation_report["embeddings"]["emergency_embeddings_shape"] = emb.shape
+
+             if treatment_emb_file.exists():
+                 emb = np.load(treatment_emb_file)
+                 validation_report["embeddings"]["treatment_embeddings_shape"] = emb.shape
+
+         # Save validation report
+         report_file = self.models_dir / "data_validation_report.json"
+         with open(report_file, 'w', encoding='utf-8') as f:
+             json.dump(validation_report, f, indent=2, default=str)
+
+         logger.info(f"Validation report saved to: {report_file}")
+         return validation_report
+
+     def process_all_data(self) -> Dict[str, Any]:
+         """
+         Complete data processing pipeline
+
+         Returns:
+             Processing summary
+         """
+         logger.info("Starting complete data processing pipeline...")
+
+         # Step 1: Load filtered data
+         self.load_filtered_data()
+
+         # Step 2: Process chunks
+         emergency_chunks = self.process_emergency_chunks()
+         treatment_chunks = self.process_treatment_chunks()
+
+         # Step 3: Generate embeddings
+         emergency_embeddings = self.generate_embeddings(emergency_chunks, "emergency")
+         treatment_embeddings = self.generate_embeddings(treatment_chunks, "treatment")
+
+         # Step 4: Build ANNOY indices
+         emergency_index = self.build_annoy_index(emergency_embeddings, "emergency_index")
+         treatment_index = self.build_annoy_index(treatment_embeddings, "treatment_index")
+
+         # Step 5: Save data
+         self.save_chunks_and_embeddings(emergency_chunks, emergency_embeddings, "emergency")
+         self.save_chunks_and_embeddings(treatment_chunks, treatment_embeddings, "treatment")
+
+         # Step 6: Validate data quality
+         validation_report = self.validate_data_quality()
+
+         # Summary
+         summary = {
+             "status": "completed",
+             "emergency_chunks": len(emergency_chunks),
+             "treatment_chunks": len(treatment_chunks),
+             "emergency_embeddings_shape": emergency_embeddings.shape,
+             "treatment_embeddings_shape": treatment_embeddings.shape,
+             "indices_created": ["emergency_index.ann", "treatment_index.ann"],
+             "validation_report": validation_report
+         }
+
+         logger.info("Data processing pipeline completed successfully!")
+         logger.info(f"Summary: {summary}")
+
+         return summary
+
+ def main():
+     """Main function for testing the data processor"""
+     # Initialize processor
+     processor = DataProcessor()
+
+     # Run complete pipeline
+     summary = processor.process_all_data()
+
+     print("\n" + "="*50)
+     print("DATA PROCESSING COMPLETED")
+     print("="*50)
+     print(f"Emergency chunks: {summary['emergency_chunks']}")
+     print(f"Treatment chunks: {summary['treatment_chunks']}")
+     print(f"Emergency embeddings: {summary['emergency_embeddings_shape']}")
+     print(f"Treatment embeddings: {summary['treatment_embeddings_shape']}")
+     print(f"Indices created: {summary['indices_created']}")
+     print("="*50)
+
+ if __name__ == "__main__":
+     main()
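
Retrieval itself is deferred to the hybrid-search work listed under Next Steps, but as a rough sketch of how the artifacts written above could be queried (paths follow build_annoy_index() and save_chunks_and_embeddings(); the query string and working directory are assumptions):

import json
from pathlib import Path

from annoy import AnnoyIndex
from sentence_transformers import SentenceTransformer

models_dir = Path("models")  # assumes the project root as the working directory

# Load the index and chunk metadata written by DataProcessor
index = AnnoyIndex(768, 'angular')
index.load(str(models_dir / "indices" / "annoy" / "emergency_index.ann"))
with open(models_dir / "embeddings" / "emergency_chunks.json", encoding="utf-8") as f:
    chunks = json.load(f)

# Embed the query with the same model used at build time
model = SentenceTransformer("NeuML/pubmedbert-base-embeddings")
query_vec = model.encode("management of acute chest pain")

# Top-5 nearest chunks by angular (~cosine) distance
ids, distances = index.get_nns_by_vector(query_vec, 5, include_distances=True)
for i, dist in zip(ids, distances):
    print(f"{dist:.3f}  {chunks[i]['primary_keyword']}  {chunks[i]['text'][:80]}...")
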
tests/test_data_processing.py ADDED
@@ -0,0 +1,195 @@
+ """
+ Test script for data_processing.py
+
+ This script tests the basic functionality without running the full pipeline
+ to ensure everything is working correctly before proceeding with embedding generation.
+ """
+
+ import sys
+ import pandas as pd
+ from pathlib import Path
+
+ # Add src to path
+ sys.path.append(str(Path(__file__).parent.parent.resolve() / "src"))
+
+ from data_processing import DataProcessor
+ import logging
+
+ # Setup logging
+ logging.basicConfig(level=logging.INFO)
+ logger = logging.getLogger(__name__)
+
+ def test_data_loading():
+     """Test data loading functionality"""
+     print("="*50)
+     print("TESTING DATA LOADING")
+     print("="*50)
+
+     try:
+         # Initialize processor with explicit base directory
+         base_dir = Path(__file__).parent.parent.resolve()
+         processor = DataProcessor(base_dir=str(base_dir))
+
+         # Test data loading
+         emergency_data, treatment_data = processor.load_filtered_data()
+
+         print(f"✅ Emergency data loaded: {len(emergency_data)} records")
+         print(f"✅ Treatment data loaded: {len(treatment_data)} records")
+
+         # Check data structure
+         print("\nEmergency data columns:", list(emergency_data.columns))
+         print("Treatment data columns:", list(treatment_data.columns))
+
+         # Show sample data
+         if len(emergency_data) > 0:
+             print(f"\nSample emergency matched keywords: {emergency_data['matched'].iloc[0]}")
+
+         if len(treatment_data) > 0:
+             print(f"Sample treatment matched keywords: {treatment_data['treatment_matched'].iloc[0]}")
+
+         return True
+
+     except Exception as e:
+         print(f"❌ Data loading failed: {e}")
+         return False
+
+ def test_chunking():
+     """Test chunking functionality"""
+     print("\n" + "="*50)
+     print("TESTING CHUNKING FUNCTIONALITY")
+     print("="*50)
+
+     try:
+         # Initialize processor
+         processor = DataProcessor()
+
+         # Load data
+         processor.load_filtered_data()
+
+         # Test emergency chunking (just first few records)
+         print("Testing emergency chunking...")
+         emergency_chunks = []
+         for idx, row in processor.emergency_data.head(3).iterrows():
+             if pd.notna(row.get('clean_text')) and pd.notna(row.get('matched')):
+                 chunks = processor.create_keyword_centered_chunks(
+                     text=row['clean_text'],
+                     matched_keywords=row['matched'],
+                     chunk_size=512,
+                     doc_id=str(row.get('id', idx))
+                 )
+                 emergency_chunks.extend(chunks)
+
+         print(f"✅ Generated {len(emergency_chunks)} emergency chunks from 3 records")
+
+         # Test treatment chunking (just first few records)
+         print("Testing treatment chunking...")
+         treatment_chunks = []
+         for idx, row in processor.treatment_data.head(3).iterrows():
+             if (pd.notna(row.get('clean_text')) and
+                     pd.notna(row.get('treatment_matched'))):
+                 chunks = processor.create_dual_keyword_chunks(
+                     text=row['clean_text'],
+                     emergency_keywords=row.get('matched', ''),
+                     treatment_keywords=row['treatment_matched'],
+                     chunk_size=512,
+                     doc_id=str(row.get('id', idx))
+                 )
+                 treatment_chunks.extend(chunks)
+
+         print(f"✅ Generated {len(treatment_chunks)} treatment chunks from 3 records")
+
+         # Show sample chunk
+         if emergency_chunks:
+             sample_chunk = emergency_chunks[0]
+             print("\nSample emergency chunk:")
+             print(f"  Primary keyword: {sample_chunk['primary_keyword']}")
+             print(f"  Text length: {len(sample_chunk['text'])}")
+             print(f"  Text preview: {sample_chunk['text'][:100]}...")
+
+         if treatment_chunks:
+             sample_chunk = treatment_chunks[0]
+             print("\nSample treatment chunk:")
+             print(f"  Primary keyword: {sample_chunk['primary_keyword']}")
+             print(f"  Emergency keywords: {sample_chunk['emergency_keywords']}")
+             print(f"  Text length: {len(sample_chunk['text'])}")
+             print(f"  Text preview: {sample_chunk['text'][:100]}...")
+
+         return True
+
+     except Exception as e:
+         print(f"❌ Chunking test failed: {e}")
+         import traceback
+         traceback.print_exc()
+         return False
+
+ def test_model_loading():
+     """Test if we can load the embedding model"""
+     print("\n" + "="*50)
+     print("TESTING MODEL LOADING")
+     print("="*50)
+
+     try:
+         processor = DataProcessor()
+
+         print("Loading NeuML/pubmedbert-base-embeddings...")
+         model = processor.load_embedding_model()
+
+         print(f"✅ Model loaded successfully: {processor.embedding_model_name}")
+         print(f"✅ Model max sequence length: {model.max_seq_length}")
+
+         # Test a simple encoding
+         test_text = "Patient presents with chest pain and shortness of breath."
+         embedding = model.encode([test_text])
+
+         print(f"✅ Test embedding shape: {embedding.shape}")
+         print(f"✅ Expected dimension: {processor.embedding_dim}")
+
+         assert embedding.shape[1] == processor.embedding_dim, f"Dimension mismatch: {embedding.shape[1]} != {processor.embedding_dim}"
+
+         return True
+
+     except Exception as e:
+         print(f"❌ Model loading failed: {e}")
+         import traceback
+         traceback.print_exc()
+         return False
+
+ def main():
+     """Run all tests"""
+     print("Starting data processing tests...\n")
+
+     # pandas is imported at module level for the chunking test
+
+     tests = [
+         test_data_loading,
+         test_chunking,
+         test_model_loading
+     ]
+
+     results = []
+     for test in tests:
+         result = test()
+         results.append(result)
+
+     print("\n" + "="*50)
+     print("TEST SUMMARY")
+     print("="*50)
+
+     for i, (test, result) in enumerate(zip(tests, results), 1):
+         status = "✅ PASSED" if result else "❌ FAILED"
+         print(f"{i}. {test.__name__}: {status}")
+
+     all_passed = all(results)
+
+     if all_passed:
+         print("\n🎉 All tests passed! Ready to proceed with full pipeline.")
+         print("\nTo run the full data processing pipeline:")
+         print("cd FinalProject && python src/data_processing.py")
+     else:
+         print("\n⚠️ Some tests failed. Please check the issues above.")
+
+     return all_passed
+
+ if __name__ == "__main__":
+     main()