fadliaulawi committed on
Commit
73c34df
1 Parent(s): dd9817d

Add llm validation

Files changed (2)
  1. prompt.py +12 -6
  2. validate.py +25 -0
prompt.py CHANGED
@@ -271,8 +271,8 @@ If there is no specific extracted entities provided from the table, just leave t
 
 prompt_validation = """
 # CONTEXT #
-In my capacity as a genomics specialist, I have table data containing gene names with their corresponding SNPs and diseases. The data is provided in a list of JSON format, with each JSON object representing a single row in a tabular structure.
-The problem is because the data is extracted using OCR, some gene names and SNPs may have a typo.
+In my capacity as a genomics specialist, I have table data containing gene names with their corresponding rsID and diseases. The data is provided in a list of JSON format, with each JSON object representing a single row in a tabular structure.
+The problem is because the data is extracted using OCR, some gene names and rsIDs may have a typo.
 
 This is the data:
 {}
@@ -280,19 +280,25 @@ This is the data:
 # OBJECTIVE #
 Given the provided table data, the following tasks need to be completed:
 
-1. Check whether the gene name is the correct gene name. If the gene name is suspected of a typo, fix it into the correct form. If the gene name seems like a mistake entirely or invalid, remove the data row. Common errors include:
+1. Check whether the gene name is the correct gene name. If the gene name is suspected of a typo, fix it into the correct form. If the gene name seems like a mistake entirely or invalid, leave it blank. Common errors include:
 - Combined Names: Two gene names erroneously merged into one. Duplicate this data row so each gene name has its own data.
 - OCR Errors: Similar characters misread by the system. Correct these to the intended form.
 2. If SNP is not empty, check whether the gene name corresponds with the SNP. Fix it with the correct SNP if the original SNP is wrong.
-3. If diseases are not empty, check whether the gene name corresponds with the diseases. Fix it with the correct diseases if the original disease is wrong.
+3. If the diseases field is not empty, verify that each entry is a recognized medical condition and has documented correlations with its associated gene and SNP entries. Clear any disease entries that fail either validation check.
+
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If you are not sure about some values, just leave the corresponding field blank with an empty string ('').
+If there are any mistakes, don't remove the data row. Instead, either fix the mistake or replace the value with a blank string.
 
 # RESPONSE #
 The output must be only a string containing a list of JSON objects, adhering to the identical structure present in the original input data. Each object representing a validated entry with the following structure:
 [
 {{
 "Genes": "A",
-"SNPs": "rs123",
-"Diseases": "A disease"
+"rsID": "rs123",
+"OR Value": 1.25,
+"Beta Value": 0.02,
+"P Value": 0.51,
+"Traits": "A disease"
 }}
 ]
 """
validate.py CHANGED
@@ -5,6 +5,7 @@ from prompt import *
 from utils import *
 
 import os
+import json
 import re
 
 load_dotenv()
@@ -167,6 +168,30 @@ class Validation():
             genes.append(gene)
             snps.append(snp)
 
+        df.drop_duplicates(['Genes', 'rsID'], ignore_index=True, inplace=True)
+
+        # Validate genes and diseases with LLM (for each 20 rows)
+        idx = 0
+        df_llm = pd.DataFrame()
+
+        while True:
+            json_table = df[idx:idx+20].to_json(orient='records')
+            str_json_table = json.dumps(json.loads(json_table), indent=2)
+
+            result = self.llm.invoke(input=prompt_validation.format(str_json_table)).content
+            result = result[result.find('['):result.rfind(']')+1]
+            try:
+                result = eval(result)
+            except SyntaxError:
+                result = []
+
+            df_llm = pd.concat([df_llm, pd.DataFrame(result)])
+
+            idx += 20
+            if idx not in df.index:
+                break
+
+        df = df_llm.copy()
         df.reset_index(drop=True, inplace=True)
 
         return df, df_clean
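
Note on the batching loop added to validate.py: the reply is parsed with `eval`, and only `SyntaxError` is caught. A reply containing JSON literals such as `null`, `true`, or `false` would raise `NameError` instead, and `eval` will run any expression the model returns. The sketch below shows one possible alternative that parses with `json.loads` and batches with `range`, so an empty table makes no LLM call. It is not the committed implementation. The names `iter_batches`, `parse_json_list`, `validate_in_batches`, and the `call_llm` callable are hypothetical, and it works on plain lists of dicts rather than a DataFrame to stay dependency-free.

```python
import json
from typing import Any, Callable, Dict, Iterator, List

Row = Dict[str, Any]  # one table row, e.g. {"Genes": "A", "rsID": "rs123"}


def iter_batches(rows: List[Row], size: int = 20) -> Iterator[List[Row]]:
    """Yield consecutive slices of at most `size` rows (nothing for an empty list)."""
    for start in range(0, len(rows), size):
        yield rows[start:start + size]


def parse_json_list(text: str) -> List[Row]:
    """Parse the outermost [...] span of an LLM reply as JSON.

    Returns an empty list when no bracketed span exists or it is not valid JSON.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    # Keep only JSON objects; stray scalars in the list are dropped.
    return [row for row in parsed if isinstance(row, dict)]


def validate_in_batches(
    rows: List[Row],
    call_llm: Callable[[str], str],  # hypothetical: takes a prompt, returns the reply text
    prompt_template: str,            # assumed to contain a single `{}` placeholder
    size: int = 20,
) -> List[Row]:
    """Send rows to the LLM in batches and collect the parsed replies."""
    validated: List[Row] = []
    for batch in iter_batches(rows, size):
        prompt = prompt_template.format(json.dumps(batch, indent=2))
        validated.extend(parse_json_list(call_llm(prompt)))
    return validated
```

With a DataFrame, `rows` could come from `df.to_dict(orient='records')`, and the returned list can be passed to `pd.DataFrame(...)` as the commit already does with `result`.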