fadliaulawi
committed on
Commit 73c34df
1 Parent(s): dd9817d
Add llm validation
- prompt.py +12 -6
- validate.py +25 -0
prompt.py
CHANGED
@@ -271,8 +271,8 @@ If there is no specific extracted entities provided from the table, just leave t
|
|
271 |
|
272 |
prompt_validation = """
|
273 |
# CONTEXT #
|
274 |
-
In my capacity as a genomics specialist, I have table data containing gene names with their corresponding
|
275 |
-
The problem is because the data is extracted using OCR, some gene names and
|
276 |
|
277 |
This is the data:
|
278 |
{}
|
@@ -280,19 +280,25 @@ This is the data:
|
|
280 |
# OBJECTIVE #
|
281 |
Given the provided table data, the following tasks need to be completed:
|
282 |
|
283 |
-
1. Check whether the gene name is the correct gene name. If the gene name is suspected of a typo, fix it into the correct form. If the gene name seems like a mistake entirely or invalid,
|
284 |
- Combined Names: Two gene names erroneously merged into one. Duplicate this data row so each gene name has its own data.
|
285 |
- OCR Errors: Similar characters misread by the system. Correct these to the intended form.
|
286 |
2. If SNP is not empty, check whether the gene name corresponds with the SNP. Fix it with the correct SNP if the original SNP is wrong.
|
287 |
-
3. If diseases
|
288 |
|
289 |
# RESPONSE #
|
290 |
The output must be only a string containing a list of JSON objects, adhering to the identical structure present in the original input data. Each object representing a validated entry with the following structure:
|
291 |
[
|
292 |
{{
|
293 |
"Genes": "A",
|
294 |
-
"
|
295 |
-
"
|
296 |
}}
|
297 |
]
|
298 |
"""
|
|
|
271 |
|
272 |
prompt_validation = """
|
273 |
# CONTEXT #
|
274 |
+
In my capacity as a genomics specialist, I have table data containing gene names with their corresponding rsID and diseases. The data is provided in a list of JSON format, with each JSON object representing a single row in a tabular structure.
|
275 |
+
The problem is because the data is extracted using OCR, some gene names and rsIDs may have a typo.
|
276 |
|
277 |
This is the data:
|
278 |
{}
|
|
|
280 |
# OBJECTIVE #
|
281 |
Given the provided table data, the following tasks need to be completed:
|
282 |
|
283 |
+
1. Check whether the gene name is the correct gene name. If the gene name is suspected of a typo, fix it into the correct form. If the gene name seems like a mistake entirely or invalid, leave it blank. Common errors include:
|
284 |
- Combined Names: Two gene names erroneously merged into one. Duplicate this data row so each gene name has its own data.
|
285 |
- OCR Errors: Similar characters misread by the system. Correct these to the intended form.
|
286 |
2. If SNP is not empty, check whether the gene name corresponds with the SNP. Fix it with the correct SNP if the original SNP is wrong.
|
287 |
+
3. If the diseases field is not empty, verify that each entry is a recognized medical condition and has documented correlations with its associated gene and SNP entries. Clear any disease entries that fail either validation check.
|
288 |
+
|
289 |
+
IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If you are not sure about some values, just leave the corresponding field blank with an empty string ('').
|
290 |
+
If there are any mistakes, don't remove the data row. Instead, either fix the mistake or replace the value with a blank string.
|
291 |
|
292 |
# RESPONSE #
|
293 |
The output must be only a string containing a list of JSON objects, adhering to the identical structure present in the original input data. Each object representing a validated entry with the following structure:
|
294 |
[
|
295 |
{{
|
296 |
"Genes": "A",
|
297 |
+
"rsID": "rs123",
|
298 |
+
"OR Value": 1.25,
|
299 |
+
"Beta Value": 0.02,
|
300 |
+
"P Value": 0.51,
|
301 |
+
"Traits": "A disease"
|
302 |
}}
|
303 |
]
|
304 |
"""
|
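A note on the template syntax in `prompt_validation`: the bare `{}` is the slot that `str.format` fills with the table data, while the doubled `{{` and `}}` in the RESPONSE section are escapes that come out as single literal braces. The sketch below illustrates this with a shortened, hypothetical stand-in for the real prompt (the `template` and `rows` names are assumptions made for illustration, not part of the commit).

```python
import json

# Hypothetical, shortened stand-in for prompt_validation (not the real prompt).
template = """This is the data:
{}

# RESPONSE #
[
  {{
    "Genes": "A",
    "rsID": "rs123"
  }}
]
"""

# Illustrative rows only; in validate.py the data comes from DataFrame.to_json.
rows = [{"Genes": "A", "rsID": "rs123"}]

# The single {} receives the pretty-printed JSON; {{ and }} become { and }.
prompt = template.format(json.dumps(rows, indent=2))
print(prompt)
```

If the example object in the prompt used single braces instead, `str.format` would try to interpret them as replacement fields and raise an error, which is presumably why the commit keeps them doubled.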
validate.py
CHANGED
@@ -5,6 +5,7 @@ from prompt import *
|
|
5 |
from utils import *
|
6 |
|
7 |
import os
|
|
|
8 |
import re
|
9 |
|
10 |
load_dotenv()
|
@@ -167,6 +168,30 @@ class Validation():
|
|
167 |
genes.append(gene)
|
168 |
snps.append(snp)
|
169 |
|
170 |
df.reset_index(drop=True, inplace=True)
|
171 |
|
172 |
return df, df_clean
|
|
|
5 |
from utils import *
|
6 |
|
7 |
import os
|
8 |
+
import json
|
9 |
import re
|
10 |
|
11 |
load_dotenv()
|
|
|
168 |
genes.append(gene)
|
169 |
snps.append(snp)
|
170 |
|
171 |
+
df.drop_duplicates(['Genes', 'rsID'], ignore_index=True, inplace=True)
|
172 |
+
|
173 |
+
# Validate genes and diseases with LLM (for each 20 rows)
|
174 |
+
idx = 0
|
175 |
+
df_llm = pd.DataFrame()
|
176 |
+
|
177 |
+
while True:
|
178 |
+
json_table = df[idx:idx+20].to_json(orient='records')
|
179 |
+
str_json_table = json.dumps(json.loads(json_table), indent=2)
|
180 |
+
|
181 |
+
result = self.llm.invoke(input=prompt_validation.format(str_json_table)).content
|
182 |
+
result = result[result.find('['):result.rfind(']')+1]
|
183 |
+
try:
|
184 |
+
result = eval(result)
|
185 |
+
except SyntaxError:
|
186 |
+
result = []
|
187 |
+
|
188 |
+
df_llm = pd.concat([df_llm, pd.DataFrame(result)])
|
189 |
+
|
190 |
+
idx += 20
|
191 |
+
if idx not in df.index:
|
192 |
+
break
|
193 |
+
|
194 |
+
df = df_llm.copy()
|
195 |
df.reset_index(drop=True, inplace=True)
|
196 |
|
197 |
return df, df_clean
|
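Review note on the added loop in validate.py: the reply is parsed with `eval`, which executes whatever the model returns and only catches `SyntaxError`. JSON literals such as `null`, `true`, and `false` are not Python names, so a reply containing them would raise `NameError` and escape the `except` clause; `DataFrame.to_json` emits `null` for missing values, so such replies are plausible. A minimal sketch of a stricter alternative follows, assuming the model returns valid JSON. The helper names `parse_llm_rows` and `chunk_bounds` are hypothetical and do not exist in the repository.

```python
import json

def parse_llm_rows(text):
    """Extract the outermost [...] span from a reply and parse it as JSON.

    Returns an empty list when no list is found or the span is not valid JSON.
    """
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end < start:
        return []
    try:
        rows = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    return rows if isinstance(rows, list) else []

def chunk_bounds(n_rows, size=20):
    """Yield (start, stop) slice bounds covering n_rows in steps of size."""
    for start in range(0, n_rows, size):
        yield start, min(start + size, n_rows)

# JSON null is accepted and mapped to None instead of raising NameError.
print(parse_llm_rows('Result: [{"Genes": "A", "OR Value": null}]'))
# 45 rows are covered by three slices; 0 rows produce no LLM calls at all.
print(list(chunk_bounds(45)))
```

Iterating over `chunk_bounds(len(df))` would also replace the `while True` / `if idx not in df.index` pair, and it skips the LLM call entirely when the frame is empty. Unlike `eval`, `json.loads` rejects single-quoted strings, so replies that are Python literals rather than JSON would be dropped as `[]` under this sketch.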