Spaces: KalbeDigitalLab/nutrigenme-paper-extractor

fadliaulawi committed about 12 hours ago

Commit 73c34df • 1 Parent(s): dd9817d

Add llm validation

Files changed (2)

prompt.py +12 -6
validate.py +25 -0

prompt.py CHANGED

@@ -271,8 +271,8 @@ If there is no specific extracted entities provided from the table, just leave t
 prompt_validation = """
 # CONTEXT #
-In my capacity as a genomics specialist, I have table data containing gene names with their corresponding SNPs and diseases. The data is provided in a list of JSON format, with each JSON object representing a single row in a tabular structure.
-The problem is because the data is extracted using OCR, some gene names and SNPs may have a typo.
 This is the data:
 {}
@@ -280,19 +280,25 @@ This is the data:
 # OBJECTIVE #
 Given the provided table data, the following tasks need to be completed:
-1. Check whether the gene name is the correct gene name. If the gene name is suspected of a typo, fix it into the correct form. If the gene name seems like a mistake entirely or invalid, remove the data row. Common errors include:
     - Combined Names: Two gene names erroneously merged into one. Duplicate this data row so each gene name has its own data.
     - OCR Errors: Similar characters misread by the system. Correct these to the intended form.
 2. If SNP is not empty, check whether the gene name corresponds with the SNP. Fix it with the correct SNP if the original SNP is wrong.
-3. If diseases are not empty, check whether the gene name corresponds with the diseases. Fix it with the correct diseases if the original disease is wrong.
 # RESPONSE #
 The output must be only a string containing a list of JSON objects, adhering to the identical structure present in the original input data. Each object representing a validated entry with the following structure:
 [
     {{
         "Genes": "A",
-        "SNPs": "rs123",
-        "Diseases": "A disease"
     }}
 ]
 """

 prompt_validation = """
 # CONTEXT #
+In my capacity as a genomics specialist, I have table data containing gene names with their corresponding rsID and diseases. The data is provided in a list of JSON format, with each JSON object representing a single row in a tabular structure.
+Because the data was extracted using OCR, some gene names and rsIDs may contain typos.
 This is the data:
 {}
 # OBJECTIVE #
 Given the provided table data, the following tasks need to be completed:
+1. Check whether the gene name is the correct gene name. If the gene name is suspected of a typo, fix it into the correct form. If the gene name seems like a mistake entirely or invalid, leave it blank. Common errors include:
     - Combined Names: Two gene names erroneously merged into one. Duplicate this data row so each gene name has its own data.
     - OCR Errors: Similar characters misread by the system. Correct these to the intended form.
 2. If SNP is not empty, check whether the gene name corresponds with the SNP. Fix it with the correct SNP if the original SNP is wrong.
+3. If the diseases field is not empty, verify that each entry is a recognized medical condition and has documented correlations with its associated gene and SNP entries. Clear any disease entries that fail either validation check.
+IMPORTANT: It is crucial to maintain the utmost accuracy in this process, as any false or fabricated information (hallucination) can have severe consequences for academic integrity and research credibility. If you are not sure about some values, just leave the corresponding field blank with an empty string ('').
+If there are any mistakes, don't remove the data row. Instead, either fix the mistake or replace the value with a blank string.
 # RESPONSE #
 The output must be only a string containing a list of JSON objects, adhering to the identical structure present in the original input data. Each object representing a validated entry with the following structure:
 [
     {{
         "Genes": "A",
+        "rsID": "rs123",
+        "OR Value": 1.25,
+        "Beta Value": 0.02,
+        "P Value": 0.51,
+        "Traits": "A disease"
     }}
 ]
 """

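The template above depends on how `str.format` treats braces: the lone `{}` is the slot that receives the table, while the doubled `{{` and `}}` around the example object come out as literal braces. A minimal sketch of that behaviour follows. The `template` and `rows` names, the shortened wording, and the sample gene and rsID are illustrative assumptions, not the project's actual values.

```python
import json

# Hypothetical, shortened stand-in for prompt_validation.
template = """This is the data:
{}
Return a list shaped like:
[
    {{
        "Genes": "A",
        "rsID": "rs123"
    }}
]
"""

# Illustrative input row; validate.py builds this from a DataFrame instead.
rows = [{"Genes": "MTHFR", "rsID": "rs1801133"}]

# The single {} receives the JSON table; {{ and }} collapse to { and }.
prompt = template.format(json.dumps(rows, indent=2))
print(prompt)
```

Braces inside the inserted JSON are safe because `str.format` only parses the template, not the argument.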
validate.py CHANGED

@@ -5,6 +5,7 @@ from prompt import *
 from utils import *
 import os
 import re
 load_dotenv()
@@ -167,6 +168,30 @@ class Validation():
                 genes.append(gene)
                 snps.append(snp)
         df.reset_index(drop=True, inplace=True)
         return df, df_clean

 from utils import *
 import os
+import json
 import re
 load_dotenv()
                 genes.append(gene)
                 snps.append(snp)
+        df.drop_duplicates(['Genes', 'rsID'], ignore_index=True, inplace=True)
+        # Validate genes and diseases with LLM (for each 20 rows)
+        idx = 0
+        df_llm = pd.DataFrame()
+        while True:
+            json_table = df[idx:idx+20].to_json(orient='records')
+            str_json_table = json.dumps(json.loads(json_table), indent=2)
+            result = self.llm.invoke(input=prompt_validation.format(str_json_table)).content
+            result = result[result.find('['):result.rfind(']')+1]
+            try:
+                result = json.loads(result)  # parse as data; never execute model output
+            except json.JSONDecodeError:
+                result = []
+            df_llm = pd.concat([df_llm, pd.DataFrame(result)])
+            idx += 20
+            if idx >= len(df):
+                break
+        df = df_llm.copy()
         df.reset_index(drop=True, inplace=True)
         return df, df_clean
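
The batching step in this hunk can be sketched in isolation. This is a minimal illustration and not the project's implementation: it uses plain lists of dicts in place of the DataFrame so it runs without pandas, and `extract_json_list`, `validate_in_batches`, and `echo_llm` are hypothetical names. The `echo_llm` stub stands in for `self.llm.invoke(...).content` and simply returns the table wrapped in extra text.

```python
import json

def extract_json_list(text):
    """Return the list found between the first '[' and last ']' of text, else []."""
    start, end = text.find('['), text.rfind(']')
    if start == -1 or end < start:
        return []
    try:
        parsed = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return []
    return parsed if isinstance(parsed, list) else []

def validate_in_batches(rows, call_llm, prompt_template, batch_size=20):
    """Send rows to call_llm in fixed-size batches and collect the parsed replies."""
    validated = []
    for start in range(0, len(rows), batch_size):
        batch = rows[start:start + batch_size]
        prompt = prompt_template.format(json.dumps(batch, indent=2))
        validated.extend(extract_json_list(call_llm(prompt)))
    return validated

def echo_llm(prompt):
    # Hypothetical stand-in for the model: echoes the table with chatter around it.
    payload = prompt[prompt.find('['):prompt.rfind(']') + 1]
    return "Here is the validated table:\n" + payload + "\nDone."

rows = [{"Genes": f"G{i}", "rsID": f"rs{i}"} for i in range(45)]
result = validate_in_batches(rows, echo_llm, "This is the data:\n{}", batch_size=20)
print(len(result))
```

Iterating with `range(0, len(rows), batch_size)` makes no model call when the input is empty, and a batch whose reply cannot be parsed contributes no rows, which matches the `result = []` fallback in the diff.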