johaunh committed
Commit · 4b9251f
Parent(s): 049a574

updated for ai4ed
- README.md +58 -62
- config/default_config.ini +8 -0
- schema.yml → config/modules.yml +22 -1
- config/validation.yml +61 -0
- demo/config.ini +8 -0
- demo/demo.ipynb +291 -0
- demo/example.txt +1 -0
- main.py +0 -263
- requirements.txt +8 -8
- src/words2wisdom/__init__.py +30 -0
- src/words2wisdom/__main__.py +3 -0
- src/words2wisdom/cli.py +97 -0
- src/words2wisdom/config.py +96 -0
- src/words2wisdom/gui.py +86 -0
- chains.py → src/words2wisdom/output_parsers.py +26 -21
- src/words2wisdom/pipeline.py +183 -0
- src/words2wisdom/utils.py +62 -0
- src/words2wisdom/validate.py +179 -0
- utils.py +0 -31
- writeup/words2wisdom_short.pdf +0 -0
README.md
CHANGED
@@ -1,86 +1,82 @@
- ---
- title: Text2KG
- app_file: main.py
- sdk: gradio
- sdk_version: 3.39.0
- pinned: true
- license: mit
- emoji: 🧠🌐
- colorFrom: indigo
- colorTo: gray
- ---
- # Text2KG
-
- We introduce Text2KG, an intuitive, domain-independent tool that leverages the creative generative ability of GPT-3.5 in the KG construction process. Text2KG automates and accelerates the construction of KGs from unstructured plain text, reducing the need for traditionally-used human labor and computer resources. Our approach incorporates a novel, clause-based text simplification step, reducing the processing of even the most extensive corpora down to the order of minutes. With Text2KG, we aim to streamline the creation of databases from natural language, offering a robust, cost-effective, and user-friendly solution for KG construction.
-
- python main.py
- ```
-
- >>> from main import extract_knowledge_graph
  ```
-
- api_key (str)
-     OpenAI API key
- batch_size (int)
-     Number of sentences per forward pass
- text (str)
-     Input text to extract knowledge graph from
- progress
-     Progress bar. The default is Gradio's progress bar;
-     set `progress = tqdm` for implementations outside of Gradio
- ```
+ # Words2Wisdom
+
+ This is the repository for Words2Wisdom. It is still a work in progress.
+
+ **Paper:** [here](./writeup/words2wisdom_short.pdf) (Accepted as a poster to the AAAI AI4ED '24 Workshop)
+
+ **Hugging Face Space:** [Words2Wisdom](https://huggingface.co/spaces/jhatchett/Words2Wisdom)
+
+ **Abstract:**
+ Large language models (LLMs) have emerged as powerful tools with vast potential across various domains. While they have the potential to transform the educational landscape with personalized learning experiences, these models face challenges such as high training and usage costs, and susceptibility to inaccuracies. One promising solution to these challenges lies in leveraging knowledge graphs (KGs) for knowledge injection. By integrating factual content into pre-trained LLMs, KGs can reduce the costs associated with domain alignment, mitigate the risk of hallucination, and enhance the interpretability of the models' outputs. To meet the need for efficient knowledge graph creation, we introduce *Words2Wisdom* (W2W), a domain-independent LLM-based tool that automatically generates KGs from plain text. With W2W, we aim to provide a streamlined KG construction option that can drive advancements in grounded LLM-based educational technologies.
+
+ ## Demonstration
+
+ The `demo/demo.ipynb` notebook walks through how to use the `words2wisdom` pipeline.
+
+ ## Usage
+
+ Due to the large number of configurable parameters, `words2wisdom` uses a configuration INI file:
+
+ ```ini
+ [pipeline]
+ words_per_batch = 150                # any positive integer
+ preprocess = clause_deconstruction   # {None, clause_deconstruction}
+ extraction = triplet_extraction      # {triplet_extraction}
+
+ [llm]
+ model = gpt-3.5-turbo
+ # other GPT params like temperature, etc. can be set here too
  ```
+
+ A template configuration file can be generated with the command-line interface. **Note:** If `openai_api_key` is not explicitly set, the config will automatically try to read from the `OPENAI_API_KEY` environment variable.
+
+ ### From the CLI
+
+ All commands are preceded by `python -m words2wisdom`.
+
+ | In order to... | Use the command... |
+ | -- | -- |
+ | Create a new config file | `init > path/to/write/config.ini` |
+ | Generate KG from text | `run path/to/text.txt [--config CONFIG] [--output-dir OUTPUT_DIR]` |
+ | Evaluate `words2wisdom` outputs | `eval path/to/output.zip` |
+ | Use `words2wisdom` from the Gradio interface (default config only) | `gui` |
+
+ ### As a `Python` package
+
+ Import the primary pipeline class using
+
+ ```python
+ from words2wisdom.pipeline import Pipeline
+
+ # configure pipeline from .ini
+ pipe = Pipeline.from_ini("path/to/config.ini")
+ text_batches, knowledge_graph = pipe.run("The cat sat on the mat")
+ ```
+
  ## File structure

  ```
+ ├── config
+ │   ├── default_config.ini
+ │   ├── modules.yml
+ │   └── validation.yml
+ ├── demo
+ │   ├── config.ini
+ │   ├── demo.ipynb
+ │   └── example.txt
+ ├── src/words2wisdom
+ │   ├── __init__.py
+ │   ├── __main__.py
+ │   ├── cli.py
+ │   ├── config.py
+ │   ├── gui.py
+ │   ├── output_parsers.py
+ │   ├── pipeline.py
+ │   ├── utils.py
+ │   └── validate.py
+ ├── writeup
+ │   └── words2wisdom_short.pdf
+ ├── LICENSE.md
+ ├── README.md
+ └── requirements.txt
  ```
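To make the README's usage notes concrete, here is a minimal end-to-end sketch combining the environment-variable fallback with the package API (the input sentences and key placeholder are illustrative, and it assumes the package is importable, e.g. via the `sys.path` trick used in the demo notebook):

```python
import os
from words2wisdom.pipeline import Pipeline

# with no openai_api_key in the INI, the config falls back to this variable
os.environ["OPENAI_API_KEY"] = "sk-..."  # placeholder, not a real key

pipe = Pipeline.from_ini("config/default_config.ini")
batches, kg = pipe.run("Prokaryotes lack a nucleus. Eukaryotes have a nucleus.")
print(kg.head())  # columns: batch_id, subject, relation, object
```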
config/default_config.ini
ADDED
@@ -0,0 +1,8 @@
+ [pipeline]
+ words_per_batch = 150
+ preprocess = clause_deconstruction
+ extraction = triplet_extraction
+
+ [llm]
+ openai_api_key = None
+ model = gpt-3.5-turbo
schema.yml → config/modules.yml
RENAMED
@@ -1,8 +1,28 @@
  clause_deconstruction:
+   type: preprocess
+   parser: StrOutputParser
+   prompts:
+     system: |
+       You are a sentence parsing agent helping to simplify complex syntax.
+
+       The aim is to split the given text into a meaning-preserving sequence of simpler, shorter sentences.
+       Each of the short sentences should be self-contained, and should not co-reference other sentences.
+       Try to aim for one clause per sentence. Your response should be the split/rephrased text ONLY.
+
+       EXAMPLE: Dogs and cats chase squirrels, but fish do not.
+
+       ACCEPTABLE RESPONSE: Dogs chase squirrels. Cats chase squirrels. Fish do not chase squirrels.
+       UNACCEPTABLE RESPONSE: Dogs and cats chase squirrels. Fish do not chase them.
+
+     human: |
+       {text}
+
+ premise_extraction:
+   type: preprocess
    parser: ClauseParser
    prompts:
      system: |
-       You are a sentence parsing agent helping to
+       You are a sentence parsing agent helping to simplify complex syntax.

        Given the text, extract a list of the premises embedded within it.
        Focus on identifying declarative sentences that convey factual information.

@@ -19,6 +39,7 @@ clause_deconstruction:
        {text}

  triplet_extraction:
+   type: extraction
    parser: TripletParser
    prompts:
      system: |
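Each top-level entry in this file is consumed by the `Module` class in the new `pipeline.py` (later in this commit), which maps the `prompts` block onto a LangChain chat prompt. A minimal sketch of that step in isolation, assuming the repository root as working directory:

```python
import yaml
from langchain_core.prompts import ChatPromptTemplate

with open("config/modules.yml") as f:
    modules_config = yaml.safe_load(f)

# the prompts dict, e.g. {"system": "...", "human": "{text}"},
# maps directly onto (role, template) message pairs
prompt = ChatPromptTemplate.from_messages(
    list(modules_config["clause_deconstruction"]["prompts"].items())
)
print(prompt.format(text="Dogs and cats chase squirrels, but fish do not."))
```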
config/validation.yml
ADDED
@@ -0,0 +1,61 @@
+ instruction: >
+   We are in the process of evaluating an AI-extracted knowledge graph.
+   Your task involves assessing the accuracy and specificity of a component
+   of the graph known as a triplet, which represents a key idea or fact. A
+   triplet comprises three components: a subject entity s, a relation r, and
+   an object entity o (format: [s, r, o]). We emphasize that the order of these
+   components is significant; the subject s relates to the object o via the
+   relation r. That is, the relation points from the subject to the object. Our
+   AI agent extracts a collection of triplets from each passage provided. The
+   evaluation task has 5 subtasks:
+ questions:
+   Q1:
+     title: Specificity (subject entity).
+     text: Does the subject entity represent a specific term/concept referenced in the passage?
+     additional:
+     options:
+       - 1 = Specific
+       - 0 = Not specific
+   Q2:
+     title: Specificity (object entity).
+     text: Does the object entity represent a specific term/concept referenced in the passage?
+     additional:
+     options:
+       - 1 = Specific
+       - 0 = Not specific
+   Q3:
+     title: Relation Validity.
+     text: Does the relation correctly connect the subject entity to the object entity?
+     additional:
+     options:
+       - 1 = Correct
+       - 0 = Incorrect
+   Q4:
+     title: Triplet Relevance.
+     text: Evaluate the importance of the triplet in relation to the passage's meaning.
+     additional: Rate from 0 to 2.
+     options:
+       - 0 = Not relevant to understanding
+       - 1 = Helpful, but not essential to understanding
+       - 2 = Critical fact of passage
+   Q5:
+     title: Triplet Comprehensiveness.
+     text: Assess whether the triplet can function independently from the passage, effectively conveying one of the passage's key ideas.
+     additional: Rate from 0 to 2.
+     options:
+       - 0 = Cannot function without context
+       - 1 = Somewhat dependent on passage
+       - 2 = Comprehensive
+ example:
+   passage: >
+     Several cells of one kind that interconnect with each other and perform a shared function form
+     tissues. These tissues combine to form an organ (your stomach, heart, or brain), and several
+     organs comprise an organ system (such as the digestive system, circulatory system, or nervous
+     system).
+   triplet: "['organ', 'such as', 'heart']"
+   answers:
+     Q1: 1 - The entity 'organ' is specific and is mentioned in the passage.
+     Q2: 1 - The entity 'heart' is specific and is mentioned in the passage.
+     Q3: 1 - The relation 'such as' is unclear. A better relation would be 'superclass of'.
+     Q4: 1 - The triplet is relatively important as it is used as a parenthetical example in the passage.
+     Q5: 0 - The provided triplet is unclear as is. "Organ such as heart" doesn't make sense.
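This rubric is loaded as `VALIDATION_CONFIG` by the package `__init__.py` added below. A short sketch of iterating it, e.g. to print the five subtasks and their score options:

```python
import yaml

with open("config/validation.yml") as f:
    rubric = yaml.safe_load(f)

print(rubric["instruction"])
for name, question in rubric["questions"].items():
    print(f"{name}. {question['title']} {question['text']}")
    for option in question["options"]:
        print(f"    - {option}")
```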
demo/config.ini
ADDED
@@ -0,0 +1,8 @@
+ [pipeline]
+ words_per_batch = 150
+ preprocess = clause_deconstruction
+ extraction = triplet_extraction
+
+ [llm]
+ model = gpt-3.5-turbo
+ temperature = 0.3
demo/demo.ipynb
ADDED
@@ -0,0 +1,291 @@
+ {
+  "cells": [
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "# `Words2Wisdom` Demo\n",
+     "\n",
+     "For the purposes of this notebook, we add the `src` directory to the `PYTHONPATH`:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 1,
+    "metadata": {},
+    "outputs": [],
+    "source": [
+     "import sys\n",
+     "\n",
+     "# add words2wisdom to PYTHONPATH\n",
+     "sys.path.append(\"../src/\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "Next, we load in the example text file (from OpenStax Bio 2e chapter 4.2):"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 2,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus...\n"
+      ]
+     }
+    ],
+    "source": [
+     "# load example text\n",
+     "with open(\"example.txt\") as f:\n",
+     "    text = f.read()\n",
+     "\n",
+     "# print example\n",
+     "print(text[:200] + \"...\")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "The `words2wisdom` pipeline can be configured from a configuration INI file. We have one prepared already; in general, you will need to create one with your desired settings.\n",
+     "\n",
+     "After configuration, we call the pipeline's `run` method. Then, we save all outputs to a ZIP file."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 3,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Initialized Text2KG pipeline:\n",
+       "[INPUT: text] -> ClauseDeconstruction() -> TripletExtraction() -> [OUTPUT: knowledge graph]\n",
+       "Running Text2KG pipeline:\n",
+       "Extracting knowledge graph... Cleaning knowledge graph components... Done!\n",
+       "Run ID: 2024-02-16-001\n",
+       "Saved data to ./output-2024-02-16-001.zip\n"
+      ]
+     }
+    ],
+    "source": [
+     "from words2wisdom.pipeline import Pipeline\n",
+     "from words2wisdom.utils import dump_all\n",
+     "\n",
+     "w2w = Pipeline.from_ini(\"config.ini\")\n",
+     "batches, graph = w2w.run(text)\n",
+     "\n",
+     "output_zip = dump_all(\n",
+     "    pipeline=w2w,\n",
+     "    text_batches=batches,\n",
+     "    knowledge_graph=graph,\n",
+     "    to_path=\".\"\n",
+     ")"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "Here we use GPT-4 to auto-evaluate the knowledge graph."
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 4,
+    "metadata": {},
+    "outputs": [
+     {
+      "name": "stdout",
+      "output_type": "stream",
+      "text": [
+       "Initializing knowledge graph validation. Run: 2024-02-16-001\n",
+       "\n",
+       "Starting excerpt 1 of 6. Validating 7 triplets... Done!\n",
+       "Starting excerpt 2 of 6. Validating 20 triplets... Done!\n",
+       "Starting excerpt 3 of 6. Validating 20 triplets... Done!\n",
+       "Starting excerpt 4 of 6. Validating 10 triplets... Done!\n",
+       "Starting excerpt 5 of 6. Validating 10 triplets... Done!\n",
+       "Starting excerpt 6 of 6. Validating 16 triplets... Done!\n",
+       "\n",
+       "Knowledge graph validation complete!\n",
+       "It took 109.471 seconds to validate 83 triplets.\n",
+       "Saved to: ./validation-2024-02-16-001.csv\n"
+      ]
+     }
+    ],
+    "source": [
+     "from langchain_openai import ChatOpenAI\n",
+     "from words2wisdom.validate import validate_knowledge_graph\n",
+     "\n",
+     "llm = ChatOpenAI(model=\"gpt-4-turbo-preview\")\n",
+     "\n",
+     "eval_file = validate_knowledge_graph(llm=llm, output_zip=output_zip)"
+    ]
+   },
+   {
+    "cell_type": "markdown",
+    "metadata": {},
+    "source": [
+     "There are 5 evaluation questions. The questions and score ranges can be found in `config/validation.yml`. Here are the results:"
+    ]
+   },
+   {
+    "cell_type": "code",
+    "execution_count": 5,
+    "metadata": {},
+    "outputs": [
+     {
+      "data": {
+       "text/html": [
+        "<div>\n",
+        "<style scoped>\n",
+        "    .dataframe tbody tr th:only-of-type {\n",
+        "        vertical-align: middle;\n",
+        "    }\n",
+        "\n",
+        "    .dataframe tbody tr th {\n",
+        "        vertical-align: top;\n",
+        "    }\n",
+        "\n",
+        "    .dataframe thead th {\n",
+        "        text-align: right;\n",
+        "    }\n",
+        "</style>\n",
+        "<table border=\"1\" class=\"dataframe\">\n",
+        "  <thead>\n",
+        "    <tr style=\"text-align: right;\">\n",
+        "      <th></th>\n",
+        "      <th>Q1</th>\n",
+        "      <th>Q2</th>\n",
+        "      <th>Q3</th>\n",
+        "      <th>Q4</th>\n",
+        "      <th>Q5</th>\n",
+        "    </tr>\n",
+        "  </thead>\n",
+        "  <tbody>\n",
+        "    <tr>\n",
+        "      <th>count</th>\n",
+        "      <td>83.000000</td>\n",
+        "      <td>83.000000</td>\n",
+        "      <td>83.000000</td>\n",
+        "      <td>83.000000</td>\n",
+        "      <td>83.000000</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>mean</th>\n",
+        "      <td>0.975904</td>\n",
+        "      <td>0.975904</td>\n",
+        "      <td>0.975904</td>\n",
+        "      <td>1.819277</td>\n",
+        "      <td>1.566265</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>std</th>\n",
+        "      <td>0.154281</td>\n",
+        "      <td>0.154281</td>\n",
+        "      <td>0.154281</td>\n",
+        "      <td>0.387128</td>\n",
+        "      <td>0.522489</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>min</th>\n",
+        "      <td>0.000000</td>\n",
+        "      <td>0.000000</td>\n",
+        "      <td>0.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>0.000000</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>25%</th>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>50%</th>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>75%</th>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "    </tr>\n",
+        "    <tr>\n",
+        "      <th>max</th>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>1.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "      <td>2.000000</td>\n",
+        "    </tr>\n",
+        "  </tbody>\n",
+        "</table>\n",
+        "</div>"
+       ],
+       "text/plain": [
+        "            Q1         Q2         Q3         Q4         Q5\n",
+        "count  83.000000  83.000000  83.000000  83.000000  83.000000\n",
+        "mean    0.975904   0.975904   0.975904   1.819277   1.566265\n",
+        "std     0.154281   0.154281   0.154281   0.387128   0.522489\n",
+        "min     0.000000   0.000000   0.000000   1.000000   0.000000\n",
+        "25%     1.000000   1.000000   1.000000   2.000000   1.000000\n",
+        "50%     1.000000   1.000000   1.000000   2.000000   2.000000\n",
+        "75%     1.000000   1.000000   1.000000   2.000000   2.000000\n",
+        "max     1.000000   1.000000   1.000000   2.000000   2.000000"
+       ]
+      },
+      "execution_count": 5,
+      "metadata": {},
+      "output_type": "execute_result"
+     }
+    ],
+    "source": [
+     "import pandas as pd\n",
+     "\n",
+     "data = pd.read_csv(eval_file)\n",
+     "data.describe(include=[int])"
+    ]
+   }
+  ],
+  "metadata": {
+   "kernelspec": {
+    "display_name": "nlp",
+    "language": "python",
+    "name": "python3"
+   },
+   "language_info": {
+    "codemirror_mode": {
+     "name": "ipython",
+     "version": 3
+    },
+    "file_extension": ".py",
+    "mimetype": "text/x-python",
+    "name": "python",
+    "nbconvert_exporter": "python",
+    "pygments_lexer": "ipython3",
+    "version": "3.10.11"
+   },
+   "orig_nbformat": 4
+  },
+  "nbformat": 4,
+  "nbformat_minor": 2
+ }
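For running the same flow outside Jupyter, the notebook condenses to a short script (paths assume it is executed from `demo/`, and `OPENAI_API_KEY` must be set in the environment since `demo/config.ini` contains no key):

```python
import sys
sys.path.append("../src/")  # as in the notebook; unnecessary if the package is installed

import pandas as pd
from langchain_openai import ChatOpenAI
from words2wisdom.pipeline import Pipeline
from words2wisdom.utils import dump_all
from words2wisdom.validate import validate_knowledge_graph

with open("example.txt") as f:
    text = f.read()

# build the pipeline, extract the KG, and archive the outputs
w2w = Pipeline.from_ini("config.ini")
batches, graph = w2w.run(text)
output_zip = dump_all(pipeline=w2w, text_batches=batches,
                      knowledge_graph=graph, to_path=".")

# auto-evaluate the KG with GPT-4 and summarize the scores
eval_file = validate_knowledge_graph(llm=ChatOpenAI(model="gpt-4-turbo-preview"),
                                     output_zip=output_zip)
print(pd.read_csv(eval_file).describe(include=[int]))
```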
demo/example.txt
ADDED
@@ -0,0 +1 @@
+ Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus). Animal cells, plants, fungi, and protists are all eukaryotes (eu- = true). All cells share four common components: 1) a plasma membrane, an outer covering that separates the cell's interior from its surrounding environment; 2) cytoplasm, consisting of a jelly-like cytosol within the cell in which there are other cellular components; 3) DNA, the cell's genetic material; and 4) ribosomes, which synthesize proteins. However, prokaryotes differ from eukaryotic cells in several ways. A prokaryote is a simple, mostly single-celled (unicellular) organism that lacks a nucleus, or any other membrane-bound organelle. We will shortly come to see that this is significantly different in eukaryotes. Prokaryotic DNA is in the cell's central part: the nucleoid (Figure 4.5). This figure shows the generalized structure of a prokaryotic cell. All prokaryotes have chromosomal DNA localized in a nucleoid, ribosomes, a cell membrane, and a cell wall. The other structures shown are present in some, but not all, bacteria. Most prokaryotes have a peptidoglycan cell wall and many have a polysaccharide capsule (Figure 4.5). The cell wall acts as an extra layer of protection, helps the cell maintain its shape, and prevents dehydration. The capsule enables the cell to attach to surfaces in its environment. Some prokaryotes have flagella, pili, or fimbriae. Flagella are used for locomotion. Pili exchange genetic material during conjugation, the process by which one bacterium transfers genetic material to another through direct contact. Bacteria use fimbriae to attach to a host cell. The most effective action anyone can take to prevent the spread of contagious illnesses is to wash his or her hands. Why? Because microbes (organisms so tiny that they can only be seen with microscopes) are ubiquitous. They live on doorknobs, money, your hands, and many other surfaces. If someone sneezes into his hand and touches a doorknob, and afterwards you touch that same doorknob, the microbes from the sneezer's mucus are now on your hands. If you touch your hands to your mouth, nose, or eyes, those microbes can enter your body and could make you sick. However, not all microbes (also called microorganisms) cause disease; most are actually beneficial. You have microbes in your gut that make vitamin K. Other microorganisms are used to ferment beer and wine. Microbiologists are scientists who study microbes. Microbiologists can pursue a number of careers. Not only do they work in the food industry, they are also employed in the veterinary and medical fields. They can work in the pharmaceutical sector, serving key roles in research and development by identifying new antibiotic sources that can treat bacterial infections. Environmental microbiologists may look for new ways to use specially selected or genetically engineered microbes to remove pollutants from soil or groundwater, as well as hazardous elements from contaminated sites. We call using these microbes bioremediation technologies. Microbiologists can also work in the bioinformatics field, providing specialized knowledge and insight for designing, developing, and specificity of computer models of, for example, bacterial epidemics.
+ At 0.1 to 5.0 μm in diameter, prokaryotic cells are significantly smaller than eukaryotic cells, which have diameters ranging from 10 to 100 μm (Figure 4.6). The prokaryotes' small size allows ions and organic molecules that enter them to quickly diffuse to other parts of the cell. Similarly, any wastes produced within a prokaryotic cell can quickly diffuse. This is not the case in eukaryotic cells, which have developed different structural adaptations to enhance intracellular transport. This figure shows relative sizes of microbes on a logarithmic scale (recall that each unit of increase in a logarithmic scale represents a 10-fold increase in the quantity measured). Small size, in general, is necessary for all cells, whether prokaryotic or eukaryotic. Let's examine why that is so. First, we'll consider the area and volume of a typical cell. Not all cells are spherical in shape, but most tend to approximate a sphere. You may remember from your high school geometry course that the formula for the surface area of a sphere is 4πr², while the formula for its volume is 4πr³/3. Thus, as the radius of a cell increases, its surface area increases as the square of its radius, but its volume increases as the cube of its radius (much more rapidly). Therefore, as a cell increases in size, its surface area-to-volume ratio decreases. This same principle would apply if the cell had a cube shape (Figure 4.7). If the cell grows too large, the plasma membrane will not have sufficient surface area to support the rate of diffusion required for the increased volume. In other words, as a cell grows, it becomes less efficient. One way to become more efficient is to divide. Another way is to develop organelles that perform specific tasks. These adaptations lead to developing more sophisticated cells, which we call eukaryotic cells. Notice that as a cell increases in size, its surface area-to-volume ratio decreases. When there is insufficient surface area to support a cell's increasing volume, a cell will either divide or die. The cell on the left has a volume of 1 mm³ and a surface area of 6 mm², with a surface area-to-volume ratio of 6 to 1; whereas, the cell on the right has a volume of 8 mm³ and a surface area of 24 mm², with a surface area-to-volume ratio of 3 to 1. Prokaryotic cells are much smaller than eukaryotic cells. What advantages might small cell size confer on a cell? What advantages might large cell size have?
main.py
DELETED
@@ -1,263 +0,0 @@
- import os
- import re
- import secrets
- import string
- import yaml
- from datetime import datetime
- from zipfile import ZipFile
-
- import gradio as gr
- import nltk
- import pandas as pd
- from langchain.embeddings import OpenAIEmbeddings
- from langchain.chains import SimpleSequentialChain
- from langchain.chat_models import ChatOpenAI
- from nltk.tokenize import sent_tokenize
- from pandas import DataFrame
-
- import utils
- from chains import llm_chains
-
-
- # download NLTK dependencies
- nltk.download("punkt")
- nltk.download("stopwords")
-
- # load stop words const.
- from nltk.corpus import stopwords
- STOP_WORDS = stopwords.words("english")
-
- # load global spacy model
- # try:
- #     SPACY_MODEL = spacy.load("en_core_web_sm")
- # except OSError:
- #     print("[spacy] Downloading model: en_core_web_sm")
-
- #     spacy.cli.download("en_core_web_sm")
- #     SPACY_MODEL = spacy.load("en_core_web_sm")
-
-
- class Text2KG:
-     """Text2KG class."""
-
-     def __init__(self, api_key: str, **kwargs):
-
-         self.llm = ChatOpenAI(openai_api_key=api_key, **kwargs)
-         self.embedding = OpenAIEmbeddings(openai_api_key=api_key)
-
-
-     def init(self, steps: list[str]):
-         """Initialize Text2KG pipeline from passed steps.
-
-         Args:
-             *steps (str): Steps to include in pipeline. Must be a top-level name present in
-                 the schema.yml file
-         """
-         self.pipeline = SimpleSequentialChain(
-             chains=[llm_chains[step](llm=self.llm) for step in steps],
-             verbose=False
-         )
-
-
-     def run(self, text: str) -> list[dict]:
-         """Run Text2KG pipeline on passed text.
-
-         Arg:
-             text (str): The text input
-
-         Returns:
-             triplets (list): The list of extracted KG triplets
-         """
-         triplets = self.pipeline.run(text)
-
-         return triplets
-
-
-     def clean(self, kg: DataFrame) -> DataFrame:
-         """Text2KG post-processing."""
-         drop_list = []
-
-         for i, row in kg.iterrows():
-             # drop stopwords (e.g. pronouns)
-             if (row.subject in STOP_WORDS) or (row.object in STOP_WORDS):
-                 drop_list.append(i)
-
-             # drop broken triplets
-             elif row.hasnans:
-                 drop_list.append(i)
-
-             # lowercase nodes/edges, drop articles
-             else:
-                 article_pattern = r'^(the|a|an) (.+)'
-                 be_pattern = r'^(are|is) (a )?(.+)'
-
-                 kg.at[i, "subject"] = re.sub(article_pattern, r'\2', row.subject.lower())
-                 kg.at[i, "relation"] = re.sub(be_pattern, r'\3', row.relation.lower())
-                 kg.at[i, "object"] = re.sub(article_pattern, r'\2', row.object.lower())
-
-         return kg.drop(drop_list)
-
-
-     def normalize(self, kg: DataFrame, threshold: float=0.3) -> DataFrame:
-         """Reduce dimensionality of Text2KG output by merging cosine-similar nodes/edges."""
-
-         ents = pd.concat([kg["subject"], kg["object"]]).unique()
-         rels = kg["relation"].unique()
-
-         ent_map = utils.condense_labels(ents, self.embedding.embed_documents, threshold=threshold)
-         rel_map = utils.condense_labels(rels, self.embedding.embed_documents, threshold=threshold)
-
-         kg_normal = pd.DataFrame()
-
-         kg_normal["subject"] = kg["subject"].map(ent_map)
-         kg_normal["relation"] = kg["relation"].map(rel_map)
-         kg_normal["object"] = kg["object"].map(ent_map)
-
-         return kg_normal
-
-
- def extract_knowledge_graph(api_key: str, batch_size: int, modules: list[str], text: str, progress=gr.Progress()):
-     """Extract knowledge graph from text.
-
-     Args:
-         api_key (str): OpenAI API key
-         batch_size (int): Number of sentences per forward pass
-         modules (list): Additional modules to add before main extraction step
-         text (str): Text from which Text2KG will extract knowledge graph from
-         progress: Progress bar. The default is gradio's progress bar; for a
-             command line progress bar, set `progress = tqdm`
-
-     Returns:
-         zip_path (str): Path to ZIP archive containing outputs
-         knowledge_graph (DataFrame): The extracted knowledge graph
-     """
-     # init
-     if api_key == "":
-         raise ValueError("API key is required")
-
-     pipeline = Text2KG(api_key=api_key, temperature=0.3)  # low temp. -> low randomness
-
-     steps = []
-
-     for module in modules:
-         m = module.lower().replace(' ', '_')
-         steps.append(m)
-
-     if (len(steps) == 0) or (steps[-1] != "triplet_extraction"):
-         steps.append("triplet_extraction")
-     steps = []
-
-     for module in modules:
-         m = module.lower().replace(' ', '_')
-         steps.append(m)
-
-     if (len(steps) == 0) or (steps[-1] != "triplet_extraction"):
-         steps.append("triplet_extraction")
-
-     pipeline.init(steps)
-
-     # split text into batches
-     sentences = sent_tokenize(text)
-     batches = [" ".join(sentences[i:i+batch_size])
-                for i in range(0, len(sentences), batch_size)]
-
-     # create KG
-     knowledge_graph = []
-
-     for i, batch in progress.tqdm(list(enumerate(batches)),
-                                   desc="Processing...", unit="batches"):
-         output = pipeline.run(batch)
-         [triplet.update({"sentence_id": i}) for triplet in output]
-
-         knowledge_graph.extend(output)
-
-
-     # convert to df, post-process data
-     knowledge_graph = pd.DataFrame(knowledge_graph)
-     knowledge_graph = pipeline.clean(knowledge_graph)
-
-     # rearrange columns
-     knowledge_graph = knowledge_graph[["sentence_id", "subject", "relation", "object"]]
-
-     # metadata
-     now = datetime.now()
-     date = str(now.date())
-
-     metadata = {
-         "_timestamp": now,
-         "batch_size": batch_size,
-         "modules": steps
-     }
-
-     # unique identifier for local saving
-     uid = ''.join(secrets.choice(string.ascii_letters)
-                   for _ in range(6))
-
-     print(f"Run ID: {date}/{uid}")
-
-     save_dir = os.path.join(".", "output", date, uid)
-     os.makedirs(save_dir, exist_ok=True)
-
-
-     # save metadata & data
-     with open(os.path.join(save_dir, "metadata.yml"), 'w') as f:
-         yaml.dump(metadata, f)
-
-     batches_df = pd.DataFrame(enumerate(batches), columns=["sentence_id", "text"])
-     batches_df.to_csv(os.path.join(save_dir, "sentences.txt"),
-                       index=False)
-
-     knowledge_graph.to_csv(os.path.join(save_dir, "kg.txt"),
-                            index=False)
-
-
-     # create ZIP file
-     zip_path = os.path.join(save_dir, "output.zip")
-
-     with ZipFile(zip_path, 'w') as zipObj:
-
-         zipObj.write(os.path.join(save_dir, "metadata.yml"))
-         zipObj.write(os.path.join(save_dir, "sentences.txt"))
-         zipObj.write(os.path.join(save_dir, "kg.txt"))
-
-     return zip_path, knowledge_graph
-
-
- class App:
-     def __init__(self):
-         demo = gr.Interface(
-             fn=extract_knowledge_graph,
-             title="Text2KG",
-             inputs=[
-                 gr.Textbox(placeholder="API key...", label="OpenAI API Key", type="password"),
-                 gr.Slider(minimum=1, maximum=10, step=1, label="Sentence Batch Size"),
-                 gr.CheckboxGroup(choices=["Clause Deconstruction"], label="Optional Modules"),
-                 gr.Textbox(lines=2, placeholder="Text Here...", label="Input Text"),
-             ],
-             outputs=[
-                 gr.File(label="Knowledge Graph"),
-                 gr.DataFrame(label="Preview",
-                              headers=["sentence_id", "subject", "relation", "object"],
-                              max_rows=10,
-                              overflow_row_behaviour="paginate")
-             ],
-             examples=[[None, 1, [], ("All cells share four common components: "
-                                      "1) a plasma membrane, an outer covering that separates the "
-                                      "cell's interior from its surrounding environment; 2) cytoplasm, "
-                                      "consisting of a jelly-like cytosol within the cell in which "
-                                      "there are other cellular components; 3) DNA, the cell's genetic "
-                                      "material; and 4) ribosomes, which synthesize proteins. However, "
-                                      "prokaryotes differ from eukaryotic cells in several ways. A "
-                                      "prokaryote is a simple, mostly single-celled (unicellular) "
-                                      "organism that lacks a nucleus, or any other membrane-bound "
-                                      "organelle. We will shortly come to see that this is significantly "
-                                      "different in eukaryotes. Prokaryotic DNA is in the cell's central "
-                                      "part: the nucleoid.")]],
-             allow_flagging="never",
-             cache_examples=False
-         )
-         demo.queue().launch(share=False)
-
-
- if __name__ == "__main__":
-     App()
requirements.txt
CHANGED
@@ -1,8 +1,8 @@
- gradio
- langchain
+ gradio
+ langchain
+ langchain_core
+ langchain_openai
+ nltk
+ openai
+ pandas
+ PyYAML
src/words2wisdom/__init__.py
ADDED
@@ -0,0 +1,30 @@
+ import os
+ import sys
+ import yaml
+
+ import nltk
+
+ # directories
+ PACKAGE_DIR = os.path.dirname(__file__)
+ ROOT = os.path.dirname(os.path.dirname(PACKAGE_DIR))
+ DATA_DIR = os.path.join(ROOT, "data")
+ CONFIG_DIR = os.path.join(ROOT, "config")
+ OUTPUT_DIR = os.path.join(ROOT, "output")
+
+ # add the package root directory to the python path
+ sys.path.append(os.path.dirname(PACKAGE_DIR))
+
+ # files
+ with open(os.path.join(CONFIG_DIR, "modules.yml")) as f:
+     MODULES_CONFIG = yaml.safe_load(f)
+
+ with open(os.path.join(CONFIG_DIR, "validation.yml")) as f:
+     VALIDATION_CONFIG = yaml.safe_load(f)
+
+ # download NLTK dependencies
+ nltk.download("punkt", quiet=True)
+ nltk.download("stopwords", quiet=True)
+
+ # load NLTK stop words
+ from nltk.corpus import stopwords
+ STOP_WORDS = stopwords.words("english")
src/words2wisdom/__main__.py
ADDED
@@ -0,0 +1,3 @@
+ if __name__ == "__main__":
+     from . import cli
+     cli.main()
src/words2wisdom/cli.py
ADDED
@@ -0,0 +1,97 @@
+ import argparse
+ import os
+ import subprocess
+
+ from langchain_openai import ChatOpenAI
+
+ from . import CONFIG_DIR, OUTPUT_DIR
+ from .pipeline import Pipeline
+ from .utils import dump_all
+ from .validate import validate_knowledge_graph
+
+
+ default_config_path = os.path.join(CONFIG_DIR, "default_config.ini")
+
+
+ def main():
+     parser = argparse.ArgumentParser(
+         prog="words2wisdom",
+         #description="Generate a knowledge graph from a given text using OpenAI LLMs"
+     )
+     subparsers = parser.add_subparsers(dest="command",
+                                        help="Available commands")
+
+     # init
+     parser_init = subparsers.add_parser("init",
+                                         help="Return the default config.ini file",
+                                         description="Return the default config.ini file")
+     parser_init.set_defaults(func=get_default_config)
+
+     # gui
+     parser_gui = subparsers.add_parser("gui",
+                                        help="run Words2Wisdom using Gradio interface",
+                                        description="run Words2Wisdom using Gradio interface")
+     parser_gui.set_defaults(func=gui)
+
+     # run
+     parser_run = subparsers.add_parser("run",
+                                        help="Generate a knowledge graph from a given text using OpenAI LLMs",
+                                        description="Generate a knowledge graph from a given text using OpenAI LLMs")
+     parser_run.add_argument("text",
+                             help="Path to text file")
+     parser_run.add_argument("--config",
+                             help="Path to config.ini file",
+                             default=default_config_path)
+     parser_run.add_argument("--output-dir",
+                             help="Path to save outputs to",
+                             default=OUTPUT_DIR)
+     parser_run.set_defaults(func=run)
+
+     # eval
+     parser_eval = subparsers.add_parser("eval",
+                                         help="Auto-evaluate knowledge graph using GPT-4",
+                                         description="Auto-evaluate knowledge graph using GPT-4")
+     parser_eval.add_argument("output_zip",
+                              help="Path to output.zip file containing knowledge graph")
+     parser_eval.set_defaults(func=validate)
+
+     args = parser.parse_args()
+     args.func(args)
+
+
+ def get_default_config(args):
+     """Print default config.ini"""
+     with open(default_config_path) as f:
+         default_config = f.read()
+
+     print(default_config)
+
+
+ def gui(args):
+     """Run Gradio interface"""
+     subprocess.run(["python", "-m", "words2wisdom.gui"])
+
+
+ def run(args):
+     """Text to KG pipeline"""
+     pipe = Pipeline.from_ini(args.config)
+
+     with open(args.text) as f:
+         batches, kg = pipe.run(f.read())
+
+     dump_all(pipe, batches, kg, to_path=args.output_dir)
+
+
+ def validate(args):
+     """Validate knowledge graph"""
+     validate_knowledge_graph(
+         llm=ChatOpenAI(
+             model="gpt-4-turbo-preview",
+             #openai_api_key=...
+         ),
+         output_zip=args.output_zip
+     )
+
+
+ if __name__ == "__main__":
+     main()
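The `eval` subcommand is a thin wrapper around `validate_knowledge_graph`; its programmatic equivalent is roughly the following sketch (the ZIP path is a placeholder):

```python
from langchain_openai import ChatOpenAI
from words2wisdom.validate import validate_knowledge_graph

# equivalent of: python -m words2wisdom eval path/to/output.zip
validate_knowledge_graph(
    llm=ChatOpenAI(model="gpt-4-turbo-preview"),
    output_zip="path/to/output.zip",  # placeholder path
)
```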
src/words2wisdom/config.py
ADDED
@@ -0,0 +1,96 @@
+ import configparser
+ import ast
+
+
+ class Config:
+     def __init__(self, config_data):
+         self.config_data = config_data
+
+     def __getattr__(self, name):
+         return self.config_data.get(name, {})
+
+     def __setattr__(self, name, value):
+         if name == 'config_data':
+             super().__setattr__(name, value)
+         else:
+             self.config_data[name] = value
+
+     def __repr__(self):
+         return f"Config(\n{'pipeline':>12}: {self.pipeline}\n{'llm':>12}: {self.llm}\n)"
+
+     @classmethod
+     def read_ini(cls, file_path):
+
+         parser = configparser.ConfigParser()
+         parser.read(file_path)
+
+         return cls({"pipeline": cls._parse_pipeline_section(parser["pipeline"]),
+                     "llm": cls._parse_llm_section(parser["llm"])})
+
+     @staticmethod
+     def _parse_llm_section(section):
+         parsed_data = {}
+         for key, value in section.items():
+             try:
+                 parsed_data[key] = ast.literal_eval(value)
+             except ValueError:
+                 parsed_data[key] = value
+
+         return parsed_data
+
+     @staticmethod
+     def _parse_pipeline_section(section):
+         eval_func = {
+             "words_per_batch": int,
+             "preprocess": lambda x: x.split(", ") if x.split(", ") != ["None"] else []
+         }
+         parsed_data = {}
+
+         for key, value in section.items():
+             parsed_data[key] = eval_func.get(key, str)(value)
+
+         return parsed_data
+
+
+     def serialize(self, save_path: str=None):
+         """Convert Config object to .ini file. If save_path is not specified, return string"""
+         serialized_config = ''
+
+         for section in self.config_data:
+             serialized_config += f"[{section}]\n"
+
+             for key, value in self.config_data[section].items():
+                 # turn list back to str
+                 if isinstance(value, list):
+                     value = ", ".join(value)
+
+                 # don't serialize the api key
+                 if key == "openai_api_key":
+                     value = None
+
+                 serialized_config += f"{key} = {value}\n"
+
+             serialized_config += "\n"
+
+         if save_path:
+             with open(save_path, 'w') as f:
+                 f.write(serialized_config)
+         else:
+             return serialized_config
+
+
+ if __name__ == "__main__":
+     # example usage
+     config_file = "/Users/johaunh/Documents/PhD/Projects/Text2KG/config/config.ini"
+     config = Config.read_ini(config_file)
+
+     # access pipeline parameters
+     print("Pipeline Parameters:")
+     for k, v in config.pipeline.items():
+         print(f"{k}: {v}")
+
+     # access LLM parameters
+     print("\nLLM Parameters:")
+     for k, v in config.llm.items():
+         print(f"{k}: {v}")
src/words2wisdom/gui.py
ADDED
@@ -0,0 +1,86 @@
+ import os
+
+ import gradio as gr
+
+ from . import CONFIG_DIR, ROOT
+ from .config import Config
+ from .pipeline import Pipeline
+ from .utils import dump_all
+
+
+ example_file = (os.path.join(ROOT, "demo", "prokaryotes.txt"))
+ example_text = "The quick brown fox jumps over the lazy dog. The cat sits on the mat."
+
+
+ def text2kg_from_string(openai_api_key: str, input_text: str):
+
+     config = Config.read_ini(os.path.join(CONFIG_DIR, "default_config.ini"))
+     config.llm["openai_api_key"] = openai_api_key
+
+     pipeline = Pipeline(config)
+     text_batches, knowledge_graph = pipeline.run(input_text)
+
+     zip_path = dump_all(config, text_batches, knowledge_graph)
+
+     return knowledge_graph, zip_path
+
+
+ def text2kg_from_file(openai_api_key: str, input_file):
+     with open(input_file.name) as f:
+         input_text = f.read()
+
+     return text2kg_from_string(openai_api_key, input_text)
+
+
+ with gr.Blocks(title="Text2KG") as demo:
+     gr.Markdown("# 🧠🌐 Text2KG")
+
+     with gr.Column(variant="panel"):
+         openai_api_key = gr.Textbox(label="OpenAI API Key", placeholder="sk-...", type="password")
+
+     with gr.Row(equal_height=False):
+         with gr.Column(variant="panel"):
+             gr.Markdown("## Input (Text or Text File)")
+             #gr.Markdown("A knowledge graph will be generated for the provided text.")
+             with gr.Tab("Direct Input"):
+                 text_string = gr.Textbox(lines=2, placeholder="Text Here...", label="Text")
+                 submit_str = gr.Button()
+
+             with gr.Tab("File Upload"):
+                 text_file = gr.File(file_types=["text"], label="Text File")
+                 submit_file = gr.Button()
+
+         with gr.Column(variant="panel"):
+             gr.Markdown("## Output (ZIP Archive)")
+             #gr.Markdown("The ZIP contains the generated knowledge graph, the text batches (indexed), and a configuration file.")
+             output_zip = gr.File(label="ZIP")
+
+             with gr.Accordion(label="Preview of Knowledge Graph", open=False):
+                 output_graph = gr.DataFrame(headers=["batch_id", "subject", "relation", "object"], label="Knowledge Graph")
+
+     with gr.Accordion(label="Examples", open=False):
+         gr.Markdown("### Text Example")
+         gr.Examples(
+             examples=[[None, example_text]],
+             inputs=[openai_api_key, text_string],
+             outputs=[output_graph, output_zip],
+             fn=text2kg_from_string,
+             preprocess=False,
+             postprocess=False
+         )
+
+         gr.Markdown("### File Example")
+         gr.Examples(
+             examples=[[None, example_file]],
+             inputs=[openai_api_key, text_file],
+             outputs=[output_graph, output_zip],
+             fn=text2kg_from_file,
+             preprocess=False,
+             postprocess=False
+         )
+
+     submit_str.click(fn=text2kg_from_string, inputs=[openai_api_key, text_string], outputs=[output_graph, output_zip])
+     submit_file.click(fn=text2kg_from_file, inputs=[openai_api_key, text_file], outputs=[output_graph, output_zip])
+
+
+ demo.launch(inbrowser=True, width="75%")
chains.py → src/words2wisdom/output_parsers.py
RENAMED
@@ -1,13 +1,7 @@
- import yaml
- from langchain.output_parsers import NumberedListOutputParser
- from langchain.prompts import ChatPromptTemplate
-
-
- with open("./schema.yml") as f:
-     schema = yaml.safe_load(f)
+ import re
+
+ import pandas as pd
+ from langchain_core.output_parsers import NumberedListOutputParser, StrOutputParser


  class ClauseParser(NumberedListOutputParser):

@@ -32,16 +26,27 @@ class TripletParser(NumberedListOutputParser):

      def get_format_instructions(self) -> str:
          return super().get_format_instructions()
+
+
+ class QuestionOutputParser(StrOutputParser):
+     def get_format_instructions(self) -> str:
+         return super().get_format_instructions()
+
+     def parse(self, text: str) -> pd.DataFrame:
+         """Parses the response to an array of answers/explanations."""
+         output = super().parse(text)
+         raw_list = re.findall(r'\d+\) (.*?)(?=\n\d+\)|\Z)', output, re.DOTALL)
+
+         raw_df = pd.DataFrame(raw_list).T
+
+         df = pd.DataFrame()
+
+         for idx in raw_df.columns:
+             # answer and explanation headers
+             ans_i = f"Q{idx+1}"
+             why_i = f"Q{idx+1}_explain"
+
+             # split response into answer/explanation columns
+             df[[ans_i, why_i]] = raw_df[idx].str.extract(r'(\d) \- (.*)')
+
+         return df
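To illustrate what the new `QuestionOutputParser` returns, a small sketch with an invented rubric-style response (five numbered `score - explanation` lines, the format its two regexes expect):

```python
from words2wisdom.output_parsers import QuestionOutputParser

response = (
    "1) 1 - The subject entity is specific.\n"
    "2) 1 - The object entity is specific.\n"
    "3) 0 - The relation is unclear.\n"
    "4) 2 - Critical fact of the passage.\n"
    "5) 1 - Somewhat dependent on the passage."
)

# one-row DataFrame with answer/explanation columns per question
row = QuestionOutputParser().parse(response)
print(row.columns.tolist())  # ['Q1', 'Q1_explain', ..., 'Q5', 'Q5_explain']
print(row.at[0, "Q3"], "-", row.at[0, "Q3_explain"])
# 0 - The relation is unclear.
```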
src/words2wisdom/pipeline.py
ADDED
@@ -0,0 +1,183 @@
+ import re
+ from typing import List
+
+ import pandas as pd
+ from langchain_openai import ChatOpenAI
+ from langchain_core.output_parsers import StrOutputParser
+ from langchain_core.prompts import ChatPromptTemplate
+ from langchain_core.runnables import RunnablePassthrough
+ from nltk.tokenize import sent_tokenize
+
+ from . import MODULES_CONFIG, STOP_WORDS
+ from .config import Config
+ from .output_parsers import ClauseParser, TripletParser
+ from .utils import partition_sentences
+
+
+ # llm output parsers
+ PARSERS = {
+     "StrOutputParser": StrOutputParser(),
+     "ClauseParser": ClauseParser(),
+     "TripletParser": TripletParser()
+ }
+
+
+ class Module:
+     """Text2KG module class."""
+     def __init__(self, name: str) -> None:
+         self.name = name
+         self.parser = self.get_parser()
+         self.prompts = self.get_prompts()
+         self.type = self.get_module_type()
+
+     def __repr__(self):
+         return self.name.replace("_", " ").title().replace(" ", "") + "()"
+
+     def get_prompts(self):
+         return ChatPromptTemplate.from_messages(MODULES_CONFIG[self.name]["prompts"].items())
+
+     def get_parser(self):
+         return PARSERS.get(MODULES_CONFIG[self.name]["parser"], StrOutputParser())
+
+     def get_module_type(self):
+         return MODULES_CONFIG[self.name]["type"]
+
+
+ class Pipeline:
+     """Text2KG pipeline class."""
+
+     def __init__(self, config: Config):
+
+         self.config = config
+         self.initialize(config)
+
+
+     def __call__(self, text: str, clean: bool=True) -> pd.DataFrame:
+         return self.run(text, clean)
+
+
+     def __repr__(self) -> str:
+         return f"Text2KG(\n\tconfig.pipeline={self.config.pipeline}\n\tconfig.llm={self.config.llm}\n)"
+
+
+     def __str__(self) -> str:
+         return ("[INPUT: text] -> "
+                 + " -> ".join([str(m) for m in self.modules])
+                 + " -> [OUTPUT: knowledge graph]")
+
+
+     @classmethod
+     def from_ini(cls, config_path: str):
+         return cls(Config.read_ini(config_path))
+
+
+     def initialize(self, config: Config):
+         """Initialize Text2KG pipeline from config."""
+
+         # validate preprocess
+         preprocess_modules = [Module(name) for name in config.pipeline["preprocess"]]
+
+         for item in preprocess_modules:
+             if item.get_module_type() != "preprocess":
+                 raise ValueError(f"Expected preprocess step `{item.name}` to"
+                                  f" have module type='preprocess'. Consider reviewing"
+                                  f" schema.yml")
+
+         # validate extraction process
+         extraction_module = Module(config.pipeline["extraction"])
+
+         if extraction_module.get_module_type() != "extraction":
+             raise ValueError(f"Expected `{extraction_module.name}` to have module"
+                              f" type='extraction'. Consider reviewing schema.yml")
+
+         # combine preprocess + extraction
+         self.modules = preprocess_modules + [extraction_module]
+
+         # init prompts & parsers
+         prompts = [m.get_prompts() for m in self.modules]
+         parsers = [m.get_parser() for m in self.modules]
+
+         # init llm
+         llm = ChatOpenAI(**self.config.llm)
+
+         # init chains
+         chains = [(prompt | llm | parser)
+                   for prompt, parser in zip(prompts, parsers)]
+
+         # stitch chains together
+         self.pipeline = {"text": RunnablePassthrough()} | chains[0]
+         for i in range(1, len(chains)):
+             self.pipeline = {"text": self.pipeline} | chains[i]
+
+         # print pipeline
+         print("Initialized Text2KG pipeline:")
+         print(str(self))
+
+
+     def run(self, text: str, clean=True) -> tuple[List[str], pd.DataFrame]:
+         """Run Text2KG pipeline on passed text.
+
+         Args:
+             text (str): The text input
+             clean (bool): Whether to clean the raw KG or not
+
+         Returns:
+             text_batches (list): Batched text
+             knowledge_graph (DataFrame): A dataframe containing the extracted KG triplets,
+                 indexed by batch
+         """
+         print("Running Text2KG pipeline:")
+         # split text into batches
+         text_batches = list(partition_sentences(
+             sentences=sent_tokenize(text),
+             min_words=self.config.pipeline["words_per_batch"]
+         ))
+
+         # run pipeline in parallel; convert to dataframe
+         print("Extracting knowledge graph...", end=' ')
+         output = self.pipeline.batch(text_batches)
+
+         knowledge_graph = pd.DataFrame([{'batch_id': i, **triplet}
+                                         for i, batch in enumerate(output)
+                                         for triplet in batch])
+
+         if clean:
+             knowledge_graph = self._clean(knowledge_graph)
+
+         print("Done!", end='\n')
+
+         return text_batches, knowledge_graph
+
+
+     def _clean(self, kg: pd.DataFrame) -> pd.DataFrame:
+         """Text2KG post-processing."""
+         print("Cleaning knowledge graph components...", end=' ')
+         drop_list = []
+
+         for i, row in kg.iterrows():
+             # drop stopwords (e.g. pronouns)
+             if (row.subject in STOP_WORDS) or (row.object in STOP_WORDS):
+                 drop_list.append(i)
+
+             # drop broken triplets
+             elif row.hasnans:
+                 drop_list.append(i)
+
166 |
+
# lowercase nodes/edges, drop articles
|
167 |
+
else:
|
168 |
+
article_pattern = r'^(the|a|an) (.+)'
|
169 |
+
be_pattern = r'^(are|is) (a |an )?(.+)'
|
170 |
+
|
171 |
+
kg.at[i, "subject"] = re.sub(article_pattern, r'\2', row.subject.lower())
|
172 |
+
kg.at[i, "relation"] = re.sub(be_pattern, r'\3', row.relation.lower())
|
173 |
+
kg.at[i, "object"] = re.sub(article_pattern, r'\2', row.object.lower()).strip('.')
|
174 |
+
|
175 |
+
return kg.drop(drop_list)
|
176 |
+
|
177 |
+
|
178 |
+
def _normalize(self):
|
179 |
+
"""Unused."""
|
180 |
+
return
|
181 |
+
|
182 |
+
def serialize(self):
|
183 |
+
return self.config.serialize()
|
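Taken together, a minimal usage sketch for this pipeline, assuming the repo's `demo/config.ini` and an `OPENAI_API_KEY` environment variable; the sample text is illustrative:

```
from words2wisdom.pipeline import Pipeline
from words2wisdom.utils import dump_all

# nltk's punkt tokenizer must be available: nltk.download("punkt")
pipeline = Pipeline.from_ini("demo/config.ini")

text = "Mitochondria are organelles that produce ATP. ATP powers the cell."
batches, kg = pipeline.run(text)  # kg columns: batch_id, subject, relation, object

dump_all(pipeline, batches, kg, to_path="output")  # writes output/output-<date>-<hex>.zip
```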
src/words2wisdom/utils.py
ADDED
@@ -0,0 +1,62 @@
import os
from datetime import datetime
from typing import List
from zipfile import ZipFile

import pandas as pd


def partition_sentences(sentences: List[str], min_words: int):
    current_batch = []
    word_count = 0

    for sentence in sentences:
        # count the number of words in the sentence
        word_count += len(sentence.split())

        # add sentence to the current batch
        current_batch.append(sentence)

        # if the word count exceeds or equals the minimum threshold, yield the current batch
        if word_count >= min_words:
            yield " ".join(current_batch)
            current_batch = []  # reset the batch
            word_count = 0      # reset the word count

    # yield the remaining batch if it's not empty
    if current_batch:
        yield " ".join(current_batch)


def dump_all(pipeline, text_batches: List[str], knowledge_graph: pd.DataFrame, to_path: str = "."):
    """Save all items to ZIP."""
    # metadata
    date = str(datetime.now().date())

    # convert batches to df
    batches_df = pd.DataFrame(text_batches, columns=["text"])

    # date + hex id for local saving
    num = 0
    while True:
        hex_num = format(num, 'X').zfill(3)
        filename = f"output-{date}-{hex_num}.zip"
        zip_path = os.path.join(to_path, filename)

        if os.path.exists(zip_path):
            num += 1
        else:
            break

    print(f"Run ID: {date}-{hex_num}")
    os.makedirs(to_path, exist_ok=True)

    # create ZIP file
    with ZipFile(zip_path, 'w') as zipObj:
        zipObj.writestr("config.ini", pipeline.serialize())
        zipObj.writestr("text_batches.csv", batches_df.to_csv(index_label="batch_id"))
        zipObj.writestr("kg.csv", knowledge_graph.to_csv(index=False))

    print(f"Saved data to {zip_path}")

    return zip_path
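A quick sketch of `partition_sentences` in isolation, assuming nltk's punkt model is downloaded:

```
import nltk
from nltk.tokenize import sent_tokenize
from words2wisdom.utils import partition_sentences

nltk.download("punkt", quiet=True)  # sentence tokenizer model

text = "First sentence here. Second one follows. A third closes it."
batches = list(partition_sentences(sent_tokenize(text), min_words=5))
# ['First sentence here. Second one follows.', 'A third closes it.']
```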
src/words2wisdom/validate.py
ADDED
@@ -0,0 +1,179 @@
import os
import re
from time import time
from argparse import ArgumentParser
from typing import List
from zipfile import ZipFile

import pandas as pd
from langchain_openai import ChatOpenAI
from langchain_core.prompts import ChatPromptTemplate

from . import VALIDATION_CONFIG, OUTPUT_DIR
from .output_parsers import QuestionOutputParser


def parse_args():
    parser = ArgumentParser()
    parser.add_argument(
        "run_ids", nargs="+", help="Run IDs to evaluate. Format: YYYY-MM-DD-XXX"
    )
    parser.add_argument(
        "--search_dir", help="Directory to search for output", default=OUTPUT_DIR
    )
    return parser.parse_args()


def format_system_prompt():
    """Format instructional prompt from config/validation.yml"""

    def format_question(question: dict):
        # Question format: {title} {text} {additional} ({options})
        # Ex. Can pigs fly? Explain. (Yes/No)
        formatted = (
            question["title"]
            + " "
            + question["text"]
            + " "
            + (question["additional"] + " " if question["additional"] else "")
            + "("
            + ";".join(question["options"])
            + ")\n"
        )
        return formatted

    def format_example(example: dict):
        formatted = (
            f"PASSAGE: {example['passage']}\n"
            f"TRIPLET: {example['triplet']}\n\n"
            + "".join([
                f"{i}) {answer}\n"
                for i, answer in enumerate(example["answers"].values(), start=1)
            ])
        )
        return formatted

    instruction = VALIDATION_CONFIG["instruction"]
    questions = "".join([
        f"{i}) {format_question(q)}"
        for i, q in enumerate(VALIDATION_CONFIG["questions"].values(), start=1)
    ])
    example = format_example(VALIDATION_CONFIG["example"])

    system_prompt = (
        f"{instruction}\n" f"QUESTIONS:\n{questions}\n" f"[* EXAMPLE *]\n\n{example}"
    )

    return system_prompt


def validate_triplets(
    llm: ChatOpenAI, instruction: str, passage: str, triplets: List[List[str]]
) -> List[pd.DataFrame]:
    """Validate triplets with respect to passage."""

    print(
        f"Validating {len(triplets):>2} triplet{'s' if len(triplets) != 1 else ''}...",
        end=" ",
    )

    prompt = ChatPromptTemplate.from_messages([
        ("system", "{instruction}"),
        ("user", "PASSAGE: {passage}\n\nTRIPLET: {triplet}"),
    ])

    chain = prompt | llm | QuestionOutputParser()

    output = chain.batch([
        {"instruction": instruction, "passage": passage, "triplet": triplet}
        for triplet in triplets
    ])

    print("Done!", end="\n")
    return output


def validate_knowledge_graph(llm: ChatOpenAI, output_zip: str):
    """Validate all triplets in a knowledge graph."""

    run_id = re.findall(r"output-(.*)\.zip", output_zip)[0]
    run_dir = os.path.dirname(output_zip)

    # read output zip
    with ZipFile(output_zip) as z:
        # load knowledge graph
        with z.open("kg.csv") as f:
            graph = pd.read_csv(f)

        # load text batches
        with z.open("text_batches.csv") as f:
            text_batches = pd.read_csv(f)

    print("Initializing knowledge graph validation. Run:", run_id)
    print()

    # start stopwatch
    start = time()

    # container for evaluation responses
    responses = []

    # instructions
    instruction = format_system_prompt()

    # triplets are batched by passage, so we iterate over passages
    for idx, passage in enumerate(text_batches.text):
        triplets = (
            graph[graph["batch_id"] == idx].drop(columns=["batch_id"]).values.tolist()
        )

        print(f"Starting excerpt {idx + 1:>2} of {len(text_batches)}.", end=" ")

        # if batch has no triplets to validate, skip batch
        if len(triplets) == 0:
            print("Excerpt has no triplets to validate.", end="\n")
            continue

        # validate triplets by batch
        response = validate_triplets(
            llm=llm,
            instruction=instruction,
            passage=passage,
            triplets=triplets
        )
        responses.extend(response)

    validation = pd.concat(responses, ignore_index=True)

    # merge with knowledge graph data
    validation_merged = (
        text_batches.merge(graph)
        .drop(columns=["batch_id"])
        .merge(validation, left_index=True, right_index=True)
    )

    savepath = os.path.join(run_dir, f"validation-{run_id}.csv")
    validation_merged.to_csv(savepath, index=False)

    # stop stopwatch
    end = time()

    print("\nKnowledge graph validation complete!")
    print(f"It took {end - start:0.3f} seconds to validate {len(validation)} triplets.")
    print("Saved to:", savepath)

    return savepath


if __name__ == "__main__":
    args = parse_args()

    llm = ChatOpenAI(
        model="gpt-4-turbo-preview",
        openai_api_key=os.getenv("OPENAI_API_KEY")
    )

    for run_id in args.run_ids:
        zipfile = os.path.join(args.search_dir, f"output-{run_id}.zip")
        validate_knowledge_graph(llm, zipfile)
        print("* " * 25)
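And a hedged sketch of validating a saved run programmatically, mirroring the `__main__` block above; the run ID and output path are illustrative:

```
import os
from langchain_openai import ChatOpenAI
from words2wisdom.validate import validate_knowledge_graph

llm = ChatOpenAI(
    model="gpt-4-turbo-preview",
    openai_api_key=os.getenv("OPENAI_API_KEY")
)

# pass the path to an output ZIP produced by dump_all()
csv_path = validate_knowledge_graph(llm, "output/output-2024-01-01-000.zip")
```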
utils.py
DELETED
@@ -1,31 +0,0 @@
import numpy as np
from typing import Callable

from sklearn.cluster import AgglomerativeClustering


def condense_labels(labels: np.ndarray, embedding_func: Callable, threshold: float=0.5):
    """Combine cosine-similar labels under same name."""

    embeddings = np.array(embedding_func(labels))

    clustering = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=threshold
    ).fit(embeddings)

    clusters = [np.where(clustering.labels_ == l)[0]
                for l in range(clustering.n_clusters_)]

    clusters_reduced = []

    for c in clusters:
        embs = embeddings[c]
        centroid = np.mean(embs)

        idx = c[np.argmin(np.linalg.norm(embs - centroid, axis=1))]
        clusters_reduced.append(idx)

    old2new = {old_id: new_id for old_ids, new_id in zip(clusters, clusters_reduced) for old_id in old_ids}

    return {labels[i]: labels[j] for i, j in old2new.items()}
writeup/words2wisdom_short.pdf
ADDED
Binary file (168 kB).