# `Words2Wisdom` Demo

For the purposes of this notebook, we add the `src` directory to Python's import path (`sys.path`):

In [1]:
import sys

# make the words2wisdom package importable by adding src/ to sys.path
sys.path.append("../src/")
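
A relative entry such as `"../src/"` depends on the notebook's working directory, and re-running the cell appends it again. The following is a minimal sketch of a more defensive variant, assuming only the standard library; the helper name `add_to_sys_path` is hypothetical and not part of `words2wisdom`.

```python
import sys
from pathlib import Path


def add_to_sys_path(directory):
    """Append the absolute form of `directory` to sys.path unless already present.

    Hypothetical helper for illustration; returns the resolved path as a string.
    """
    resolved = str(Path(directory).resolve())
    if resolved not in sys.path:
        sys.path.append(resolved)
    return resolved


# Same target as the cell above, but stored as an absolute path
src_dir = add_to_sys_path("../src/")
```

Resolving to an absolute path keeps imports working if the working directory changes later in the session, and the membership check keeps `sys.path` free of duplicates when the cell is re-run.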

Next, we load the example text file (from OpenStax Biology 2e, section 4.2):

In [2]:
# load example text
with open("example.txt") as f:
    text = f.read()

# print example
print(text[:200] + "...")

Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus...


The `words2wisdom` pipeline is configured from an INI file. This demo uses a prepared `config.ini`; for your own runs, create one with your desired settings.
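
Before writing your own configuration, it can help to see which sections and keys the prepared file defines. The sketch below uses only the standard-library `configparser`; it makes no assumption about the settings `words2wisdom` expects, and the helper name `read_ini_settings` is hypothetical.

```python
from configparser import ConfigParser


def read_ini_settings(path):
    """Return the contents of an INI file as {section: {key: value}}.

    Hypothetical helper for illustration. Interpolation is disabled so that
    values containing '%' are returned verbatim. Note that ConfigParser
    lowercases option names by default.
    """
    parser = ConfigParser(interpolation=None)
    # ConfigParser.read() silently skips missing files, so check what was read
    if not parser.read(path):
        raise FileNotFoundError(path)
    return {section: dict(parser[section]) for section in parser.sections()}
```

Calling `read_ini_settings("config.ini")` next to this notebook would list the sections and values of the prepared file, which can then serve as a template.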

After configuring the pipeline, we call its `run` method on the text, then save all outputs to a ZIP file with `dump_all`.

In [3]:
from words2wisdom.pipeline import Pipeline
from words2wisdom.utils import dump_all

w2w = Pipeline.from_ini("config.ini")
batches, graph = w2w.run(text)

output_zip = dump_all(
    pipeline=w2w,
    text_batches=batches,
    knowledge_graph=graph,
    to_path="."
)

Initialized Text2KG pipeline:
[INPUT: text] -> ClauseDeconstruction() -> TripletExtraction() -> [OUTPUT: knowledge graph]
Running Text2KG pipeline:
Extracting knowledge graph... Cleaning knowledge graph components... Done!
Run ID: 2024-02-16-001
Saved data to ./output-2024-02-16-001.zip
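
To check what was written, the archive can be inspected without extracting it. This is a minimal sketch using the standard-library `zipfile`; it assumes nothing about the archive's internal layout, and the helper name `list_archive_members` is hypothetical.

```python
import zipfile


def list_archive_members(zip_path):
    """Return (filename, uncompressed size in bytes) for each archive member.

    Hypothetical helper for illustration; it only reads the ZIP directory.
    """
    with zipfile.ZipFile(zip_path) as archive:
        return [(info.filename, info.file_size) for info in archive.infolist()]
```

Assuming `output_zip` holds the path of the saved archive, as the log line above suggests, `list_archive_members(output_zip)` would show the files it contains.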


Here we use GPT-4 (the `gpt-4-turbo-preview` model) to automatically evaluate the knowledge graph.

In [4]:
from langchain_openai import ChatOpenAI
from words2wisdom.validate import validate_knowledge_graph

llm = ChatOpenAI(model="gpt-4-turbo-preview")

eval_file = validate_knowledge_graph(llm=llm, output_zip=output_zip)

Initializing knowledge graph validation. Run: 2024-02-16-001

Starting excerpt  1 of 6. Validating  7 triplets... Done!
Starting excerpt  2 of 6. Validating 20 triplets... Done!
Starting excerpt  3 of 6. Validating 20 triplets... Done!
Starting excerpt  4 of 6. Validating 10 triplets... Done!
Starting excerpt  5 of 6. Validating 10 triplets... Done!
Starting excerpt  6 of 6. Validating 16 triplets... Done!

Knowledge graph validation complete!
It took 109.471 seconds to validate 83 triplets.
Saved to: ./validation-2024-02-16-001.csv


Each triplet is scored on five evaluation questions; the questions and their score ranges are defined in `config/validation.yml`. Here are summary statistics for the 83 validated triplets:

In [5]:
import pandas as pd

data = pd.read_csv(eval_file)
data.describe(include=[int])

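|       | Q1       | Q2       | Q3       | Q4       | Q5       |
|-------|----------|----------|----------|----------|----------|
| count | 83.0     | 83.0     | 83.0     | 83.0     | 83.0     |
| mean  | 0.975904 | 0.975904 | 0.975904 | 1.819277 | 1.566265 |
| std   | 0.154281 | 0.154281 | 0.154281 | 0.387128 | 0.522489 |
| min   | 0.0      | 0.0      | 0.0      | 1.0      | 0.0      |
| 25%   | 1.0      | 1.0      | 1.0      | 2.0      | 1.0      |
| 50%   | 1.0      | 1.0      | 1.0      | 2.0      | 2.0      |
| 75%   | 1.0      | 1.0      | 1.0      | 2.0      | 2.0      |
| max   | 1.0      | 1.0      | 1.0      | 2.0      | 2.0      |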
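
In the output above, Q1 to Q3 peak at 1 while Q4 and Q5 peak at 2, so the raw means are not directly comparable across questions. One way to put them on a common footing is to report, for each question, the share of triplets that received the highest observed score. The sketch below is dependency-free (standard-library `csv` and `statistics` rather than pandas); the helper name `summarize_scores` is hypothetical, and it assumes the validation CSV has one numeric column per question, as the `Q1` to `Q5` columns above suggest. The highest observed score is not necessarily the top of the scale defined in `config/validation.yml`.

```python
import csv
from statistics import mean


def summarize_scores(csv_path, columns):
    """Summarize numeric score columns of a CSV file.

    Hypothetical helper for illustration. For each column it returns the mean,
    the highest observed score, and the share of rows at that score.
    Assumes the file has at least one data row.
    """
    with open(csv_path, newline="") as f:
        rows = list(csv.DictReader(f))

    summary = {}
    for col in columns:
        scores = [float(row[col]) for row in rows]
        top = max(scores)
        summary[col] = {
            "mean": mean(scores),
            "top_score": top,
            "share_at_top": sum(s == top for s in scores) / len(scores),
        }
    return summary
```

With the file produced above, this could be called as `summarize_scores(eval_file, ["Q1", "Q2", "Q3", "Q4", "Q5"])`.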
