{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# `Words2Wisdom` Demo\n", "\n", "For the purposes of this notebook, we add the `src` directory to the `PYTHONPATH`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "# add words2wisdom to PYTHONPATH\n", "sys.path.append(\"../src/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we load the example text file (from OpenStax Biology 2e, section 4.2):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus...\n" ] } ], "source": [ "# load example text\n", "with open(\"example.txt\") as f:\n", "    text = f.read()\n", "\n", "# print the first 200 characters of the example\n", "print(text[:200] + \"...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `words2wisdom` pipeline is configured from an INI file. We have prepared one for this demo; for your own runs, create one with your desired settings.\n", "\n", "After configuration, we call the pipeline's `run` method, then save all outputs to a ZIP file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initialized Text2KG pipeline:\n", "[INPUT: text] -> ClauseDeconstruction() -> TripletExtraction() -> [OUTPUT: knowledge graph]\n", "Running Text2KG pipeline:\n", "Extracting knowledge graph... Cleaning knowledge graph components... 
Done!\n", "Run ID: 2024-02-16-001\n", "Saved data to ./output-2024-02-16-001.zip\n" ] } ], "source": [ "from words2wisdom.pipeline import Pipeline\n", "from words2wisdom.utils import dump_all\n", "\n", "# build the pipeline from the INI file and run it on the example text\n", "w2w = Pipeline.from_ini(\"config.ini\")\n", "batches, graph = w2w.run(text)\n", "\n", "# save the pipeline, text batches, and knowledge graph to a ZIP file\n", "output_zip = dump_all(\n", "    pipeline=w2w,\n", "    text_batches=batches,\n", "    knowledge_graph=graph,\n", "    to_path=\".\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Here we use GPT-4 (`gpt-4-turbo-preview`) to automatically evaluate the knowledge graph." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initializing knowledge graph validation. Run: 2024-02-16-001\n", "\n", "Starting excerpt 1 of 6. Validating 7 triplets... Done!\n", "Starting excerpt 2 of 6. Validating 20 triplets... Done!\n", "Starting excerpt 3 of 6. Validating 20 triplets... Done!\n", "Starting excerpt 4 of 6. Validating 10 triplets... Done!\n", "Starting excerpt 5 of 6. Validating 10 triplets... Done!\n", "Starting excerpt 6 of 6. Validating 16 triplets... Done!\n", "\n", "Knowledge graph validation complete!\n", "It took 109.471 seconds to validate 83 triplets.\n", "Saved to: ./validation-2024-02-16-001.csv\n" ] } ], "source": [ "from langchain_openai import ChatOpenAI\n", "from words2wisdom.validate import validate_knowledge_graph\n", "\n", "llm = ChatOpenAI(model=\"gpt-4-turbo-preview\")\n", "\n", "eval_file = validate_knowledge_graph(llm=llm, output_zip=output_zip)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are five evaluation questions. The questions and their score ranges are defined in `config/validation.yml`. Here are the results:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
|       | Q1        | Q2        | Q3        | Q4        | Q5        |
|-------|-----------|-----------|-----------|-----------|-----------|
| count | 83.000000 | 83.000000 | 83.000000 | 83.000000 | 83.000000 |
| mean  | 0.975904  | 0.975904  | 0.975904  | 1.819277  | 1.566265  |
| std   | 0.154281  | 0.154281  | 0.154281  | 0.387128  | 0.522489  |
| min   | 0.000000  | 0.000000  | 0.000000  | 1.000000  | 0.000000  |
| 25%   | 1.000000  | 1.000000  | 1.000000  | 2.000000  | 1.000000  |
| 50%   | 1.000000  | 1.000000  | 1.000000  | 2.000000  | 2.000000  |
| 75%   | 1.000000  | 1.000000  | 1.000000  | 2.000000  | 2.000000  |
| max   | 1.000000  | 1.000000  | 1.000000  | 2.000000  | 2.000000  |