{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# `Words2Wisdom` Demo\n", "\n", "For the purposes of this notebook, we add the `src` directory to `PYTHONPATH`:" ] }, { "cell_type": "code", "execution_count": 1, "metadata": {}, "outputs": [], "source": [ "import sys\n", "\n", "# add words2wisdom to PYTHONPATH\n", "sys.path.append(\"../src/\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we load the example text file (from OpenStax Biology 2e, section 4.2):" ] }, { "cell_type": "code", "execution_count": 2, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus...\n" ] } ], "source": [ "# load example text\n", "with open(\"example.txt\") as f:\n", "    text = f.read()\n", "\n", "# print the first 200 characters of the example\n", "print(text[:200] + \"...\")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The `words2wisdom` pipeline is configured from an INI file. This demo uses a prepared `config.ini`; for your own runs, create one with your desired settings.\n", "\n", "After configuration, we call the pipeline's `run` method on the text, then save all outputs to a ZIP file." ] }, { "cell_type": "code", "execution_count": 3, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initialized Text2KG pipeline:\n", "[INPUT: text] -> ClauseDeconstruction() -> TripletExtraction() -> [OUTPUT: knowledge graph]\n", "Running Text2KG pipeline:\n", "Extracting knowledge graph... Cleaning knowledge graph components... 
Done!\n", "Run ID: 2024-02-16-001\n", "Saved data to ./output-2024-02-16-001.zip\n" ] } ], "source": [ "from words2wisdom.pipeline import Pipeline\n", "from words2wisdom.utils import dump_all\n", "\n", "# build the pipeline from the INI file and run it on the example text\n", "w2w = Pipeline.from_ini(\"config.ini\")\n", "batches, graph = w2w.run(text)\n", "\n", "# save the pipeline settings, text batches, and graph to a ZIP file\n", "output_zip = dump_all(\n", "    pipeline=w2w,\n", "    text_batches=batches,\n", "    knowledge_graph=graph,\n", "    to_path=\".\"\n", ")" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next, we use GPT-4 (`gpt-4-turbo-preview`) to automatically evaluate the extracted knowledge graph." ] }, { "cell_type": "code", "execution_count": 4, "metadata": {}, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Initializing knowledge graph validation. Run: 2024-02-16-001\n", "\n", "Starting excerpt 1 of 6. Validating 7 triplets... Done!\n", "Starting excerpt 2 of 6. Validating 20 triplets... Done!\n", "Starting excerpt 3 of 6. Validating 20 triplets... Done!\n", "Starting excerpt 4 of 6. Validating 10 triplets... Done!\n", "Starting excerpt 5 of 6. Validating 10 triplets... Done!\n", "Starting excerpt 6 of 6. Validating 16 triplets... Done!\n", "\n", "Knowledge graph validation complete!\n", "It took 109.471 seconds to validate 83 triplets.\n", "Saved to: ./validation-2024-02-16-001.csv\n" ] } ], "source": [ "from langchain_openai import ChatOpenAI\n", "from words2wisdom.validate import validate_knowledge_graph\n", "\n", "llm = ChatOpenAI(model=\"gpt-4-turbo-preview\")\n", "\n", "eval_file = validate_knowledge_graph(llm=llm, output_zip=output_zip)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "There are five evaluation questions (Q1 to Q5). The questions and their score ranges are defined in `config/validation.yml`. Here are summary statistics of the scores:" ] }, { "cell_type": "code", "execution_count": 5, "metadata": {}, "outputs": [ { "data": { "text/html": [ "
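<div>\n", "<table border=\"1\" class=\"dataframe\">\n",
"  <thead>\n", "    <tr style=\"text-align: right;\">\n", "      <th></th>\n", "      <th>Q1</th>\n", "      <th>Q2</th>\n", "      <th>Q3</th>\n", "      <th>Q4</th>\n", "      <th>Q5</th>\n", "    </tr>\n", "  </thead>\n",
"  <tbody>\n",
"    <tr>\n", "      <th>count</th>\n", "      <td>83.000000</td>\n", "      <td>83.000000</td>\n", "      <td>83.000000</td>\n", "      <td>83.000000</td>\n", "      <td>83.000000</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>mean</th>\n", "      <td>0.975904</td>\n", "      <td>0.975904</td>\n", "      <td>0.975904</td>\n", "      <td>1.819277</td>\n", "      <td>1.566265</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>std</th>\n", "      <td>0.154281</td>\n", "      <td>0.154281</td>\n", "      <td>0.154281</td>\n", "      <td>0.387128</td>\n", "      <td>0.522489</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>min</th>\n", "      <td>0.000000</td>\n", "      <td>0.000000</td>\n", "      <td>0.000000</td>\n", "      <td>1.000000</td>\n", "      <td>0.000000</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>25%</th>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>2.000000</td>\n", "      <td>1.000000</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>50%</th>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>2.000000</td>\n", "      <td>2.000000</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>75%</th>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>2.000000</td>\n", "      <td>2.000000</td>\n", "    </tr>\n",
"    <tr>\n", "      <th>max</th>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>1.000000</td>\n", "      <td>2.000000</td>\n", "      <td>2.000000</td>\n", "    </tr>\n",
"  </tbody>\n", "</table>\n", "</div>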
" ], "text/plain": [ " Q1 Q2 Q3 Q4 Q5\n", "count 83.000000 83.000000 83.000000 83.000000 83.000000\n", "mean 0.975904 0.975904 0.975904 1.819277 1.566265\n", "std 0.154281 0.154281 0.154281 0.387128 0.522489\n", "min 0.000000 0.000000 0.000000 1.000000 0.000000\n", "25% 1.000000 1.000000 1.000000 2.000000 1.000000\n", "50% 1.000000 1.000000 1.000000 2.000000 2.000000\n", "75% 1.000000 1.000000 1.000000 2.000000 2.000000\n", "max 1.000000 1.000000 1.000000 2.000000 2.000000" ] }, "execution_count": 5, "metadata": {}, "output_type": "execute_result" } ], "source": [ "import pandas as pd\n", "\n", "data = pd.read_csv(eval_file)\n", "data.describe(include=[int])" ] } ], "metadata": { "kernelspec": { "display_name": "nlp", "language": "python", "name": "python3" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 3 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython3", "version": "3.10.11" }, "orig_nbformat": 4 }, "nbformat": 4, "nbformat_minor": 2 }