johaunh committed on
Commit
4b9251f
·
1 Parent(s): 049a574

updated for ai4ed

README.md CHANGED
@@ -1,86 +1,82 @@
1
- ---
2
- title: Text2KG
3
- app_file: main.py
4
- sdk: gradio
5
- sdk_version: 3.39.0
6
- pinned: true
7
- license: mit
8
- emoji: 🧞📖
9
- colorFrom: indigo
10
- colorTo: gray
11
- ---
12
- # Text2KG
13
-
14
- We introduce Text2KG – an intuitive, domain-independent tool that leverages the creative generative ability of GPT-3.5 in the KG construction process. Text2KG automates and accelerates the construction of KGs from unstructured plain text, reducing the need for traditionally-used human labor and computer resources. Our approach incorporates a novel, clause-based text simplification step, reducing the processing of even the most extensive corpora down to the order of minutes. With Text2KG, we aim to streamline the creation of databases from natural language, offering a robust, cost-effective, and user-friendly solution for KG construction.
15
 
16
- ## Usage
17
 
18
- ### Gradio app
19
 
20
- #### Remotely
21
 
22
- Visit the [Text2KG HuggingFace Space](https://huggingface.co/spaces/jhatchett/Text2KG).
 
23
 
24
- #### Locally
25
 
26
- Clone this repository, and then use the command
27
 
28
- ```
29
- python main.py
30
- ```
31
 
32
- in the repository's directory.
33
 
34
- ### Within a `python` IDE
35
 
36
- Import the primary pipeline method using
37
-
38
- ```python
39
- >>> from main import extract_knowledge_graph
40
  ```
41
 
42
- **`extract_knowledge_graph` parameters**
43
 
44
- ```
45
- api_key (str)
46
- OpenAI API key
47
 
48
- batch_size (int)
49
- Number of sentences per forward pass
50
 
51
- modules (list)
52
- Additional modules to add before main extraction process (triplet_extraction). Must be a valid name in schema.yml
53
 
54
- text (str)
55
- Input text to extract knowledge graph from
56
 
57
- progress
58
- Progress bar. The default is Gradio's progress bar;
59
- set `progress = tqdm` for implementations outside of Gradio
60
- ```
61
 
62
- ### Using Gradio API
 
63
 
64
- Read more [here](https://www.gradio.app/docs/python-client).
 
 
 
65
 
66
  ## File structure
67
 
68
  ```
69
- chains.py
70
- Converts the items in schema.yml to LangChain modules
71
-
72
- requirements.txt
73
- Contains packages required to run Text2KG
74
-
75
- main.py
76
- Main pipeline/app code
77
-
78
- README.md
79
- This file
80
-
81
- schema.yml
82
- Contains definitions of modules -- prompts + output parsers
83
-
84
- utils.py
85
- Contains helper functions
86
  ```
 
1
+ # Words2Wisdom
2
 
3
+ This is the repository for Words2Wisdom. It is still a work in progress.
4
 
5
+ **Paper:** [here](./writeup/words2wisdom_short.pdf) (Accepted as poster to AAAI AI4ED '24 Workshop)
6
 
7
+ **Hugging Face Space:** [Words2Wisdom](https://huggingface.co/spaces/jhatchett/Words2Wisdom)
8
 
9
+ **Abstract:**
10
+ Large language models (LLMs) have emerged as powerful tools with vast potential across various domains. While they have the potential to transform the educational landscape with personalized learning experiences, these models face challenges such as high training and usage costs, and susceptibility to inaccuracies. One promising solution to these challenges lies in leveraging knowledge graphs (KGs) for knowledge injection. By integrating factual content into pre-trained LLMs, KGs can reduce the costs associated with domain alignment, mitigate the risk of hallucination, and enhance the interpretability of the models' outputs. To meet the need for efficient knowledge graph creation, we introduce *Words2Wisdom* (W2W), a domain-independent LLM-based tool that automatically generates KGs from plain text. With W2W, we aim to provide a streamlined KG construction option that can drive advancements in grounded LLM-based educational technologies.
11
 
12
+ ## Demonstration
13
 
14
+ The `demo/demo.ipynb` notebook walks through how to use the `words2wisdom` pipeline.
15
 
16
+ ## Usage
 
 
17
 
18
+ Due to the large number of configurable parameters, `words2wisdom` uses a configuration INI file:
19
 
20
+ ```ini
21
+ [pipeline]
22
+ words_per_batch = 150 # any positive integer
23
+ preprocess = clause_deconstruction # {None, clause_deconstruction}
24
+ extraction = triplet_extraction # {triplet_extraction}
25
 
26
+ [llm]
27
+ model = gpt-3.5-turbo
28
+ # other GPT params like temperature, etc. can be set here too
 
29
  ```
30
 
31
+ A template configuration file can be generated with the command-line interface. **Note:** If `openai_api_key` is not explicitly set, the config will automatically try to read from the `OPENAI_API_KEY` environment variable.
32
 
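+ For example, on a POSIX shell (the key value below is a placeholder):
+
+ ```
+ export OPENAI_API_KEY="sk-..."
+ ```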
33
+ ### From the CLI
 
 
34
 
35
+ All commands are preceded by `python -m words2wisdom`
 
36
 
37
+ | In order to... | Use the command... |
38
+ | -- | -- |
39
+ | Create a new config file | `init > path/to/write/config.ini` |
40
+ | Generate KG from text | `run path/to/text.txt [--config CONFIG] [--output-dir OUTPUT_DIR]` |
41
+ | Evaluate `words2wisdom` outputs | `eval path/to/output.zip` |
42
+ | Use `words2wisdom` from Gradio interface (default config only) | `gui` |
43
 
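+ For example, a typical end-to-end session might look like the following (file paths are illustrative; the output ZIP is named with a run ID of the form `YYYY-MM-DD-XXX`):
+
+ ```
+ python -m words2wisdom init > config.ini
+ python -m words2wisdom run my_text.txt --config config.ini --output-dir output
+ python -m words2wisdom eval output/output-2024-02-16-001.zip
+ ```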
44
+ ### As a `Python` package
 
45
 
46
+ Import the primary pipeline class using
 
 
 
47
 
48
+ ```python
49
+ from words2wisdom.pipeline import Pipeline
50
 
51
+ # configure pipeline from .ini
52
+ pipe = Pipeline.from_ini("path/to/config.ini")
53
+ text_batches, knowledge_graph = pipe.run("The cat sat on the mat")
54
+ ```
55
 
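+ Continuing from the snippet above, the outputs can be bundled into a ZIP archive (as the demo notebook does) with `words2wisdom.utils.dump_all`:
+
+ ```python
+ from words2wisdom.utils import dump_all
+
+ # writes config.ini, text_batches.csv, and kg.csv to output-<run-id>.zip
+ output_zip = dump_all(
+     pipeline=pipe,
+     text_batches=text_batches,
+     knowledge_graph=knowledge_graph,
+     to_path="."  # save the ZIP in the current directory
+ )
+ ```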
56
  ## File structure
57
 
58
  ```
59
+ β”œβ”€β”€ config
60
+ β”‚ β”œβ”€β”€ default_config.ini
61
+ β”‚ β”œβ”€β”€ modules.yml
62
+ β”‚ └── validation.yml
63
+ β”œβ”€β”€ demo
64
+ β”‚ β”œβ”€β”€ config.ini
65
+ β”‚ β”œβ”€β”€ demo.ipynb
66
+ β”‚ └── example.txt
67
+ β”œβ”€β”€ src/words2wisdom
68
+ β”‚ β”œβ”€β”€ __init__.py
69
+ β”‚ β”œβ”€β”€ __main__.py
70
+ β”‚ β”œβ”€β”€ cli.py
71
+ β”‚ β”œβ”€β”€ config.py
72
+ β”‚ β”œβ”€β”€ gui.py
73
+ β”‚ β”œβ”€β”€ output_parsers.py
74
+ β”‚ β”œβ”€β”€ pipeline.py
75
+ β”‚ β”œβ”€β”€ utils.py
76
+ β”‚ └── validate.py
77
+ β”œβ”€β”€ writeup
78
+ β”‚ └── words2wisdom_short.pdf
79
+ β”œβ”€β”€ LICENSE.md
80
+ β”œβ”€β”€ README.md
81
+ └── requirements.txt
82
  ```
config/default_config.ini ADDED
@@ -0,0 +1,8 @@
1
+ [pipeline]
2
+ words_per_batch = 150
3
+ preprocess = clause_deconstruction
4
+ extraction = triplet_extraction
5
+
6
+ [llm]
7
+ openai_api_key = None
8
+ model = gpt-3.5-turbo
schema.yml β†’ config/modules.yml RENAMED
@@ -1,8 +1,28 @@
1
  clause_deconstruction:
2
  parser: ClauseParser
3
  prompts:
4
  system: |
5
- You are a sentence parsing agent helping to construct a knowledge graph.
6
 
7
  Given the text, extract a list of the premises embedded within it.
8
  Focus on identifying declarative sentences that convey factual information.
@@ -19,6 +39,7 @@ clause_deconstruction:
19
  {text}
20
 
21
  triplet_extraction:
 
22
  parser: TripletParser
23
  prompts:
24
  system: |
 
1
  clause_deconstruction:
2
+ type: preprocess
3
+ parser: StrOutputParser
4
+ prompts:
5
+ system: |
6
+ You are a sentence parsing agent helping to simplify complex syntax.
7
+
8
+ The aim is to split the given text into a meaning-preserving sequence of simpler, shorter sentences.
9
+ Each of the short sentences should be self-contained, and should not co-reference other sentences.
10
+ Try to aim for one clause per sentence. Your response should be the split/rephrased text ONLY.
11
+
12
+ EXAMPLE: Dogs and cats chase squirrels, but fish do not.
13
+
14
+ ACCEPTABLE RESPONSE: Dogs chase squirrels. Cats chase squirrels. Fish do not chase squirrels.
15
+ UNACCEPTABLE RESPONSE: Dogs and cats chase squirrels. Fish do not chase them.
16
+
17
+ human: |
18
+ {text}
19
+
20
+ premise_extraction:
21
+ type: preprocess
22
  parser: ClauseParser
23
  prompts:
24
  system: |
25
+ You are a sentence parsing agent helping to simplify complex syntax.
26
 
27
  Given the text, extract a list of the premises embedded within it.
28
  Focus on identifying declarative sentences that convey factual information.
 
39
  {text}
40
 
41
  triplet_extraction:
42
+ type: extraction
43
  parser: TripletParser
44
  prompts:
45
  system: |
config/validation.yml ADDED
@@ -0,0 +1,61 @@
1
+ instruction: >
2
+ We are in the process of evaluating an AI-extracted knowledge graph.
3
+ Your task involves assessing the accuracy and specificity of a component
4
+ of the graph known as a triplet, which represents a key idea or fact. A
5
+ triplet comprises three components: a subject entity s, a relation r, and
6
+ an object entity o (format: [s, r, o]). We emphasize that the order of these
7
+ components is significant; the subject s relates to the object o via the
8
+ relation r. That is, the relation points from the subject to the object. Our
9
+ AI agent extracts a collection of triplets from each passage provided. The
10
+ evaluation task has 5 subtasks:
11
+ questions:
12
+ Q1:
13
+ title: Specificity (subject entity).
14
+ text: Does the subject entity represent a specific term/concept referenced in the passage?
15
+ additional:
16
+ options:
17
+ - 1 = Specific
18
+ - 0 = Not specific
19
+ Q2:
20
+ title: Specificity (object entity).
21
+ text: Does the object entity represent a specific term/concept referenced in the passage?
22
+ additional:
23
+ options:
24
+ - 1 = Specific
25
+ - 0 = Not specific
26
+ Q3:
27
+ title: Relation Validity.
28
+ text: Does the relation correctly connect the subject entity to the object entity?
29
+ additional:
30
+ options:
31
+ - 1 = Correct
32
+ - 0 = Incorrect
33
+ Q4:
34
+ title: Triplet Relevance.
35
+ text: Evaluate the importance of the triplet in relation to the passage's meaning.
36
+ additional: Rate from 0 to 2.
37
+ options:
38
+ - 0 = Not relevant to understanding
39
+ - 1 = Helpful, but not essential to understanding
40
+ - 2 = Critical fact of passage
41
+ Q5:
42
+ title: Triplet Comprehensiveness.
43
+ text: Assess whether the triplet can function independently from the passage, effectively conveying one of the passage's key ideas.
44
+ additional: Rate from 0 to 2.
45
+ options:
46
+ - 0 = Cannot function without context
47
+ - 1 = Somewhat dependent on passage
48
+ - 2 = Comprehensive
49
+ example:
50
+ passage: >
51
+ Several cells of one kind that interconnect with each other and perform a shared function form
52
+ tissues. These tissues combine to form an organ (your stomach, heart, or brain), and several
53
+ organs comprise an organ system (such as the digestive system, circulatory system, or nervous
54
+ system).
55
+ triplet: "['organ', 'such as', 'heart']"
56
+ answers:
57
+ Q1: 1 - The entity 'organ' is specific and is mentioned in the passage.
58
+ Q2: 1 - The entity 'heart' is specific and is mentioned in the passage.
59
+ Q3: 1 - The relation 'such as' is unclear. A better relation would be 'superclass of'.
60
+ Q4: 1 - The triplet is relatively important as it is used as a parenthetical example in the passage.
61
+ Q5: 0 - The provided triplet is unclear as is. "Organ such as heart" doesn't make sense.
demo/config.ini ADDED
@@ -0,0 +1,8 @@
1
+ [pipeline]
2
+ words_per_batch = 150
3
+ preprocess = clause_deconstruction
4
+ extraction = triplet_extraction
5
+
6
+ [llm]
7
+ model = gpt-3.5-turbo
8
+ temperature = 0.3
demo/demo.ipynb ADDED
@@ -0,0 +1,291 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "markdown",
5
+ "metadata": {},
6
+ "source": [
7
+ "# `Words2Wisdom` Demo\n",
8
+ "\n",
9
+ "For purpose of the notebook, we add the `src` director to the `PYTHONPATH`:"
10
+ ]
11
+ },
12
+ {
13
+ "cell_type": "code",
14
+ "execution_count": 1,
15
+ "metadata": {},
16
+ "outputs": [],
17
+ "source": [
18
+ "import sys\n",
19
+ "\n",
20
+ "# add words2wisdom to PYTHONPATH\n",
21
+ "sys.path.append(\"../src/\")"
22
+ ]
23
+ },
24
+ {
25
+ "cell_type": "markdown",
26
+ "metadata": {},
27
+ "source": [
28
+ "Next, we load in the example text file (from OpenStax Bio 2e chapter 4.2):"
29
+ ]
30
+ },
31
+ {
32
+ "cell_type": "code",
33
+ "execution_count": 2,
34
+ "metadata": {},
35
+ "outputs": [
36
+ {
37
+ "name": "stdout",
38
+ "output_type": "stream",
39
+ "text": [
40
+ "Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus...\n"
41
+ ]
42
+ }
43
+ ],
44
+ "source": [
45
+ "# load example text\n",
46
+ "with open(\"example.txt\") as f:\n",
47
+ " text = f.read()\n",
48
+ "\n",
49
+ "# print example\n",
50
+ "print(text[:200] + \"...\")"
51
+ ]
52
+ },
53
+ {
54
+ "cell_type": "markdown",
55
+ "metadata": {},
56
+ "source": [
57
+ "The `words2wisdom` pipeline can be configured from a configuration INI file. We have one prepared already, but you will need to create one with your desired settings.\n",
58
+ "\n",
59
+ "After configuration, we call the `run` process. Then, we save all outputs to a ZIP file."
60
+ ]
61
+ },
62
+ {
63
+ "cell_type": "code",
64
+ "execution_count": 3,
65
+ "metadata": {},
66
+ "outputs": [
67
+ {
68
+ "name": "stdout",
69
+ "output_type": "stream",
70
+ "text": [
71
+ "Initialized Text2KG pipeline:\n",
72
+ "[INPUT: text] -> ClauseDeconstruction() -> TripletExtraction() -> [OUTPUT: knowledge graph]\n",
73
+ "Running Text2KG pipeline:\n",
74
+ "Extracting knowledge graph... Cleaning knowledge graph components... Done!\n",
75
+ "Run ID: 2024-02-16-001\n",
76
+ "Saved data to ./output-2024-02-16-001.zip\n"
77
+ ]
78
+ }
79
+ ],
80
+ "source": [
81
+ "from words2wisdom.pipeline import Pipeline\n",
82
+ "from words2wisdom.utils import dump_all\n",
83
+ "\n",
84
+ "w2w = Pipeline.from_ini(\"config.ini\")\n",
85
+ "batches, graph = w2w.run(text)\n",
86
+ "\n",
87
+ "output_zip = dump_all(\n",
88
+ " pipeline=w2w,\n",
89
+ " text_batches=batches,\n",
90
+ " knowledge_graph=graph,\n",
91
+ " to_path=\".\"\n",
92
+ ")"
93
+ ]
94
+ },
95
+ {
96
+ "cell_type": "markdown",
97
+ "metadata": {},
98
+ "source": [
99
+ "Here we use GPT-4 to auto-evaluate the knowledge graph."
100
+ ]
101
+ },
102
+ {
103
+ "cell_type": "code",
104
+ "execution_count": 4,
105
+ "metadata": {},
106
+ "outputs": [
107
+ {
108
+ "name": "stdout",
109
+ "output_type": "stream",
110
+ "text": [
111
+ "Initializing knowledge graph validation. Run: 2024-02-16-001\n",
112
+ "\n",
113
+ "Starting excerpt 1 of 6. Validating 7 triplets... Done!\n",
114
+ "Starting excerpt 2 of 6. Validating 20 triplets... Done!\n",
115
+ "Starting excerpt 3 of 6. Validating 20 triplets... Done!\n",
116
+ "Starting excerpt 4 of 6. Validating 10 triplets... Done!\n",
117
+ "Starting excerpt 5 of 6. Validating 10 triplets... Done!\n",
118
+ "Starting excerpt 6 of 6. Validating 16 triplets... Done!\n",
119
+ "\n",
120
+ "Knowledge graph validation complete!\n",
121
+ "It took 109.471 seconds to validate 83 triplets.\n",
122
+ "Saved to: ./validation-2024-02-16-001.csv\n"
123
+ ]
124
+ }
125
+ ],
126
+ "source": [
127
+ "from langchain_openai import ChatOpenAI\n",
128
+ "from words2wisdom.validate import validate_knowledge_graph\n",
129
+ "\n",
130
+ "llm = ChatOpenAI(model=\"gpt-4-turbo-preview\")\n",
131
+ "\n",
132
+ "eval_file = validate_knowledge_graph(llm=llm, output_zip=output_zip)"
133
+ ]
134
+ },
135
+ {
136
+ "cell_type": "markdown",
137
+ "metadata": {},
138
+ "source": [
139
+ "There are 5 evaluation questions. The questions and score ranges can be found in `config/validation.yml`. Here are the results:"
140
+ ]
141
+ },
142
+ {
143
+ "cell_type": "code",
144
+ "execution_count": 5,
145
+ "metadata": {},
146
+ "outputs": [
147
+ {
148
+ "data": {
149
+ "text/html": [
150
+ "<div>\n",
151
+ "<style scoped>\n",
152
+ " .dataframe tbody tr th:only-of-type {\n",
153
+ " vertical-align: middle;\n",
154
+ " }\n",
155
+ "\n",
156
+ " .dataframe tbody tr th {\n",
157
+ " vertical-align: top;\n",
158
+ " }\n",
159
+ "\n",
160
+ " .dataframe thead th {\n",
161
+ " text-align: right;\n",
162
+ " }\n",
163
+ "</style>\n",
164
+ "<table border=\"1\" class=\"dataframe\">\n",
165
+ " <thead>\n",
166
+ " <tr style=\"text-align: right;\">\n",
167
+ " <th></th>\n",
168
+ " <th>Q1</th>\n",
169
+ " <th>Q2</th>\n",
170
+ " <th>Q3</th>\n",
171
+ " <th>Q4</th>\n",
172
+ " <th>Q5</th>\n",
173
+ " </tr>\n",
174
+ " </thead>\n",
175
+ " <tbody>\n",
176
+ " <tr>\n",
177
+ " <th>count</th>\n",
178
+ " <td>83.000000</td>\n",
179
+ " <td>83.000000</td>\n",
180
+ " <td>83.000000</td>\n",
181
+ " <td>83.000000</td>\n",
182
+ " <td>83.000000</td>\n",
183
+ " </tr>\n",
184
+ " <tr>\n",
185
+ " <th>mean</th>\n",
186
+ " <td>0.975904</td>\n",
187
+ " <td>0.975904</td>\n",
188
+ " <td>0.975904</td>\n",
189
+ " <td>1.819277</td>\n",
190
+ " <td>1.566265</td>\n",
191
+ " </tr>\n",
192
+ " <tr>\n",
193
+ " <th>std</th>\n",
194
+ " <td>0.154281</td>\n",
195
+ " <td>0.154281</td>\n",
196
+ " <td>0.154281</td>\n",
197
+ " <td>0.387128</td>\n",
198
+ " <td>0.522489</td>\n",
199
+ " </tr>\n",
200
+ " <tr>\n",
201
+ " <th>min</th>\n",
202
+ " <td>0.000000</td>\n",
203
+ " <td>0.000000</td>\n",
204
+ " <td>0.000000</td>\n",
205
+ " <td>1.000000</td>\n",
206
+ " <td>0.000000</td>\n",
207
+ " </tr>\n",
208
+ " <tr>\n",
209
+ " <th>25%</th>\n",
210
+ " <td>1.000000</td>\n",
211
+ " <td>1.000000</td>\n",
212
+ " <td>1.000000</td>\n",
213
+ " <td>2.000000</td>\n",
214
+ " <td>1.000000</td>\n",
215
+ " </tr>\n",
216
+ " <tr>\n",
217
+ " <th>50%</th>\n",
218
+ " <td>1.000000</td>\n",
219
+ " <td>1.000000</td>\n",
220
+ " <td>1.000000</td>\n",
221
+ " <td>2.000000</td>\n",
222
+ " <td>2.000000</td>\n",
223
+ " </tr>\n",
224
+ " <tr>\n",
225
+ " <th>75%</th>\n",
226
+ " <td>1.000000</td>\n",
227
+ " <td>1.000000</td>\n",
228
+ " <td>1.000000</td>\n",
229
+ " <td>2.000000</td>\n",
230
+ " <td>2.000000</td>\n",
231
+ " </tr>\n",
232
+ " <tr>\n",
233
+ " <th>max</th>\n",
234
+ " <td>1.000000</td>\n",
235
+ " <td>1.000000</td>\n",
236
+ " <td>1.000000</td>\n",
237
+ " <td>2.000000</td>\n",
238
+ " <td>2.000000</td>\n",
239
+ " </tr>\n",
240
+ " </tbody>\n",
241
+ "</table>\n",
242
+ "</div>"
243
+ ],
244
+ "text/plain": [
245
+ " Q1 Q2 Q3 Q4 Q5\n",
246
+ "count 83.000000 83.000000 83.000000 83.000000 83.000000\n",
247
+ "mean 0.975904 0.975904 0.975904 1.819277 1.566265\n",
248
+ "std 0.154281 0.154281 0.154281 0.387128 0.522489\n",
249
+ "min 0.000000 0.000000 0.000000 1.000000 0.000000\n",
250
+ "25% 1.000000 1.000000 1.000000 2.000000 1.000000\n",
251
+ "50% 1.000000 1.000000 1.000000 2.000000 2.000000\n",
252
+ "75% 1.000000 1.000000 1.000000 2.000000 2.000000\n",
253
+ "max 1.000000 1.000000 1.000000 2.000000 2.000000"
254
+ ]
255
+ },
256
+ "execution_count": 5,
257
+ "metadata": {},
258
+ "output_type": "execute_result"
259
+ }
260
+ ],
261
+ "source": [
262
+ "import pandas as pd\n",
263
+ "\n",
264
+ "data = pd.read_csv(eval_file)\n",
265
+ "data.describe(include=[int])"
266
+ ]
267
+ }
268
+ ],
269
+ "metadata": {
270
+ "kernelspec": {
271
+ "display_name": "nlp",
272
+ "language": "python",
273
+ "name": "python3"
274
+ },
275
+ "language_info": {
276
+ "codemirror_mode": {
277
+ "name": "ipython",
278
+ "version": 3
279
+ },
280
+ "file_extension": ".py",
281
+ "mimetype": "text/x-python",
282
+ "name": "python",
283
+ "nbconvert_exporter": "python",
284
+ "pygments_lexer": "ipython3",
285
+ "version": "3.10.11"
286
+ },
287
+ "orig_nbformat": 4
288
+ },
289
+ "nbformat": 4,
290
+ "nbformat_minor": 2
291
+ }
demo/example.txt ADDED
@@ -0,0 +1 @@
 
 
1
+ Cells fall into one of two broad categories: prokaryotic and eukaryotic. We classify only the predominantly single-celled organisms Bacteria and Archaea as prokaryotes (pro- = before; -kary- = nucleus). Animal cells, plants, fungi, and protists are all eukaryotes (eu- = true). All cells share four common components: 1) a plasma membrane, an outer covering that separates the cell's interior from its surrounding environment; 2) cytoplasm, consisting of a jelly-like cytosol within the cell in which there are other cellular components; 3) DNA, the cell's genetic material; and 4) ribosomes, which synthesize proteins. However, prokaryotes differ from eukaryotic cells in several ways. A prokaryote is a simple, mostly single-celled (unicellular) organism that lacks a nucleus, or any other membrane-bound organelle. We will shortly come to see that this is significantly different in eukaryotes. Prokaryotic DNA is in the cell's central part: the nucleoid (Figure 4.5). This figure shows the generalized structure of a prokaryotic cell. All prokaryotes have chromosomal DNA localized in a nucleoid, ribosomes, a cell membrane, and a cell wall. The other structures shown are present in some, but not all, bacteria. Most prokaryotes have a peptidoglycan cell wall and many have a polysaccharide capsule (Figure 4.5). The cell wall acts as an extra layer of protection, helps the cell maintain its shape, and prevents dehydration. The capsule enables the cell to attach to surfaces in its environment. Some prokaryotes have flagella, pili, or fimbriae. Flagella are used for locomotion. Pili exchange genetic material during conjugation, the process by which one bacterium transfers genetic material to another through direct contact. Bacteria use fimbriae to attach to a host cell. The most effective action anyone can take to prevent the spread of contagious illnesses is to wash his or her hands. Why? Because microbes (organisms so tiny that they can only be seen with microscopes) are ubiquitous. They live on doorknobs, money, your hands, and many other surfaces. If someone sneezes into his hand and touches a doorknob, and afterwards you touch that same doorknob, the microbes from the sneezer's mucus are now on your hands. If you touch your hands to your mouth, nose, or eyes, those microbes can enter your body and could make you sick. However, not all microbes (also called microorganisms) cause disease; most are actually beneficial. You have microbes in your gut that make vitamin K. Other microorganisms are used to ferment beer and wine. Microbiologists are scientists who study microbes. Microbiologists can pursue a number of careers. Not only do they work in the food industry, they are also employed in the veterinary and medical fields. They can work in the pharmaceutical sector, serving key roles in research and development by identifying new antibiotic sources that can treat bacterial infections. Environmental microbiologists may look for new ways to use specially selected or genetically engineered microbes to remove pollutants from soil or groundwater, as well as hazardous elements from contaminated sites. We call using these microbes bioremediation technologies. Microbiologists can also work in the bioinformatics field, providing specialized knowledge and insight for designing, developing, and specificity of computer models of, for example, bacterial epidemics. 
At 0.1 to 5.0 μm in diameter, prokaryotic cells are significantly smaller than eukaryotic cells, which have diameters ranging from 10 to 100 μm (Figure 4.6). The prokaryotes' small size allows ions and organic molecules that enter them to quickly diffuse to other parts of the cell. Similarly, any wastes produced within a prokaryotic cell can quickly diffuse. This is not the case in eukaryotic cells, which have developed different structural adaptations to enhance intracellular transport. This figure shows relative sizes of microbes on a logarithmic scale (recall that each unit of increase in a logarithmic scale represents a 10-fold increase in the quantity measured). Small size, in general, is necessary for all cells, whether prokaryotic or eukaryotic. Let's examine why that is so. First, we'll consider the area and volume of a typical cell. Not all cells are spherical in shape, but most tend to approximate a sphere. You may remember from your high school geometry course that the formula for the surface area of a sphere is 4πr², while the formula for its volume is 4πr³/3. Thus, as the radius of a cell increases, its surface area increases as the square of its radius, but its volume increases as the cube of its radius (much more rapidly). Therefore, as a cell increases in size, its surface area-to-volume ratio decreases. This same principle would apply if the cell had a cube shape (Figure 4.7). If the cell grows too large, the plasma membrane will not have sufficient surface area to support the rate of diffusion required for the increased volume. In other words, as a cell grows, it becomes less efficient. One way to become more efficient is to divide. Another way is to develop organelles that perform specific tasks. These adaptations lead to developing more sophisticated cells, which we call eukaryotic cells. Notice that as a cell increases in size, its surface area-to-volume ratio decreases. When there is insufficient surface area to support a cell's increasing volume, a cell will either divide or die. The cell on the left has a volume of 1 mm³ and a surface area of 6 mm², with a surface area-to-volume ratio of 6 to 1; whereas, the cell on the right has a volume of 8 mm³ and a surface area of 24 mm², with a surface area-to-volume ratio of 3 to 1. Prokaryotic cells are much smaller than eukaryotic cells. What advantages might small cell size confer on a cell? What advantages might large cell size have?
main.py DELETED
@@ -1,263 +0,0 @@
1
- import os
2
- import re
3
- import secrets
4
- import string
5
- import yaml
6
- from datetime import datetime
7
- from zipfile import ZipFile
8
-
9
- import gradio as gr
10
- import nltk
11
- import pandas as pd
12
- from langchain.embeddings import OpenAIEmbeddings
13
- from langchain.chains import SimpleSequentialChain
14
- from langchain.chat_models import ChatOpenAI
15
- from nltk.tokenize import sent_tokenize
16
- from pandas import DataFrame
17
-
18
- import utils
19
- from chains import llm_chains
20
-
21
-
22
- # download NLTK dependencies
23
- nltk.download("punkt")
24
- nltk.download("stopwords")
25
-
26
- # load stop words const.
27
- from nltk.corpus import stopwords
28
- STOP_WORDS = stopwords.words("english")
29
-
30
- # load global spacy model
31
- # try:
32
- # SPACY_MODEL = spacy.load("en_core_web_sm")
33
- # except OSError:
34
- # print("[spacy] Downloading model: en_core_web_sm")
35
-
36
- # spacy.cli.download("en_core_web_sm")
37
- # SPACY_MODEL = spacy.load("en_core_web_sm")
38
-
39
-
40
- class Text2KG:
41
- """Text2KG class."""
42
-
43
- def __init__(self, api_key: str, **kwargs):
44
-
45
- self.llm = ChatOpenAI(openai_api_key=api_key, **kwargs)
46
- self.embedding = OpenAIEmbeddings(openai_api_key=api_key)
47
-
48
-
49
- def init(self, steps: list[str]):
50
- """Initialize Text2KG pipeline from passed steps.
51
-
52
- Args:
53
- *steps (str): Steps to include in pipeline. Must be a top-level name present in
54
- the schema.yml file
55
- """
56
- self.pipeline = SimpleSequentialChain(
57
- chains=[llm_chains[step](llm=self.llm) for step in steps],
58
- verbose=False
59
- )
60
-
61
-
62
- def run(self, text: str) -> list[dict]:
63
- """Run Text2KG pipeline on passed text.
64
-
65
- Arg:
66
- text (str): The text input
67
-
68
- Returns:
69
- triplets (list): The list of extracted KG triplets
70
- """
71
- triplets = self.pipeline.run(text)
72
-
73
- return triplets
74
-
75
-
76
- def clean(self, kg: DataFrame) -> DataFrame:
77
- """Text2KG post-processing."""
78
- drop_list = []
79
-
80
- for i, row in kg.iterrows():
81
- # drop stopwords (e.g. pronouns)
82
- if (row.subject in STOP_WORDS) or (row.object in STOP_WORDS):
83
- drop_list.append(i)
84
-
85
- # drop broken triplets
86
- elif row.hasnans:
87
- drop_list.append(i)
88
-
89
- # lowercase nodes/edges, drop articles
90
- else:
91
- article_pattern = r'^(the|a|an) (.+)'
92
- be_pattern = r'^(are|is) (a )?(.+)'
93
-
94
- kg.at[i, "subject"] = re.sub(article_pattern, r'\2', row.subject.lower())
95
- kg.at[i, "relation"] = re.sub(be_pattern, r'\3', row.relation.lower())
96
- kg.at[i, "object"] = re.sub(article_pattern, r'\2', row.object.lower())
97
-
98
- return kg.drop(drop_list)
99
-
100
-
101
- def normalize(self, kg: DataFrame, threshold: float=0.3) -> DataFrame:
102
- """Reduce dimensionality of Text2KG output by merging cosine-similar nodes/edges."""
103
-
104
- ents = pd.concat([kg["subject"], kg["object"]]).unique()
105
- rels = kg["relation"].unique()
106
-
107
- ent_map = utils.condense_labels(ents, self.embedding.embed_documents, threshold=threshold)
108
- rel_map = utils.condense_labels(rels, self.embedding.embed_documents, threshold=threshold)
109
-
110
- kg_normal = pd.DataFrame()
111
-
112
- kg_normal["subject"] = kg["subject"].map(ent_map)
113
- kg_normal["relation"] = kg["relation"].map(rel_map)
114
- kg_normal["object"] = kg["object"].map(ent_map)
115
-
116
- return kg_normal
117
-
118
-
119
- def extract_knowledge_graph(api_key: str, batch_size: int, modules: list[str], text: str, progress=gr.Progress()):
120
- """Extract knowledge graph from text.
121
-
122
- Args:
123
- api_key (str): OpenAI API key
124
- batch_size (int): Number of sentences per forward pass
125
- modules (list): Additional modules to add before main extraction step
126
- text (str): Text from which Text2KG will extract knowledge graph from
127
- progress: Progress bar. The default is gradio's progress bar; for a
128
- command line progress bar, set `progress = tqdm`
129
-
130
- Returns:
131
- zip_path (str): Path to ZIP archive containing outputs
132
- knowledge_graph (DataFrame): The extracted knowledge graph
133
- """
134
- # init
135
- if api_key == "":
136
- raise ValueError("API key is required")
137
-
138
- pipeline = Text2KG(api_key=api_key, temperature=0.3) # low temp. -> low randomness
139
-
140
- steps = []
141
-
142
- for module in modules:
143
- m = module.lower().replace(' ', '_')
144
- steps.append(m)
145
-
146
- if (len(steps) == 0) or (steps[-1] != "triplet_extraction"):
147
- steps.append("triplet_extraction")
148
- steps = []
149
-
150
- for module in modules:
151
- m = module.lower().replace(' ', '_')
152
- steps.append(m)
153
-
154
- if (len(steps) == 0) or (steps[-1] != "triplet_extraction"):
155
- steps.append("triplet_extraction")
156
-
157
- pipeline.init(steps)
158
-
159
- # split text into batches
160
- sentences = sent_tokenize(text)
161
- batches = [" ".join(sentences[i:i+batch_size])
162
- for i in range(0, len(sentences), batch_size)]
163
-
164
- # create KG
165
- knowledge_graph = []
166
-
167
- for i, batch in progress.tqdm(list(enumerate(batches)),
168
- desc="Processing...", unit="batches"):
169
- output = pipeline.run(batch)
170
- [triplet.update({"sentence_id": i}) for triplet in output]
171
-
172
- knowledge_graph.extend(output)
173
-
174
-
175
- # convert to df, post-process data
176
- knowledge_graph = pd.DataFrame(knowledge_graph)
177
- knowledge_graph = pipeline.clean(knowledge_graph)
178
-
179
- # rearrange columns
180
- knowledge_graph = knowledge_graph[["sentence_id", "subject", "relation", "object"]]
181
-
182
- # metadata
183
- now = datetime.now()
184
- date = str(now.date())
185
-
186
- metadata = {
187
- "_timestamp": now,
188
- "batch_size": batch_size,
189
- "modules": steps
190
- }
191
-
192
- # unique identifier for local saving
193
- uid = ''.join(secrets.choice(string.ascii_letters)
194
- for _ in range(6))
195
-
196
- print(f"Run ID: {date}/{uid}")
197
-
198
- save_dir = os.path.join(".", "output", date, uid)
199
- os.makedirs(save_dir, exist_ok=True)
200
-
201
-
202
- # save metadata & data
203
- with open(os.path.join(save_dir, "metadata.yml"), 'w') as f:
204
- yaml.dump(metadata, f)
205
-
206
- batches_df = pd.DataFrame(enumerate(batches), columns=["sentence_id", "text"])
207
- batches_df.to_csv(os.path.join(save_dir, "sentences.txt"),
208
- index=False)
209
-
210
- knowledge_graph.to_csv(os.path.join(save_dir, "kg.txt"),
211
- index=False)
212
-
213
-
214
- # create ZIP file
215
- zip_path = os.path.join(save_dir, "output.zip")
216
-
217
- with ZipFile(zip_path, 'w') as zipObj:
218
-
219
- zipObj.write(os.path.join(save_dir, "metadata.yml"))
220
- zipObj.write(os.path.join(save_dir, "sentences.txt"))
221
- zipObj.write(os.path.join(save_dir, "kg.txt"))
222
-
223
- return zip_path, knowledge_graph
224
-
225
-
226
- class App:
227
- def __init__(self):
228
- demo = gr.Interface(
229
- fn=extract_knowledge_graph,
230
- title="Text2KG",
231
- inputs=[
232
- gr.Textbox(placeholder="API key...", label="OpenAI API Key", type="password"),
233
- gr.Slider(minimum=1, maximum=10, step=1, label="Sentence Batch Size"),
234
- gr.CheckboxGroup(choices=["Clause Deconstruction"], label="Optional Modules"),
235
- gr.Textbox(lines=2, placeholder="Text Here...", label="Input Text"),
236
- ],
237
- outputs=[
238
- gr.File(label="Knowledge Graph"),
239
- gr.DataFrame(label="Preview",
240
- headers=["sentence_id", "subject", "relation", "object"],
241
- max_rows=10,
242
- overflow_row_behaviour="paginate")
243
- ],
244
- examples=[[None, 1, [], ("All cells share four common components: "
245
- "1) a plasma membrane, an outer covering that separates the "
246
- "cell's interior from its surrounding environment; 2) cytoplasm, "
247
- "consisting of a jelly-like cytosol within the cell in which "
248
- "there are other cellular components; 3) DNA, the cell's genetic "
249
- "material; and 4) ribosomes, which synthesize proteins. However, "
250
- "prokaryotes differ from eukaryotic cells in several ways. A "
251
- "prokaryote is a simple, mostly single-celled (unicellular) "
252
- "organism that lacks a nucleus, or any other membrane-bound "
253
- "organelle. We will shortly come to see that this is significantly "
254
- "different in eukaryotes. Prokaryotic DNA is in the cell's central "
255
- "part: the nucleoid.")]],
256
- allow_flagging="never",
257
- cache_examples=False
258
- )
259
- demo.queue().launch(share=False)
260
-
261
-
262
- if __name__ == "__main__":
263
- App()
requirements.txt CHANGED
@@ -1,8 +1,8 @@
1
- gradio==4.2.0
2
- langchain==0.0.335
3
- nltk==3.7
4
- openai==0.27.4
5
- pandas==2.0.3
6
- PyYAML==6.0
7
- scikit-learn==1.2.2
8
- tqdm==4.65.0
 
1
+ gradio
2
+ langchain
3
+ langchain_core
4
+ langchain_openai
5
+ nltk
6
+ openai
7
+ pandas
8
+ PyYAML
src/words2wisdom/__init__.py ADDED
@@ -0,0 +1,30 @@
1
+ import os
2
+ import sys
3
+ import yaml
4
+
5
+ import nltk
6
+
7
+ # directories
8
+ PACKAGE_DIR = os.path.dirname(__file__)
9
+ ROOT = os.path.dirname(os.path.dirname(PACKAGE_DIR))
10
+ DATA_DIR = os.path.join(ROOT, "data")
11
+ CONFIG_DIR = os.path.join(ROOT, "config")
12
+ OUTPUT_DIR = os.path.join(ROOT, "output")
13
+
14
+ # add the package root directory to the python path
15
+ sys.path.append(os.path.dirname(PACKAGE_DIR))
16
+
17
+ # files
18
+ with open(os.path.join(CONFIG_DIR, "modules.yml")) as f:
19
+ MODULES_CONFIG = yaml.safe_load(f)
20
+
21
+ with open(os.path.join(CONFIG_DIR, "validation.yml")) as f:
22
+ VALIDATION_CONFIG = yaml.safe_load(f)
23
+
24
+ # download NLTK dependencies
25
+ nltk.download("punkt", quiet=True)
26
+ nltk.download("stopwords", quiet=True)
27
+
28
+ # load NLTK stop words
29
+ from nltk.corpus import stopwords
30
+ STOP_WORDS = stopwords.words("english")
src/words2wisdom/__main__.py ADDED
@@ -0,0 +1,3 @@
1
+ if __name__ == "__main__":
2
+ from . import cli
3
+ cli.main()
src/words2wisdom/cli.py ADDED
@@ -0,0 +1,97 @@
1
+ import argparse
2
+ import os
3
+ import subprocess
4
+
5
+ from langchain_openai import ChatOpenAI
6
+
7
+ from . import CONFIG_DIR, OUTPUT_DIR
8
+ from .pipeline import Pipeline
9
+ from .utils import dump_all
10
+ from .validate import validate_knowledge_graph
11
+
12
+
13
+ default_config_path = os.path.join(CONFIG_DIR, "default_config.ini")
14
+
15
+
16
+ def main():
17
+ parser = argparse.ArgumentParser(
18
+ prog="words2wisdom",
19
+ #description="Generate a knowledge graph from a given text using OpenAI LLMs"
20
+ )
21
+ subparsers = parser.add_subparsers(dest="command",
22
+ help="Available commands")
23
+
24
+ # init
25
+ parser_init = subparsers.add_parser("init",
26
+ help="Return the default config.ini file",
27
+ description="Return the default config.ini file")
28
+ parser_init.set_defaults(func=get_default_config)
29
+
30
+ # gui
31
+ parser_gui = subparsers.add_parser("gui",
32
+ help="run Words2Wisdom using Gradio interface",
33
+ description="run Words2Wisdom using Gradio interface")
34
+ parser_gui.set_defaults(func=gui)
35
+
36
+ # run
37
+ parser_run = subparsers.add_parser("run",
38
+ help="Generate a knowledge graph from a given text using OpenAI LLMs",
39
+ description="Generate a knowledge graph from a given text using OpenAI LLMs")
40
+ parser_run.add_argument("text",
41
+ help="Path to text file")
42
+ parser_run.add_argument("--config",
43
+ help="Path to config.ini file",
44
+ default=default_config_path)
45
+ parser_run.add_argument("--output-dir",
46
+ help="Path to save outputs to",
47
+ default=OUTPUT_DIR)
48
+ parser_run.set_defaults(func=run)
49
+
50
+ # eval
51
+ parser_eval = subparsers.add_parser("eval",
52
+ help="Auto-evaluate knowledge graph using GPT-4",
53
+ description="Auto-evaluate knowledge graph using GPT-4")
54
+ parser_eval.add_argument("output_zip",
55
+ help="Path to output.zip file containing knowledge graph")
56
+ parser_eval.set_defaults(func=validate)
57
+
58
+ args = parser.parse_args()
59
+ args.func(args)
60
+
61
+
62
+ def get_default_config(args):
63
+ """Print default config.ini"""
64
+ with open(default_config_path) as f:
65
+ default_config = f.read()
66
+
67
+ print(default_config)
68
+
69
+
70
+ def gui(args):
71
+ """Run Gradio interface"""
72
+ subprocess.run(["python", "-m", "words2wisdom.gui"])
73
+
74
+
75
+ def run(args):
76
+ """Text to KG pipeline"""
77
+ pipe = Pipeline.from_ini(args.config)
78
+
79
+ with open(args.text) as f:
80
+ batches, kg = pipe.run(f.read())
81
+
82
+ dump_all(pipe, batches, kg, to_path=args.output_dir)
83
+
84
+
85
+ def validate(args):
86
+ """Validate knowledge graph"""
87
+ validate_knowledge_graph(
88
+ llm=ChatOpenAI(
89
+ model="gpt-4-turbo-preview",
90
+ #openai_api_key=...
91
+ ),
92
+ output_zip=args.output_zip
93
+ )
94
+
95
+
96
+ if __name__ == "__main__":
97
+ main()
src/words2wisdom/config.py ADDED
@@ -0,0 +1,96 @@
1
+ import configparser
2
+ import ast
3
+
4
+
5
+ class Config:
6
+ def __init__(self, config_data):
7
+ self.config_data = config_data
8
+
9
+ def __getattr__(self, name):
10
+ return self.config_data.get(name, {})
11
+
12
+ def __setattr__(self, name, value):
13
+ if name == 'config_data':
14
+ super().__setattr__(name, value)
15
+ else:
16
+ self.config_data[name] = value
17
+
18
+ def __repr__(self):
19
+ return f"Config(\n{'pipeline':>12}: {self.pipeline}\n{'llm':>12}: {self.llm}\n)"
20
+
21
+ @classmethod
22
+ def read_ini(cls, file_path):
23
+
24
+ parser = configparser.ConfigParser()
25
+ parser.read(file_path)
26
+
27
+ return cls({"pipeline": cls._parse_pipeline_section(parser["pipeline"]),
28
+ "llm": cls._parse_llm_section(parser["llm"])})
29
+
30
+ @staticmethod
31
+ def _parse_llm_section(section):
32
+ parsed_data = {}
33
+ for key, value in section.items():
34
+ try:
35
+ parsed_data[key] = ast.literal_eval(value)
36
+ except ValueError:
37
+ parsed_data[key] = value
38
+
39
+ return parsed_data
40
+
41
+ @staticmethod
42
+ def _parse_pipeline_section(section):
43
+ eval_func = {
44
+ "words_per_batch": int,
45
+ "preprocess": lambda x: x.split(", ") if x.split(", ") != ["None"] else []
46
+ }
47
+ parsed_data = {}
48
+
49
+ for key, value in section.items():
50
+ parsed_data[key] = eval_func.get(key, str)(value)
51
+
52
+ return parsed_data
53
+
54
+
55
+ def serialize(self, save_path: str=None):
56
+ """Convert Config object to .ini file. If save_path is not specified, return string"""
57
+ serialized_config = ''
58
+
59
+ for section in self.config_data:
60
+ serialized_config += f"[{section}]\n"
61
+
62
+ for key, value in self.config_data[section].items():
63
+ # turn list back to str
64
+ if isinstance(value, list):
65
+ value = ", ".join(value)
66
+
67
+ # don't serialize the api key
68
+ if key == "openai_api_key":
69
+ value = None
70
+
71
+ serialized_config += f"{key} = {value}\n"
72
+
73
+ serialized_config += "\n"
74
+
75
+ if save_path:
76
+ with open(save_path, 'w') as f:
77
+ f.write(serialized_config)
78
+ else:
79
+ return serialized_config
80
+
81
+
82
+
83
+ if __name__ == "__main__":
84
+ # example usage
85
+ config_file = "/Users/johaunh/Documents/PhD/Projects/Text2KG/config/config.ini"
86
+ config = Config.read_ini(config_file)
87
+
88
+ # access pipeline parameters
89
+ print("Pipeline Parameters:")
90
+ for k, v in config.pipeline.items():
91
+ print(f"{k}: {v}")
92
+
93
+ # access LLM parameters
94
+ print("\nLLM Parameters:")
95
+ for k, v in config.llm.items():
96
+ print(f"{k}: {v}")
src/words2wisdom/gui.py ADDED
@@ -0,0 +1,86 @@
1
+ import os
2
+
3
+ import gradio as gr
4
+
5
+ from . import CONFIG_DIR, ROOT
6
+ from .config import Config
7
+ from .pipeline import Pipeline
8
+ from .utils import dump_all
9
+
10
+
11
+ example_file = os.path.join(ROOT, "demo", "example.txt")
12
+ example_text = "The quick brown fox jumps over the lazy dog. The cat sits on the mat."
13
+
14
+
15
+ def text2kg_from_string(openai_api_key: str, input_text: str):
16
+
17
+ config = Config.read_ini(os.path.join(CONFIG_DIR, "default_config.ini"))
18
+ config.llm["openai_api_key"] = openai_api_key
19
+
20
+ pipeline = Pipeline(config)
21
+ text_batches, knowledge_graph = pipeline.run(input_text)
22
+
23
+ zip_path = dump_all(pipeline, text_batches, knowledge_graph, to_path=".")
24
+
25
+ return knowledge_graph, zip_path
26
+
27
+
28
+ def text2kg_from_file(openai_api_key: str, input_file):
29
+ with open(input_file.name) as f:
30
+ input_text = f.read()
31
+
32
+ return text2kg_from_string(openai_api_key, input_text)
33
+
34
+
35
+ with gr.Blocks(title="Text2KG") as demo:
36
+ gr.Markdown("# πŸ§žπŸ“– Text2KG")
37
+
38
+ with gr.Column(variant="panel"):
39
+ openai_api_key = gr.Textbox(label="OpenAI API Key", placeholder="sk-...", type="password")
40
+
41
+ with gr.Row(equal_height=False):
42
+ with gr.Column(variant="panel"):
43
+ gr.Markdown("## Input (Text or Text File)")
44
+ #gr.Markdown("A knowledge graph will be generated for the provided text.")
45
+ with gr.Tab("Direct Input"):
46
+ text_string = gr.Textbox(lines=2, placeholder="Text Here...", label="Text")
47
+ submit_str = gr.Button()
48
+
49
+ with gr.Tab("File Upload"):
50
+ text_file = gr.File(file_types=["text"], label="Text File")
51
+ submit_file = gr.Button()
52
+
53
+ with gr.Column(variant="panel"):
54
+ gr.Markdown("## Output (ZIP Archive)")
55
+ #gr.Markdown("The ZIP contains the generated knowledge graph, the text batches (indexed), and a configuration file.")
56
+ output_zip = gr.File(label="ZIP")
57
+
58
+ with gr.Accordion(label="Preview of Knowledge Graph", open=False):
59
+ output_graph = gr.DataFrame(headers=["batch_id", "subject", "relation", "object"], label="Knowledge Graph")
60
+
61
+ with gr.Accordion(label="Examples", open=False):
62
+ gr.Markdown("### Text Example")
63
+ gr.Examples(
64
+ examples=[[None, example_text]],
65
+ inputs=[openai_api_key, text_string],
66
+ outputs=[output_graph, output_zip],
67
+ fn=text2kg_from_string,
68
+ preprocess=False,
69
+ postprocess=False
70
+ )
71
+
72
+ gr.Markdown("### File Example")
73
+ gr.Examples(
74
+ examples=[[None, example_file]],
75
+ inputs=[openai_api_key, text_file],
76
+ outputs=[output_graph, output_zip],
77
+ fn=text2kg_from_file,
78
+ preprocess=False,
79
+ postprocess=False
80
+ )
81
+
82
+ submit_str.click(fn=text2kg_from_string, inputs=[openai_api_key, text_string], outputs=[output_graph, output_zip])
83
+ submit_file.click(fn=text2kg_from_file, inputs=[openai_api_key, text_file], outputs=[output_graph, output_zip])
84
+
85
+
86
+ demo.launch(inbrowser=True, width="75%")
chains.py β†’ src/words2wisdom/output_parsers.py RENAMED
@@ -1,13 +1,7 @@
1
- from functools import partial
2
 
3
- import yaml
4
- from langchain.chains import LLMChain
5
- from langchain.output_parsers import NumberedListOutputParser
6
- from langchain.prompts import ChatPromptTemplate
7
-
8
-
9
- with open("./schema.yml") as f:
10
- schema = yaml.safe_load(f)
11
 
12
 
13
  class ClauseParser(NumberedListOutputParser):
@@ -32,16 +26,27 @@ class TripletParser(NumberedListOutputParser):
32
 
33
  def get_format_instructions(self) -> str:
34
  return super().get_format_instructions()
 
35
 
36
-
37
- llm_chains = {}
38
-
39
- for scheme in schema:
40
- parser = schema[scheme]["parser"]
41
- prompts = schema[scheme]["prompts"]
42
-
43
- llm_chains[scheme] = partial(
44
- LLMChain,
45
- output_parser=eval(f'{parser}()'),
46
- prompt=ChatPromptTemplate.from_messages(list(prompts.items()))
47
- )
1
+ import re
2
 
3
+ import pandas as pd
4
+ from langchain_core.output_parsers import NumberedListOutputParser, StrOutputParser
5
 
6
 
7
  class ClauseParser(NumberedListOutputParser):
 
26
 
27
  def get_format_instructions(self) -> str:
28
  return super().get_format_instructions()
29
+
30
 
31
+ class QuestionOutputParser(StrOutputParser):
32
+ def get_format_instructions(self) -> str:
33
+ return super().get_format_instructions()
34
+
35
+ def parse(self, text: str) -> pd.DataFrame:
36
+ """Parses the response to an array of answers/explanations."""
37
+ output = super().parse(text)
38
+ raw_list = re.findall(r'\d+\) (.*?)(?=\n\d+\)|\Z)', output, re.DOTALL)
39
+
40
+ raw_df = pd.DataFrame(raw_list).T
41
+
42
+ df = pd.DataFrame()
43
+
44
+ for idx in raw_df.columns:
45
+ # answer and explanation headers
46
+ ans_i = f"Q{idx+1}"
47
+ why_i = f"Q{idx+1}_explain"
48
+
49
+ # split response into answer/explanation columns
50
+ df[[ans_i, why_i]] = raw_df[idx].str.extract(r'(\d) \- (.*)')
51
+
52
+ return df
src/words2wisdom/pipeline.py ADDED
@@ -0,0 +1,183 @@
1
+ import re
2
+ from typing import List
3
+
4
+ import pandas as pd
5
+ from langchain_openai import ChatOpenAI
6
+ from langchain_core.output_parsers import StrOutputParser
7
+ from langchain_core.prompts import ChatPromptTemplate
8
+ from langchain_core.runnables import RunnablePassthrough
9
+ from nltk.tokenize import sent_tokenize
10
+
11
+ from . import MODULES_CONFIG, STOP_WORDS
12
+ from .config import Config
13
+ from .output_parsers import ClauseParser, TripletParser
14
+ from .utils import partition_sentences
15
+
16
+
17
+ # llm output parsers
18
+ PARSERS = {
19
+ "StrOutputParser": StrOutputParser(),
20
+ "ClauseParser": ClauseParser(),
21
+ "TripletParser": TripletParser()
22
+ }
23
+
24
+
25
+ class Module:
26
+ """Text2KG module class."""
27
+ def __init__(self, name: str) -> None:
28
+ self.name = name
29
+ self.parser = self.get_parser()
30
+ self.prompts = self.get_prompts()
31
+ self.type = self.get_module_type()
32
+
33
+ def __repr__(self):
34
+ return self.name.replace("_", " ").title().replace(" ", "") + "()"
35
+
36
+ def get_prompts(self):
37
+ return ChatPromptTemplate.from_messages(MODULES_CONFIG[self.name]["prompts"].items())
38
+
39
+ def get_parser(self):
40
+ return PARSERS.get(MODULES_CONFIG[self.name]["parser"], StrOutputParser())
41
+
42
+ def get_module_type(self):
43
+ return MODULES_CONFIG[self.name]["type"]
44
+
45
+
46
+ class Pipeline:
47
+ """Text2KG pipeline class."""
48
+
49
+ def __init__(self, config: Config):
50
+
51
+ self.config = config
52
+ self.initialize(config)
53
+
54
+
55
+ def __call__(self, text: str, clean: bool=True) -> pd.DataFrame:
56
+ return self.run(text, clean)
57
+
58
+
59
+ def __repr__(self) -> str:
60
+ return f"Text2KG(\n\tconfig.pipeline={self.config.pipeline}\n\tconfig.llm={self.config.llm}\n)"
61
+
62
+
63
+ def __str__(self) -> str:
64
+ return ("[INPUT: text] -> "
65
+ + " -> ".join([str(m) for m in self.modules])
66
+ + " -> [OUTPUT: knowledge graph]")
67
+
68
+
69
+ @classmethod
70
+ def from_ini(cls, config_path: str):
71
+ return cls(Config.read_ini(config_path))
72
+
73
+
74
+ def initialize(self, config: Config):
75
+ """Initialize Text2KG pipeline from config."""
76
+
77
+ # validate preprocess
78
+ preprocess_modules = [Module(name) for name in config.pipeline["preprocess"]]
79
+
80
+ for item in preprocess_modules:
81
+ if item.get_module_type() != "preprocess":
82
+ raise ValueError(f"Expected preprocess step `{item.name}` to"
83
+ f" have module type='preprocess'. Consider reviewing"
84
+ f" schema.yml")
85
+
86
+ # validate extraction process
87
+ extraction_module = Module(config.pipeline["extraction"])
88
+
89
+ if extraction_module.get_module_type() != "extraction":
90
+ raise ValueError(f"Expected `{extraction_module.name}` to have module"
91
+ f" type='extraction'. Consider reviewing schema.yml")
92
+
93
+ # combine preprocess + extraction
94
+ self.modules = preprocess_modules + [extraction_module]
95
+
96
+ # init prompts & parsers
97
+ prompts = [m.get_prompts() for m in self.modules]
98
+ parsers = [m.get_parser() for m in self.modules]
99
+
100
+ # init llm
101
+ llm = ChatOpenAI(**self.config.llm)
102
+
103
+ # init chains
104
+ chains = [(prompt | llm | parser)
105
+ for prompt, parser in zip(prompts, parsers)]
106
+
107
+ # stitch chains together
108
+ self.pipeline = {"text": RunnablePassthrough()} | chains[0]
109
+ for i in range(1, len(chains)):
110
+ self.pipeline = {"text": self.pipeline} | chains[i]
111
+
112
+ # print pipeline
113
+ print("Initialized Text2KG pipeline:")
114
+ print(str(self))
115
+
116
+
117
+ def run(self, text: str, clean=True) -> tuple[List[str], pd.DataFrame]:
118
+ """Run Text2KG pipeline on passed text.
119
+
120
+ Args:
121
+ text (str): The input text
122
+ clean (bool): Whether to clean the raw KG or not
123
+
124
+ Returns:
125
+ text_batches (list): Batched text
126
+ knowledge_graph (DataFrame): A dataframe containing the extracted KG triplets,
127
+ indexed by batch
128
+ """
129
+ print("Running Text2KG pipeline:")
130
+ # split text into batches
131
+ text_batches = list(partition_sentences(
132
+ sentences=sent_tokenize(text),
133
+ min_words=self.config.pipeline["words_per_batch"]
134
+ ))
135
+
136
+ # run pipeline in parallel; convert to dataframe
137
+ print("Extracting knowledge graph...", end=' ')
138
+ output = self.pipeline.batch(text_batches)
139
+
140
+ knowledge_graph = pd.DataFrame([{'batch_id': i, **triplet}
141
+ for i, batch in enumerate(output)
142
+ for triplet in batch])
143
+
144
+ if clean:
145
+ knowledge_graph = self._clean(knowledge_graph)
146
+
147
+ print("Done!", end='\n')
148
+
149
+ return text_batches, knowledge_graph
150
+
151
+
152
+ def _clean(self, kg: pd.DataFrame) -> pd.DataFrame:
153
+ """Text2KG post-processing."""
154
+ print("Cleaning knowledge graph components...", end=' ')
155
+ drop_list = []
156
+
157
+ for i, row in kg.iterrows():
158
+ # drop stopwords (e.g. pronouns)
159
+ if (row.subject in STOP_WORDS) or (row.object in STOP_WORDS):
160
+ drop_list.append(i)
161
+
162
+ # drop broken triplets
163
+ elif row.hasnans:
164
+ drop_list.append(i)
165
+
166
+ # lowercase nodes/edges, drop articles
167
+ else:
168
+ article_pattern = r'^(the|a|an) (.+)'
169
+ be_pattern = r'^(are|is) (a |an )?(.+)'
170
+
171
+ kg.at[i, "subject"] = re.sub(article_pattern, r'\2', row.subject.lower())
172
+ kg.at[i, "relation"] = re.sub(be_pattern, r'\3', row.relation.lower())
173
+ kg.at[i, "object"] = re.sub(article_pattern, r'\2', row.object.lower()).strip('.')
174
+
175
+ return kg.drop(drop_list)
176
+
177
+
178
+ def _normalize(self):
179
+ """Unused."""
180
+ return
181
+
182
+ def serialize(self):
183
+ return self.config.serialize()
src/words2wisdom/utils.py ADDED
@@ -0,0 +1,62 @@
1
+ import os
2
+ from datetime import datetime
3
+ from typing import List
4
+ from zipfile import ZipFile
5
+
6
+ import pandas as pd
7
+
8
+
9
+ def partition_sentences(sentences: List[str], min_words: int):
10
+ current_batch = []
11
+ word_count = 0
12
+
13
+ for sentence in sentences:
14
+ # count the number of words in the sentence
15
+ word_count += len(sentence.split())
16
+
17
+ # add sentence to the current batch
18
+ current_batch.append(sentence)
19
+
20
+ # if the word count exceeds or equals the minimum threshold, yield the current batch
21
+ if word_count >= min_words:
22
+ yield " ".join(current_batch)
23
+ current_batch = [] # reset the batch
24
+ word_count = 0 # reset the word count
25
+
26
+ # yield the remaining batch if it's not empty
27
+ if current_batch:
28
+ yield " ".join(current_batch)
29
+
30
+
31
+ def dump_all(pipeline, text_batches: List[str], knowledge_graph: pd.DataFrame, to_path: str="."):
32
+ """Save all items to ZIP."""
33
+ # metadata
34
+ date = str(datetime.now().date())
35
+
36
+ # convert batches to df
37
+ batches_df = pd.DataFrame(text_batches, columns=["text"])
38
+
39
+ # date + hex id for local saving
40
+ num = 0
41
+ while True:
42
+ hex_num = format(num, 'X').zfill(3)
43
+ filename = f"output-{date}-{hex_num}.zip"
44
+ zip_path = os.path.join(to_path, filename)
45
+
46
+ if os.path.exists(zip_path):
47
+ num += 1
48
+ else:
49
+ break
50
+
51
+ print(f"Run ID: {date}-{hex_num}")
52
+ os.makedirs(to_path, exist_ok=True)
53
+
54
+ # create ZIP file
55
+ with ZipFile(zip_path, 'w') as zipObj:
56
+ zipObj.writestr("config.ini", pipeline.serialize())
57
+ zipObj.writestr("text_batches.csv", batches_df.to_csv(index_label="batch_id"))
58
+ zipObj.writestr("kg.csv", knowledge_graph.to_csv(index=False))
59
+
60
+ print(f"Saved data to {zip_path}")
61
+
62
+ return zip_path
src/words2wisdom/validate.py ADDED
@@ -0,0 +1,179 @@
1
+ import os
2
+ import re
3
+ from time import time
4
+ from argparse import ArgumentParser
5
+ from typing import List
6
+ from zipfile import ZipFile
7
+
8
+ import pandas as pd
9
+ from langchain_openai import ChatOpenAI
10
+ from langchain_core.prompts import ChatPromptTemplate
11
+
12
+ from . import VALIDATION_CONFIG, OUTPUT_DIR
13
+ from .output_parsers import QuestionOutputParser
14
+
15
+
16
+ def parse_args():
17
+ parser = ArgumentParser()
18
+ parser.add_argument(
19
+ "run_ids", nargs="+", help="Run IDs to evaluate. Format: YYYY-MM-DD-XXX"
20
+ )
21
+ parser.add_argument(
22
+ "--search_dir", help="Directory to search for output", default=OUTPUT_DIR
23
+ )
24
+ return parser.parse_args()
25
+
26
+
27
+ def format_system_prompt():
28
+ """Format instructional prompt from instructions.yml"""
29
+
30
+ def format_question(question: dict):
31
+ # Question format: {text} {additional} {options}
32
+ # Ex. Can pigs fly? Explain. (Yes/No)
33
+ formatted = (
34
+ question["title"]
35
+ + " "
36
+ + question["text"]
37
+ + " "
38
+ + (question["additional"] + " " if question["additional"] else "")
39
+ + "("
40
+ + ";".join(question["options"])
41
+ + ")\n"
42
+ )
43
+ return formatted
44
+
45
+ def format_example(example: dict):
46
+ formatted = (
47
+ f"PASSAGE: {example['passage']}\n"
48
+ f"TRIPLET: {example['triplet']}\n\n"
49
+ + "".join([
50
+ f"{i}) {answer}\n"
51
+ for i, answer in enumerate(example["answers"].values(), start=1)
52
+ ])
53
+ )
54
+ return formatted
55
+
56
+ instruction = VALIDATION_CONFIG["instruction"]
57
+ questions = "".join([
58
+ f"{i}) {format_question(q)}"
59
+ for i, q in enumerate(VALIDATION_CONFIG["questions"].values(), start=1)
60
+ ])
61
+ example = format_example(VALIDATION_CONFIG["example"])
62
+
63
+ system_prompt = (
64
+ f"{instruction}\n" f"QUESTIONS:\n{questions}\n" f"[* EXAMPLE *]\n\n{example}"
65
+ )
66
+
67
+ return system_prompt
68
+
69
+
70
+ def validate_triplets(
71
+ llm: ChatOpenAI, instruction: str, passage: str, triplets: List[List[str]]
72
+ ) -> List[pd.DataFrame]:
73
+ """Validate triplets with respect to passage."""
74
+
75
+ print(
76
+ f"Validating {len(triplets):>2} triplet{'s' if len(triplets) else ''}...",
77
+ end=" ",
78
+ )
79
+
80
+ prompt = ChatPromptTemplate.from_messages([
81
+ ("system", "{instruction}"),
82
+ ("user", "PASSAGE: {passage}\n\nTRIPLET: {triplet}"),
83
+ ])
84
+
85
+ chain = prompt | llm | QuestionOutputParser()
86
+
87
+ output = chain.batch([
88
+ {"instruction": instruction, "passage": passage, "triplet": triplet}
89
+ for triplet in triplets
90
+ ])
91
+
92
+ print("Done!", end="\n")
93
+ return output
94
+
95
+
96
+ def validate_knowledge_graph(llm: ChatOpenAI, output_zip: str):
97
+ """Validate all triplets in a knowledge graph."""
98
+
99
+ run_id = re.findall(r"output-(.*)\.zip", output_zip)[0]
100
+ run_dir = os.path.dirname(output_zip)
101
+
102
+ # read output zip
103
+ with ZipFile(output_zip) as z:
104
+ # load knowledge graph
105
+ with z.open("kg.csv") as f:
106
+ graph = pd.read_csv(f)
107
+
108
+ # load text batches
109
+ with z.open("text_batches.csv") as f:
110
+ text_batches = pd.read_csv(f)
111
+
112
+ print("Initializing knowledge graph validation. Run:", run_id)
113
+ print()
114
+
115
+ # start stopwatch
116
+ start = time()
117
+
118
+ # container for evaluation responses
119
+ responses = []
120
+
121
+ # instructions
122
+ instruction = format_system_prompt()
123
+
124
+ # triplets are batched by passage
125
+ # so we iterate over passages
126
+ for idx, passage in enumerate(text_batches.text):
127
+ triplets = (
128
+ graph[graph["batch_id"] == idx].drop(columns=["batch_id"]).values.tolist()
129
+ )
130
+
131
+ print(f"Starting excerpt {idx + 1:>2} of {len(text_batches)}.", end=" ")
132
+
133
+ # if batch has no triplets to validate, skip batch
134
+ if len(triplets) == 0:
135
+ print("Excerpt has no triplets to validate.", end="\n")
136
+ continue
137
+
138
+ # validate triplets by batch
139
+ response = validate_triplets(
140
+ llm=llm,
141
+ instruction=instruction,
142
+ passage=passage,
143
+ triplets=triplets
144
+ )
145
+ responses.extend(response)
146
+
147
+ validation = pd.concat(responses, ignore_index=True)
148
+
149
+ # merge with knowlege graph data
150
+ validation_merged = (
151
+ text_batches.merge(graph)
152
+ .drop(columns=["batch_id"])
153
+ .merge(validation, left_index=True, right_index=True)
154
+ )
155
+
156
+ savepath = os.path.join(run_dir, f"validation-{run_id}.csv")
157
+ validation_merged.to_csv(savepath, index=False)
158
+ # stop stopwatch
159
+ end = time()
160
+
161
+ print("\nKnowledge graph validation complete!")
162
+ print(f"It took {end - start:0.3f} seconds to validate {len(validation)} triplets.")
163
+ print("Saved to:", savepath)
164
+
165
+ return savepath
166
+
167
+
168
+ if __name__ == "__main__":
169
+ args = parse_args()
170
+
171
+ llm = ChatOpenAI(
172
+ model="gpt-4-turbo-preview",
173
+ openai_api_key=os.getenv("OPENAI_API_KEY")
174
+ )
175
+
176
+ for run_id in args.run_ids:
177
+ zipfile = os.path.join(args.search_dir, f"output-{run_id}.zip")
178
+ validate_knowledge_graph(llm, run_id)
179
+ print("* " * 25)
utils.py DELETED
@@ -1,31 +0,0 @@
1
- import numpy as np
2
- from typing import Callable
3
-
4
- from sklearn.cluster import AgglomerativeClustering
5
-
6
-
7
- def condense_labels(labels: np.ndarray, embedding_func: Callable, threshold: float=0.5):
8
- """Combine cosine-similar labels under same name."""
9
-
10
- embeddings = np.array(embedding_func(labels))
11
-
12
- clustering = AgglomerativeClustering(
13
- n_clusters=None,
14
- distance_threshold=threshold
15
- ).fit(embeddings)
16
-
17
- clusters = [np.where(clustering.labels_ == l)[0]
18
- for l in range(clustering.n_clusters_)]
19
-
20
- clusters_reduced = []
21
-
22
- for c in clusters:
23
- embs = embeddings[c]
24
- centroid = np.mean(embs)
25
-
26
- idx = c[np.argmin(np.linalg.norm(embs - centroid, axis=1))]
27
- clusters_reduced.append(idx)
28
-
29
- old2new = {old_id: new_id for old_ids, new_id in zip(clusters, clusters_reduced) for old_id in old_ids}
30
-
31
- return {labels[i]: labels[j] for i, j in old2new.items()}
writeup/words2wisdom_short.pdf ADDED
Binary file (168 kB). View file