ManjinderUNCC committed
Commit • 15d639e
1 Parent(s): 73b33ce
Upload 11 files
Browse files
- README.md +101 -12
- gradio_interface.py +43 -0
- project.yml +278 -0
- python_Code/evaluate_model.py +52 -0
- python_Code/finalStep-formatLabel.py +53 -0
- python_Code/firstStep-format.py +21 -0
- python_Code/five_examples_annotated.ipynb +100 -0
- python_Code/secondStep-score.py +52 -0
- python_Code/thirdStep-label.py +23 -0
- requirements-dev.txt +1 -0
- requirements.txt +92 -0
README.md
CHANGED
@@ -1,12 +1,101 @@
# prodigy-ecfr-textcat

## About the Project

Our goal is to organize financial institution rules and regulations so that financial institutions can review newly issued rules, route them to the departments they affect, and retrieve them easily when needed. Text mining and information retrieval allow a large part of this process to be automated, reducing the time and effort required of financial institution employees and freeing them up for other projects.

## Table of Contents

- [About the Project](#about-the-project)
- [Getting Started](#getting-started)
- [Prerequisites](#prerequisites)
- [Installation](#installation)
- [Usage](#usage)
- [File Structure](#file-structure)
- [License](#license)
- [Acknowledgements](#acknowledgements)

## Getting Started

Instructions for setting up the project on a local machine.

### Prerequisites

Before running the project, ensure you have the following software dependencies installed:

- [Python 3.x](https://www.python.org/downloads/)
- [spaCy](https://spacy.io/usage)
- [Prodigy](https://prodi.gy/docs/) (optional)

### Installation

Follow these step-by-step instructions to install and configure the project:

1. **Clone this repository to your local machine.**

   ```bash
   git clone https://github.com/ManjinderSinghSandhu/prodigy-ecfr-textcat.git
   ```

2. Install the required dependencies by running:

   ```bash
   pip install -r requirements.txt
   ```

3. Optionally, install Prodigy (it is not required for this project, but you need a license key to use it):

   ```bash
   python -m pip install prodigy==1.15.2 --extra-index-url https://$PRODIGY_KEY@download.prodi.gy
   ```

   This assumes you previously set up your `PRODIGY_KEY` as an environment variable:

   ```bash
   export PRODIGY_KEY=1111-1111-1111-1111
   ```

## Usage

To use the project, follow these steps:

1. **Prepare your data:**
   - Place your dataset files in the `/data` directory.
   - Optionally, annotate your data using Prodigy and save the annotations in the `/data` directory.

2. **Train the text classification model:**
   - Run the training script located in the `/python_Code` directory.

3. **Evaluate the model:**
   - Use the evaluation script to assess the model's performance on labeled data.

4. **Make predictions:**
   - Apply the trained model to new, unlabeled data to classify it into relevant categories (see the sketch below).
71 |
+
|
72 |
+
|
73 |
+
## File Structure
|
74 |
+
|
75 |
+
Describe the organization of files and directories within the project.
|
76 |
+
|
77 |
+
- `/data`
|
78 |
+
- `five_examples_annotated5.jsonl`
|
79 |
+
- `goldenEval.jsonl`
|
80 |
+
- `train.jsonl`
|
81 |
+
- `train200.jsonl`
|
82 |
+
- `/python_Code`
|
83 |
+
- `finalStep-formatLabel.py`
|
84 |
+
- `firstStep-format.py`
|
85 |
+
- `five_examples_annotated.ipynb`
|
86 |
+
- `secondStep-score.py`
|
87 |
+
- `thirdStep-label.py`
|
88 |
+
- `requirements.txt`
|
89 |
+
- `requirements-dev.txt`
|
90 |
+
- `Project.yml`
|
91 |
+
- `README.md`
|
92 |
+
|
93 |
+
## License
|
94 |
+
|
95 |
+
- Package A: MIT License
|
96 |
+
- Package B: Apache License 2.0
|
97 |
+
|
98 |
+
## Acknowledgements
|
99 |
+
|
100 |
+
Manjinder Sandhu, Dagim Bantikassegn, Alex Brooks, Tyler Dabbs
|
101 |
+
|
gradio_interface.py
ADDED
@@ -0,0 +1,43 @@
# Import necessary libraries
import gradio as gr
import spacy

# Load the trained spaCy model
model_path = "./my_trained_model"
nlp = spacy.load(model_path)

# Function to classify text
def classify_text(text):
    doc = nlp(text)
    predicted_labels = doc.cats
    return predicted_labels

# Function to save results to a file
def save_to_file(text, predicted_labels):
    with open("classification_results.txt", "w") as f:
        f.write("Text: {}\n\n".format(text))
        for label, score in predicted_labels.items():
            f.write("{}: {}\n".format(label, score))

# Gradio Interface
# Note: requirements.txt pins gradio==4.29.0, which removed the old
# gr.inputs/gr.outputs modules, so the components are used directly.
inputs = [
    gr.Textbox(lines=7, label="Enter your text"),
    gr.File(label="Upload a file", type="filepath"),
]

output = gr.Textbox(label="Classification Results")

def classify_and_save(input_text, input_file):
    if input_text:
        text = input_text
    elif input_file:
        # Process the file and extract text (input_file is a path string)
        with open(input_file, "r") as f:
            text = f.read()
    else:
        # Guard against the case where neither input was provided
        return "Please enter text or upload a file."

    predicted_labels = classify_text(text)
    save_to_file(text, predicted_labels)
    return predicted_labels

iface = gr.Interface(fn=classify_and_save, inputs=inputs, outputs=output, title="Text Classifier")
iface.launch(share=True)
project.yml
ADDED
@@ -0,0 +1,278 @@
# Title and description of the project
title: "Citations of ECFR Banking Regulation in a spaCy pipeline."
description: "Custom text classification project for spaCy v3 adapted from the spaCy v3"

vars:
  lang: "en"
  train: corpus/train.spacy
  dev: corpus/dev.spacy
  version: "0.1.0"
  gpu_id: -1
  vectors_model: "en_core_web_lg"
  name: ecfr_ner
  prodigy:
    ner_labels: ecfr_initial_ner
    ner_manual_labels: ecfr_manual_ner
    senter_labels: ecfr_labeled_sents
    ner_labeled_dataset: ecfr_labeled_ner
  assets:
    ner_labels: assets/ecfr_ner_labels.jsonl
    senter_labels: assets/ecfr_senter_labels.jsonl
    ner_patterns: assets/patterns.jsonl
  corpus_labels: corpus/labels
  data_files: data
  trained_model: my_trained_model
  trained_model_textcat: my_trained_model/textcat_multilabel
  output_models: output
  python_code: python_Code

directories: ["data", "python_Code"]

assets:
  - dest: "data/firstStep_file.jsonl"
    description: "JSONL file containing formatted data from the first step"
  - dest: "data/five_examples_annotated5.jsonl"
    description: "JSONL file containing five annotated examples"
  - dest: "data/goldenEval.jsonl"
    description: "JSONL file containing golden evaluation data"
  - dest: "data/thirdStep_file.jsonl"
    description: "JSONL file containing classified data from the third step"
  - dest: "data/train.jsonl"
    description: "JSONL file containing training data"
  - dest: "data/train200.jsonl"
    description: "JSONL file containing initial training data"
  - dest: "data/train4465.jsonl"
    description: "JSONL file containing formatted and labeled training data"
  - dest: "python_Code/finalStep-formatLabel.py"
    description: "Python script for formatting labeled data in the final step"
  - dest: "python_Code/firstStep-format.py"
    description: "Python script for formatting data in the first step"
  - dest: "python_Code/five_examples_annotated.ipynb"
    description: "Jupyter notebook containing five annotated examples"
  - dest: "python_Code/secondStep-score.py"
    description: "Python script for scoring data in the second step"
  - dest: "python_Code/thirdStep-label.py"
    description: "Python script for labeling data in the third step"
  - dest: "python_Code/train_eval_split.ipynb"
    description: "Jupyter notebook for training and evaluation data splitting"
  - dest: "TerminalCode.txt"
    description: "Text file containing terminal code"
  - dest: "README.md"
    description: "Markdown file containing project documentation"
  - dest: "prodigy.json"
    description: "JSON file containing Prodigy configuration"

workflows:
  train:
    - preprocess
    - train-text-classification-model
    - classify-unlabeled-data
    - format-labeled-data
    # - review-evaluation-data
    # - export-reviewed-evaluation-data
    # - import-training-data
    # - import-golden-evaluation-data
    # - train-model-experiment1
    # - convert-data-to-spacy-format

commands:
  - name: "preprocess"
    help: |
      Execute the Python script `firstStep-format.py`, which performs the initial formatting of a dataset file for the first step of the project. This script extracts text and labels from a dataset file in JSONL format and writes them to a new JSONL file in a specific format.

      Usage:
      ```
      spacy project run preprocess
      ```

      Explanation:
      - The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
      - It extracts text and labels from each JSON object in the dataset file.
      - If both the text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and label.
      - If either the text or the label is missing in a JSON object, a warning message is printed.
      - Upon completion, the script prints a message confirming the processing and the path to the output file.
    script:
      - "python3 python_Code/firstStep-format.py"

  - name: "train-text-classification-model"
    help: |
      Train the text classification model for the second step of the project using the `secondStep-score.py` script. This script loads a blank English spaCy model and adds a text classification pipeline to it. It then trains the model using the processed data from the first step.

      Usage:
      ```
      spacy project run train-text-classification-model
      ```

      Explanation:
      - The script `secondStep-score.py` loads a blank English spaCy model and adds a text classification pipeline to it.
      - It reads processed data from the file specified in the `processed_data_file` variable (`data/firstStep_file.jsonl` by default).
      - The processed data is converted to spaCy format for training the model.
      - The model is trained using the converted data for a specified number of iterations (`n_iter`).
      - Losses are printed for each iteration during training.
      - Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
    script:
      - "python3 python_Code/secondStep-score.py"

  - name: "classify-unlabeled-data"
    help: |
      Classify the unlabeled data for the third step of the project using the `thirdStep-label.py` script. This script loads the trained spaCy model from the previous step and classifies each record in the unlabeled dataset.

      Usage:
      ```
      spacy project run classify-unlabeled-data
      ```

      Explanation:
      - The script `thirdStep-label.py` loads the trained spaCy model from the specified model directory (`./my_trained_model` by default).
      - It reads the unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/train.jsonl` by default).
      - Each record in the unlabeled data is classified using the loaded model.
      - The predicted labels for each record are extracted and stored along with the text.
      - The classified data is optionally saved to a file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
    script:
      - "python3 python_Code/thirdStep-label.py"

  - name: "format-labeled-data"
    help: |
      Format the labeled data for the final step of the project using the `finalStep-formatLabel.py` script. This script processes the classified data from the third step and transforms it into a specific format, considering a threshold for label acceptance.

      Usage:
      ```
      spacy project run format-labeled-data
      ```

      Explanation:
      - The script `finalStep-formatLabel.py` reads classified data from the file specified in the `input_file` variable (`data/thirdStep_file.jsonl` by default).
      - For each record, it determines accepted categories based on a specified threshold.
      - It constructs an output record containing the text, predicted labels, accepted categories, answer (accept/reject), and options with meta information.
      - The transformed data is written to the file specified in the `output_file` variable (`data/train4465.jsonl` by default).
    script:
      - "python3 python_Code/finalStep-formatLabel.py"

  # - name: "review-evaluation-data"
  #   help: |
  #     Review the evaluation data in Prodigy and automatically accept annotations.
  #
  #     Usage:
  #     ```
  #     spacy project run review-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command reviews the evaluation data in Prodigy.
  #     - It automatically accepts annotations made during the review process.
  #     - Only sessions allowed by the environment variable PRODIGY_ALLOWED_SESSIONS are permitted to review data. In this case, the session 'reviwer' is allowed.
  #   script:
  #     - "PRODIGY_ALLOWED_SESSIONS=reviwer python3 -m prodigy review project3eval-review project3eval --auto-accept"

  # - name: "export-reviewed-evaluation-data"
  #   help: |
  #     Export the reviewed evaluation data from Prodigy to a JSONL file named 'goldenEval.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run export-reviewed-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command exports the reviewed evaluation data from Prodigy to a JSONL file.
  #     - The data is exported from the Prodigy database associated with the project named 'project3eval-review'.
  #     - The exported data is saved to the file 'goldenEval.jsonl'.
  #     - This command helps in preserving the reviewed annotations for further analysis or processing.
  #   script:
  #     - "prodigy db-out project3eval-review > goldenEval.jsonl"

  # - name: "import-training-data"
  #   help: |
  #     Import the training data into Prodigy from a JSONL file named 'train200.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run import-training-data
  #     ```
  #
  #     Explanation:
  #     - The command imports the training data into Prodigy from the specified JSONL file.
  #     - The data is imported into the Prodigy database associated with the project named 'prodigy3train'.
  #     - This command prepares the training data for annotation and model training in Prodigy.
  #   script:
  #     - "prodigy db-in prodigy3train train200.jsonl"

  # - name: "import-golden-evaluation-data"
  #   help: |
  #     Import the golden evaluation data into Prodigy from a JSONL file named 'goldeneval.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run import-golden-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command imports the golden evaluation data into Prodigy from the specified JSONL file.
  #     - The data is imported into the Prodigy database associated with the project named 'golden3'.
  #     - This command prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
  #   script:
  #     - "prodigy db-in golden3 goldeneval.jsonl"

  # - name: "train-model-experiment1"
  #   help: |
  #     Train a text classification model using Prodigy, training on the 'prodigy3train' dataset and evaluating on 'golden3'.
  #
  #     Usage:
  #     ```
  #     spacy project run train-model-experiment1
  #     ```
  #
  #     Explanation:
  #     - The command trains a text classification model using Prodigy.
  #     - It uses the 'prodigy3train' dataset for training and evaluates the model on the 'golden3' dataset.
  #     - The trained model is saved to the './output/experiment1' directory.
  #   script:
  #     - "python3 -m prodigy train --textcat-multilabel prodigy3train,eval:golden3 ./output/experiment1"

  # - name: "download-model"
  #   help: |
  #     Download the English language model 'en_core_web_lg' from spaCy.
  #
  #     Usage:
  #     ```
  #     spacy project run download-model
  #     ```
  #
  #     Explanation:
  #     - The command downloads the English language model 'en_core_web_lg' from spaCy.
  #     - This model is used as the base model for further data processing and training in the project.
  #   script:
  #     - "python3 -m spacy download en_core_web_lg"

  # - name: "convert-data-to-spacy-format"
  #   help: |
  #     Convert the annotated data from Prodigy to spaCy format using the 'prodigy3train' and 'golden3' datasets.
  #
  #     Usage:
  #     ```
  #     spacy project run convert-data-to-spacy-format
  #     ```
  #
  #     Explanation:
  #     - The command converts the annotated data from Prodigy to spaCy format.
  #     - It uses the 'prodigy3train' and 'golden3' datasets for conversion.
  #     - The converted data is saved to the './corpus' directory with the base model 'en_core_web_lg'.
  #   script:
  #     - "python3 -m prodigy data-to-spacy --textcat-multilabel prodigy3train,eval:golden3 ./corpus --base-model en_core_web_lg"

  # - name: "train-custom-model"
  #   help: |
  #     Train a custom text classification model using spaCy with the converted data in spaCy format.
  #
  #     Usage:
  #     ```
  #     spacy project run train-custom-model
  #     ```
  #
  #     Explanation:
  #     - The command trains a custom text classification model using spaCy.
  #     - It uses the converted data in spaCy format located in the './corpus' directory.
  #     - The model is trained using the configuration defined in 'corpus/config.cfg'.
  #   script:
  #     - "python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
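The four active steps in the `train` workflow can also be driven from Python. A minimal sketch, assuming this `project.yml` sits in the working directory and spaCy is installed:

```python
# Run each step of the "train" workflow via spaCy's project CLI.
# Equivalent to running `python3 -m spacy project run <step>` from the shell.
import subprocess

steps = [
    "preprocess",
    "train-text-classification-model",
    "classify-unlabeled-data",
    "format-labeled-data",
]
for step in steps:
    subprocess.run(["python3", "-m", "spacy", "project", "run", step], check=True)
```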
python_Code/evaluate_model.py
ADDED
@@ -0,0 +1,52 @@
import jsonlines
import spacy
from sklearn.metrics import classification_report, accuracy_score, f1_score, precision_score, recall_score

# Load the trained spaCy model
model_path = "./my_trained_model"
nlp = spacy.load(model_path)

# Load the golden evaluation data
golden_eval_data = []
with jsonlines.open("data/goldenEval.jsonl") as reader:
    for record in reader:
        golden_eval_data.append(record)

# New threshold for considering a label
new_threshold = 0.21  # Change this to your desired threshold value

# Predict labels for each record using your model with the new threshold
predicted_labels = []
for record in golden_eval_data:
    text = record["text"]
    doc = nlp(text)
    # Apply the new threshold to the predicted labels
    filtered_labels = {label: score for label, score in doc.cats.items() if score > new_threshold}
    predicted_labels.append(filtered_labels)

# Extract ground truth labels from the golden evaluation data
true_labels = [record["accept"] for record in golden_eval_data]

# Convert label format to match sklearn's classification report format.
# Note: this flattens the multilabel data to a single label per record
# (the first accepted gold label, and the highest-scoring prediction).
true_labels_flat = [label[0] if label else "reject" for label in true_labels]
predicted_labels_flat = [max(pred, key=pred.get) if pred else "reject" for pred in predicted_labels]

# Calculate evaluation metrics
accuracy = accuracy_score(true_labels_flat, predicted_labels_flat)
precision = precision_score(true_labels_flat, predicted_labels_flat, average='weighted')
recall = recall_score(true_labels_flat, predicted_labels_flat, average='weighted')
f1 = f1_score(true_labels_flat, predicted_labels_flat, average='weighted')

# Additional classification report
report = classification_report(true_labels_flat, predicted_labels_flat)

# Print or save the evaluation metrics
print("Evaluation Metrics:")
print(f"Accuracy: {accuracy}")
print(f"Precision: {precision}")
print(f"Recall: {recall}")
print(f"F1-Score: {f1}")

# Print or save the detailed classification report
print("Detailed Classification Report:")
print(report)
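A worked example of the flattening rule used above, with made-up scores for illustration:

```python
# A multilabel prediction is reduced to its single highest-scoring label,
# and an empty prediction (nothing above the threshold) becomes "reject".
pred = {"RiskManagement": 0.58, "ConsumerProtection": 0.38}
flat = max(pred, key=pred.get) if pred else "reject"
assert flat == "RiskManagement"

empty = {}
assert (max(empty, key=empty.get) if empty else "reject") == "reject"
```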
python_Code/finalStep-formatLabel.py
ADDED
@@ -0,0 +1,53 @@
import jsonlines

# Input file containing classified data
input_file = "data/thirdStep_file.jsonl"

# Output file to store transformed data
output_file = "data/Full-Labeled-Data-Final-4465.jsonl"

# Threshold for considering a label
threshold = 0.5

# Options for different categories
options = [
    {"id": "CapitalRequirements", "text": "Capital Requirements", "meta": "0.00"},
    {"id": "ConsumerProtection", "text": "Consumer Protection", "meta": "0.00"},
    {"id": "RiskManagement", "text": "Risk Management", "meta": "0.00"},
    {"id": "ReportingAndCompliance", "text": "Reporting And Compliance", "meta": "0.00"},
    {"id": "CorporateGovernance", "text": "Corporate Governance", "meta": "0.00"}
]

# Function to process each record
def process_record(record):
    # Extract text and predicted labels
    text = record["text"]
    predicted_labels = record["predicted_labels"]

    # Determine accepted categories based on threshold
    accepted_categories = [label for label, score in predicted_labels.items() if score > threshold]

    # Determine answer based on accepted categories
    answer = "accept" if accepted_categories else "reject"

    # Prepare options with meta
    options_with_meta = [
        {"id": option["id"], "text": option["text"], "meta": option["meta"]} for option in options
    ]

    # Construct the output record
    output_record = {
        "text": text,
        "cats": predicted_labels,
        "accept": accepted_categories,
        "answer": answer,
        "options": options_with_meta
    }

    return output_record

# Process input file and write transformed data to output file
with jsonlines.open(input_file, "r") as infile, jsonlines.open(output_file, "w") as outfile:
    for record in infile:
        output_record = process_record(record)
        outfile.write(output_record)
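A worked example of the threshold rule in `process_record`, using hypothetical scores:

```python
# Labels scoring above 0.5 are accepted; a record with no accepted label is rejected.
threshold = 0.5
predicted = {"RiskManagement": 0.72, "ConsumerProtection": 0.31}
accepted = [label for label, score in predicted.items() if score > threshold]
answer = "accept" if accepted else "reject"
print(accepted, answer)  # ['RiskManagement'] accept
```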
python_Code/firstStep-format.py
ADDED
@@ -0,0 +1,21 @@
import jsonlines

# Path to your dataset file
dataset_file = "data/train200.jsonl"

# Path to the output file
output_file = "data/firstStep_file.jsonl"

# Open the JSONL file and extract text and labels
try:
    with jsonlines.open(dataset_file) as reader, jsonlines.open(output_file, mode='w') as writer:
        for obj in reader:
            text = obj.get("text")
            accepted = obj.get("accept") or []
            label = accepted[0] if accepted else None  # First accepted label, if any
            if text and label:
                writer.write({"text": text, "label": label})
            else:
                print("Warning: Text or label missing in the JSON object.")
    print("Processing completed. Output written to:", output_file)
except Exception as e:
    print("Error:", e)
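The record shapes this script assumes, shown on a hypothetical example (a Prodigy-style export line in, a simplified text/label pair out):

```python
import json

# Hypothetical exported annotation (input line):
exported = {"text": "Banks must report suspicious activity.", "accept": ["ReportingAndCompliance"]}

# firstStep-format.py keeps only the text and the first accepted label (output line):
formatted = {"text": exported["text"], "label": exported["accept"][0]}
print(json.dumps(formatted))
# {"text": "Banks must report suspicious activity.", "label": "ReportingAndCompliance"}
```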
python_Code/five_examples_annotated.ipynb
ADDED
@@ -0,0 +1,100 @@
{
 "cells": [
  {
   "cell_type": "code",
   "execution_count": 1,
   "metadata": {},
   "outputs": [
    {
     "name": "stdout",
     "output_type": "stream",
     "text": [
      "Text: Banks that are at risk of failing selling bonds? Absolutely not! No way! The idea of where this money needs to come from should've been a thought that was had before these institutions took on crazy amounts of leverage and debt they couldn't pay. It's an obvious attempt at shifting the massive risk they hold onto unsuspecting investors instead of owning the bag themselves, and admitting they had no real risk management. Free money is becoming a thing of the past, it's time for these institutions to grow up and learn. Failure is always an option. Funds raised by selling off these bonds has a high chance of being similarly mismanaged by these at risk of failing institutions due to the aforementioned lack of real risk management. Actions speak louder than words, and we still live in the shadow of a great financial crisis (hmm, I wonder who could've caused that and why?) And constantly throwing the average Joe under the bus does a pretty bad job of helping maintain public confidence in the finance system.\n",
      "\n",
      "ReportingAndCompliance: 0.3665\n",
      "RiskManagement: 0.0330\n",
      "ConsumerProtection: 0.0310\n",
      "CorporateGovernance: 0.0423\n",
      "CapitalRequirements: 0.0245\n",
      "\n",
      "Text: The Wisconsin Bankers Association (aka the WBA) is the largest financial trade association in Wisconsin, representing over 200 state and nationally chartered banks, savings banks,and savings and loan associations located in communities throughout the State. WBA appreciates the opportunity to comment on the interim final rule. Over the past year, the Board of Governors of the Federal Reserve System (FRB) issued several interim final rules to except certain loans that are guaranteed under the Small Business Administration's (SBA's) Paycheck Protection Program (PPP) from the requirements of the Federal Reserve Act and the corresponding provisions of Regulation O.To reflect the latest program extension by Congress, FRB issued this interim final rule to extend the Regulation O exception to PPP loans through March 31, 2022. WBA filed comment letters in support of FRB's previous interim final rules as the removal of Regulation O obstacles through the exception has helped allow Wisconsin's banks to more efficiently address the needs of their insider-owned small businesses. FRB'spast interim final rules have helped ensuree ligible businesses have timely access to liquidity to help overcome economic hurdles resulting from the effects of COVID-19 and the mitigating efforts in effect throughout Wisconsin. WBA appreciates FRB's actions to provide continued clarity that loans made by a bank to insider-owned businesses that are guaranteed under SBA's PPP remain excepted from the Federal Reserve Act and the corresponding provisions of Regulation O. Without an extension of the exception, WBA fears some auditors and examiners would treat such loans differently than PPP loans made on or before June 30 ,2020. As have been requirements of the program since inception, any PPP loan made during the extended program period must still meet certain eligibility and documentation criteria, and have the same interest rate, payment, and loan term. Additionally, all eligibility and documentation criteria and all loan terms and program requirements remain exclusively set by SBA and cannot be altered by the lender. Therefore, FRB should once again extend its exception for PPP loans; this time for PPP loans made through March 31 ,2022. WBA also appreciates FRB's efforts to have promulgated the interim final rules in such a straight-forward manner and for using plain language in its interim final rules. WBA encourages FRB to continue such efforts in future rule makings and for any other regulatory review efforts.\n",
      "\n",
      "ReportingAndCompliance: 0.6879\n",
      "RiskManagement: 0.0000\n",
      "ConsumerProtection: 0.0048\n",
      "CorporateGovernance: 0.0000\n",
      "CapitalRequirements: 0.0000\n",
      "\n",
      "Text: How about you crooks focus on the billions being laundered by banks in plain fucking sight instead of intruding in our lives more. Disgusting. Aweful.\n",
      "\n",
      "ReportingAndCompliance: 0.4072\n",
      "RiskManagement: 0.2440\n",
      "ConsumerProtection: 0.3574\n",
      "CorporateGovernance: 0.3809\n",
      "CapitalRequirements: 0.2414\n",
      "\n",
      "Text: If adopted, this proposal [R-1726], would prove to be an invasion of privacy. In terms of digital assets, crypto exchanges are not held accountable in the same way that other financial institutions are, and have a track record of bad operational security when it comes to securely storing client information.\n",
      "\n",
      "ReportingAndCompliance: 0.4365\n",
      "RiskManagement: 0.5856\n",
      "ConsumerProtection: 0.3847\n",
      "CorporateGovernance: 0.1818\n",
      "CapitalRequirements: 0.1904\n",
      "\n",
      "Text: Amendments to 20402(d)(2) and 204.2(e)(2) and (4) make a savings account without transfer or withdrawal limits transaction accounts. Can a depository institution avoid having a savings account be a transaction account by imposing a transfer/withdrawal restriction? Must such a restriction be absolute, or can it be suggested though the imposition of transaction fees for excess transfers/withdrawals in a stated period? The prefatory text, including the FAQ found there consistently uses the verb 'suspend.' Is 'suspend' used in the dictionary sense of 'temporarily prevent from continuing or being in force or effect'? If so, is that deliberate so as to suggest that it's expected that depository institutions will re-impose transfer/withdrawal limits at some future date (e.g., once the local economy recovers from the present pandemic)? Does the Board anticipate reinstating savings account transfer limits in the future, or believe that they will be reimposed by depository institutions as an account or contract provision? Relationship to Regulation CC A related question regarding the impact of the Reg D changes on the definition of 'account' in Regulation CC (12 CFR Part 229), which appears in the definition to exclude, except for the purposes of subpart D, any savings account described in 12 CFR 204.2(d)(2) 'even though such accounts permit third party transfers.' I note that the Official Interpretations applicable to the 229.2(a)(1) definition of 'account' in Regulation CC suggests that savings deposits are excluded because they :'may have limited third party payment powers,' and the Board believed the 'EFA Act is intended to apply only to accounts that permit UNLIMITED (emphasis added) third party transfers.' Will, then, a bank that 'suspends' its limits on savings deposit transfers and withdrawals be perforce (and perhaps unwittingly) making those savings accounts subject to Regulation CC, or does the Regulation CC 'account' definition continue to exclude savings accounts as described in 204.2(d)(2)? Thank you for your consideration of these comments and questions.\n",
      "\n",
      "ReportingAndCompliance: 0.2221\n",
      "RiskManagement: 0.0007\n",
      "ConsumerProtection: 0.0513\n",
      "CorporateGovernance: 0.0031\n",
      "CapitalRequirements: 0.0000\n",
      "\n"
     ]
    }
   ],
   "source": [
    "import spacy\n",
    "\n",
    "# Load the trained model\n",
    "nlp = spacy.load('output/experiment1/model-best')\n",
    "\n",
    "# List of new text examples you want to classify\n",
    "texts = [\n",
    "    \"Banks that are at risk of failing selling bonds? Absolutely not! No way! The idea of where this money needs to come from should've been a thought that was had before these institutions took on crazy amounts of leverage and debt they couldn't pay. It's an obvious attempt at shifting the massive risk they hold onto unsuspecting investors instead of owning the bag themselves, and admitting they had no real risk management. Free money is becoming a thing of the past, it's time for these institutions to grow up and learn. Failure is always an option. Funds raised by selling off these bonds has a high chance of being similarly mismanaged by these at risk of failing institutions due to the aforementioned lack of real risk management. Actions speak louder than words, and we still live in the shadow of a great financial crisis (hmm, I wonder who could've caused that and why?) And constantly throwing the average Joe under the bus does a pretty bad job of helping maintain public confidence in the finance system.\",\n",
    "    \"The Wisconsin Bankers Association (aka the WBA) is the largest financial trade association in Wisconsin, representing over 200 state and nationally chartered banks, savings banks,and savings and loan associations located in communities throughout the State. WBA appreciates the opportunity to comment on the interim final rule. Over the past year, the Board of Governors of the Federal Reserve System (FRB) issued several interim final rules to except certain loans that are guaranteed under the Small Business Administration's (SBA's) Paycheck Protection Program (PPP) from the requirements of the Federal Reserve Act and the corresponding provisions of Regulation O.To reflect the latest program extension by Congress, FRB issued this interim final rule to extend the Regulation O exception to PPP loans through March 31, 2022. WBA filed comment letters in support of FRB's previous interim final rules as the removal of Regulation O obstacles through the exception has helped allow Wisconsin's banks to more efficiently address the needs of their insider-owned small businesses. FRB'spast interim final rules have helped ensuree ligible businesses have timely access to liquidity to help overcome economic hurdles resulting from the effects of COVID-19 and the mitigating efforts in effect throughout Wisconsin. WBA appreciates FRB's actions to provide continued clarity that loans made by a bank to insider-owned businesses that are guaranteed under SBA's PPP remain excepted from the Federal Reserve Act and the corresponding provisions of Regulation O. Without an extension of the exception, WBA fears some auditors and examiners would treat such loans differently than PPP loans made on or before June 30 ,2020. As have been requirements of the program since inception, any PPP loan made during the extended program period must still meet certain eligibility and documentation criteria, and have the same interest rate, payment, and loan term. Additionally, all eligibility and documentation criteria and all loan terms and program requirements remain exclusively set by SBA and cannot be altered by the lender. Therefore, FRB should once again extend its exception for PPP loans; this time for PPP loans made through March 31 ,2022. WBA also appreciates FRB's efforts to have promulgated the interim final rules in such a straight-forward manner and for using plain language in its interim final rules. WBA encourages FRB to continue such efforts in future rule makings and for any other regulatory review efforts.\",\n",
    "    \"How about you crooks focus on the billions being laundered by banks in plain fucking sight instead of intruding in our lives more. Disgusting. Aweful.\",\n",
    "    \"If adopted, this proposal [R-1726], would prove to be an invasion of privacy. In terms of digital assets, crypto exchanges are not held accountable in the same way that other financial institutions are, and have a track record of bad operational security when it comes to securely storing client information.\",\n",
    "    \"Amendments to 20402(d)(2) and 204.2(e)(2) and (4) make a savings account without transfer or withdrawal limits transaction accounts. Can a depository institution avoid having a savings account be a transaction account by imposing a transfer/withdrawal restriction? Must such a restriction be absolute, or can it be suggested though the imposition of transaction fees for excess transfers/withdrawals in a stated period? The prefatory text, including the FAQ found there consistently uses the verb 'suspend.' Is 'suspend' used in the dictionary sense of 'temporarily prevent from continuing or being in force or effect'? If so, is that deliberate so as to suggest that it's expected that depository institutions will re-impose transfer/withdrawal limits at some future date (e.g., once the local economy recovers from the present pandemic)? Does the Board anticipate reinstating savings account transfer limits in the future, or believe that they will be reimposed by depository institutions as an account or contract provision? Relationship to Regulation CC A related question regarding the impact of the Reg D changes on the definition of 'account' in Regulation CC (12 CFR Part 229), which appears in the definition to exclude, except for the purposes of subpart D, any savings account described in 12 CFR 204.2(d)(2) 'even though such accounts permit third party transfers.' I note that the Official Interpretations applicable to the 229.2(a)(1) definition of 'account' in Regulation CC suggests that savings deposits are excluded because they :'may have limited third party payment powers,' and the Board believed the 'EFA Act is intended to apply only to accounts that permit UNLIMITED (emphasis added) third party transfers.' Will, then, a bank that 'suspends' its limits on savings deposit transfers and withdrawals be perforce (and perhaps unwittingly) making those savings accounts subject to Regulation CC, or does the Regulation CC 'account' definition continue to exclude savings accounts as described in 204.2(d)(2)? Thank you for your consideration of these comments and questions.\"\n",
    "]\n",
    "\n",
    "for text in texts:\n",
    "    doc = nlp(text)\n",
    "    print(f\"Text: {text}\\n\")\n",
    "    for label, score in doc.cats.items():\n",
    "        print(f\"{label}: {score:.4f}\")\n",
    "    print()"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "venv",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.11.4"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 2
}
python_Code/secondStep-score.py
ADDED
@@ -0,0 +1,52 @@
import spacy
from spacy.training import Example
import jsonlines
import random

# Load a blank English model
nlp = spacy.blank("en")

# Add text classification pipeline to the model
textcat = nlp.add_pipe('textcat_multilabel', last=True)
textcat.add_label("CapitalRequirements")
textcat.add_label("ConsumerProtection")
textcat.add_label("RiskManagement")
textcat.add_label("ReportingAndCompliance")
textcat.add_label("CorporateGovernance")

# Path to the processed data file
processed_data_file = "data/firstStep_file.jsonl"

# Open the JSONL file and extract text and labels
with jsonlines.open(processed_data_file) as reader:
    processed_data = list(reader)

# Convert processed data to spaCy format
spacy_train_data = []
for obj in processed_data:
    text = obj["text"]
    label = {
        "CapitalRequirements": obj["label"] == "CapitalRequirements",
        "ConsumerProtection": obj["label"] == "ConsumerProtection",
        "RiskManagement": obj["label"] == "RiskManagement",
        "ReportingAndCompliance": obj["label"] == "ReportingAndCompliance",
        "CorporateGovernance": obj["label"] == "CorporateGovernance"
    }
    spacy_train_data.append(Example.from_dict(nlp.make_doc(text), {"cats": label}))

# Initialize the model and get the optimizer
optimizer = nlp.initialize()

# Fix the random seed once, before training, so each iteration gets a
# different shuffle order (re-seeding inside the loop would repeat it).
spacy.util.fix_random_seed(1)

# Train the text classification model
n_iter = 10
for i in range(n_iter):
    random.shuffle(spacy_train_data)
    losses = {}
    for batch in spacy.util.minibatch(spacy_train_data, size=8):
        nlp.update(batch, losses=losses, sgd=optimizer)
    print("Iteration:", i, "Losses:", losses)

# Save the trained model
output_dir = "./my_trained_model"
nlp.to_disk(output_dir)
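A minimal sketch of reloading the model saved above and checking what it contains (assumes `./my_trained_model` was written by this script):

```python
import spacy

nlp = spacy.load("./my_trained_model")
print(nlp.pipe_names)                             # ['textcat_multilabel']
print(nlp.get_pipe("textcat_multilabel").labels)  # the five category names
```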
python_Code/thirdStep-label.py
ADDED
@@ -0,0 +1,23 @@
import spacy
import jsonlines

# Load the trained model
model_path = "./my_trained_model"
nlp = spacy.load(model_path)

# Load the unlabeled data
unlabeled_data_file = "data/train.jsonl"

# Open the JSONL file and classify each record
classified_data = []
with jsonlines.open(unlabeled_data_file) as reader:
    for record in reader:
        text = record["text"]
        doc = nlp(text)
        predicted_labels = doc.cats
        classified_data.append({"text": text, "predicted_labels": predicted_labels})

# Optionally, you can save the classified data to a file or process it further
output_file = "data/thirdStep_file.jsonl"
with jsonlines.open(output_file, mode="w") as writer:
    writer.write_all(classified_data)
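A quick way to spot-check the classified output, assuming the file written above exists:

```python
import jsonlines

# Peek at the first classified record and its per-category scores.
with jsonlines.open("data/thirdStep_file.jsonl") as reader:
    first = next(iter(reader))
print(first["text"][:80])
print(first["predicted_labels"])
```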
requirements-dev.txt
ADDED
@@ -0,0 +1 @@
prodigy==1.15.2
requirements.txt
ADDED
@@ -0,0 +1,92 @@
aiofiles==23.2.1
altair==5.3.0
annotated-types==0.6.0
anyio==4.3.0
attrs==23.2.0
blis==0.7.11
cachetools==5.3.3
catalogue==2.0.10
certifi==2024.2.2
charset-normalizer==3.3.2
click==8.1.7
cloudpathlib==0.16.0
confection==0.1.4
contourpy==1.2.1
cycler==0.12.1
cymem==2.0.8
en-core-web-lg @ https://github.com/explosion/spacy-models/releases/download/en_core_web_lg-3.7.1/en_core_web_lg-3.7.1-py3-none-any.whl#sha256=ab70aeb6172cde82508f7739f35ebc9918a3d07debeed637403c8f794ba3d3dc
exceptiongroup==1.2.0
fastapi==0.102.0
ffmpy==0.3.2
filelock==3.14.0
fonttools==4.51.0
fsspec==2024.3.1
gradio==4.29.0
gradio-client==0.16.1
h11==0.14.0
httpcore==1.0.5
httpx==0.27.0
huggingface-hub==0.23.0
idna==3.6
importlib-metadata==7.1.0
importlib-resources==6.4.0
install==1.3.5
Jinja2==3.1.3
joblib==1.4.2
jsonlines==4.0.0
jsonschema==4.22.0
jsonschema-specifications==2023.12.1
kiwisolver==1.4.5
langcodes==3.3.0
MarkupSafe==2.1.5
matplotlib==3.8.4
murmurhash==1.0.10
numpy==1.26.4
orjson==3.10.3
packaging==24.0
pandas==2.2.2
peewee==3.16.3
pillow==10.3.0
preshed==3.0.9
pydantic==2.6.4
pydantic-core==2.16.3
pydub==0.25.1
PyJWT==2.8.0
pyparsing==3.1.2
python-dateutil==2.9.0.post0
python-dotenv==1.0.1
python-multipart==0.0.9
pytz==2024.1
PyYAML==6.0.1
radicli==0.0.25
referencing==0.35.1
requests==2.31.0
rpds-py==0.18.0
ruff==0.4.2
scikit-learn==1.4.2
scipy==1.13.0
semantic-version==2.10.0
six==1.16.0
smart-open==6.4.0
sniffio==1.3.1
spacy==3.7.4
spacy-legacy==3.0.12
spacy-llm==0.7.1
spacy-loggers==1.0.5
srsly==2.4.8
starlette==0.27.0
thinc==8.2.3
threadpoolctl==3.5.0
tomlkit==0.12.0
toolz==0.12.1
tqdm==4.66.2
typeguard==3.0.2
typer==0.9.4
typing-extensions==4.10.0
tzdata==2024.1
urllib3==2.2.1
uvicorn==0.26.0
wasabi==1.1.2
weasel==0.3.4
websockets==11.0.3
zipp==3.18.1