ManjinderUNCC committed on
Commit
425382d
1 Parent(s): 152518e

Delete project.yml

Files changed (1)
  1. project.yml +0 -294
project.yml DELETED
@@ -1,294 +0,0 @@
- # Title and description of the project
- title: "Citations of ECFR Banking Regulation in a spaCy pipeline."
- description: "Custom text classification project for spaCy v3."
-
- vars:
-   lang: "en"
-   train: corpus/train.spacy
-   dev: corpus/dev.spacy
-   version: "0.1.0"
-   gpu_id: -1
-   vectors_model: "en_core_web_lg"
-   name: ecfr_ner
-   prodigy:
-     ner_labels: ecfr_initial_ner
-     ner_manual_labels: ecfr_manual_ner
-     senter_labels: ecfr_labeled_sents
-     ner_labeled_dataset: ecfr_labeled_ner
-   assets:
-     ner_labels: assets/ecfr_ner_labels.jsonl
-     senter_labels: assets/ecfr_senter_labels.jsonl
-     ner_patterns: assets/patterns.jsonl
-     corpus_labels: corpus/labels
-     data_files: data
-     trained_model: my_trained_model
-     trained_model_textcat: my_trained_model/textcat_multilabel
-     output_models: output
-     python_code: python_Code
-
- directories: ["data", "python_Code"]
-
- assets:
-   - dest: "data/firstStep_file.jsonl"
-     description: "JSONL file containing formatted data from the first step"
-   - dest: "data/five_examples_annotated5.jsonl"
-     description: "JSONL file containing five annotated examples"
-   - dest: "data/goldenEval.jsonl"
-     description: "JSONL file containing golden evaluation data"
-   - dest: "data/thirdStep_file.jsonl"
-     description: "JSONL file containing classified data from the third step"
-   - dest: "data/train.jsonl"
-     description: "JSONL file containing training data"
-   - dest: "data/train200.jsonl"
-     description: "JSONL file containing initial training data"
-   - dest: "data/train4465.jsonl"
-     description: "JSONL file containing formatted and labeled training data"
-   - dest: "python_Code/finalStep-formatLabel.py"
-     description: "Python script for formatting labeled data in the final step"
-   - dest: "python_Code/firstStep-format.py"
-     description: "Python script for formatting data in the first step"
-   - dest: "python_Code/five_examples_annotated.ipynb"
-     description: "Jupyter notebook containing five annotated examples"
-   - dest: "python_Code/secondStep-score.py"
-     description: "Python script for scoring data in the second step"
-   - dest: "python_Code/thirdStep-label.py"
-     description: "Python script for labeling data in the third step"
-   - dest: "python_Code/train_eval_split.ipynb"
-     description: "Jupyter notebook for training and evaluation data splitting"
-   - dest: "python_Code/evaluate_model.py"
-     description: "Python script for evaluating the trained model"
-   - dest: "README.md"
-     description: "Markdown file containing project documentation"
-
- workflows:
-   train:
-     - preprocess
-     - train-text-classification-model
-     - classify-unlabeled-data
-     - format-labeled-data
-     # - review-evaluation-data
-     # - export-reviewed-evaluation-data
-     # - import-training-data
-     # - import-golden-evaluation-data
-     # - train-model-experiment1
-     # - convert-data-to-spacy-format
-   evaluate:
-     - evaluate-model
-
- commands:
-   - name: "preprocess"
-     help: |
-       Execute the Python script `firstStep-format.py`, which performs the initial formatting of a dataset file for the first step of the project. This script extracts text and labels from a dataset file in JSONL format and writes them to a new JSONL file in a specific format.
-
-       Usage:
-       ```
-       spacy project run preprocess
-       ```
-
-       Explanation:
-       - The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
-       - It extracts text and labels from each JSON object in the dataset file.
-       - If both text and at least one label are available, it writes a new JSON object with the extracted text and label to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default).
-       - If either text or label is missing from a JSON object, a warning message is printed.
-       - Upon completion, the script prints a message confirming the processing and the path to the output file.
-     script:
-       - "python3 python_Code/firstStep-format.py"
-
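The preprocessing step described above can be sketched in a few lines. This is a hypothetical reconstruction of `firstStep-format.py`, not the deleted script itself; the JSONL field names (`text`, `label`) are assumptions:

```python
import json

# Assumed project defaults from the help text above.
dataset_file = "data/train200.jsonl"
output_file = "data/firstStep_file.jsonl"

def format_records(lines):
    """Yield {'text': ..., 'label': ...} for records that have both fields."""
    for line in lines:
        record = json.loads(line)
        text = record.get("text")
        labels = record.get("label")
        if text and labels:
            yield {"text": text, "label": labels}
        else:
            # Matches the documented behavior: warn and skip incomplete records.
            print("Warning: skipping record with missing text or label")

def run(in_path, out_path):
    with open(in_path, encoding="utf-8") as fin, \
         open(out_path, "w", encoding="utf-8") as fout:
        for rec in format_records(fin):
            fout.write(json.dumps(rec) + "\n")
    print(f"Processing complete. Output written to {out_path}")
```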
-   - name: "train-text-classification-model"
-     help: |
-       Train the text classification model for the second step of the project using the `secondStep-score.py` script. This script loads a blank English spaCy model and adds a text classification pipeline to it. It then trains the model using the processed data from the first step.
-
-       Usage:
-       ```
-       spacy project run train-text-classification-model
-       ```
-
-       Explanation:
-       - The script `secondStep-score.py` loads a blank English spaCy model and adds a text classification pipeline to it.
-       - It reads processed data from the file specified in the `processed_data_file` variable (`data/firstStep_file.jsonl` by default).
-       - The processed data is converted to spaCy format for training the model.
-       - The model is trained on the converted data for a specified number of iterations (`n_iter`).
-       - Losses are printed for each iteration during training.
-       - Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
-     script:
-       - "python3 python_Code/secondStep-score.py"
-
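The "converted to spaCy format" step above might look like the following sketch. The label set `LABELS` is a placeholder; the real `secondStep-score.py` defines its own labels:

```python
import json

LABELS = ["BANKING", "NOT_BANKING"]  # hypothetical label set

def to_spacy_format(jsonl_lines, labels=LABELS):
    """Turn processed JSONL records into (text, {"cats": ...}) tuples,
    the annotation shape spaCy's textcat_multilabel pipe trains on."""
    train_data = []
    for line in jsonl_lines:
        record = json.loads(line)
        # 1.0 if the label was assigned to this record, else 0.0.
        cats = {label: float(label in record.get("label", [])) for label in labels}
        train_data.append((record["text"], {"cats": cats}))
    return train_data
```

Each tuple would then be wrapped in a `spacy.training.Example` and passed to `nlp.update()` for `n_iter` iterations, with losses printed per iteration as the help text describes.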
-   - name: "classify-unlabeled-data"
-     help: |
-       Classify the unlabeled data for the third step of the project using the `thirdStep-label.py` script. This script loads the trained spaCy model from the previous step and classifies each record in the unlabeled dataset.
-
-       Usage:
-       ```
-       spacy project run classify-unlabeled-data
-       ```
-
-       Explanation:
-       - The script `thirdStep-label.py` loads the trained spaCy model from the specified model directory (`./my_trained_model` by default).
-       - It reads the unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/train.jsonl` by default).
-       - Each record in the unlabeled data is classified using the loaded model.
-       - The predicted labels for each record are extracted and stored along with the text.
-       - The classified data is optionally saved to a file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
-     script:
-       - "python3 python_Code/thirdStep-label.py"
-
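The core loop of this step can be sketched as below. This is an assumption about `thirdStep-label.py`'s structure, with the model call injected as a function so the logic stands alone:

```python
import json

def label_records(jsonl_lines, classify):
    """Classify each unlabeled record and keep the predicted scores
    alongside the text. `classify` stands in for the loaded model."""
    labeled = []
    for line in jsonl_lines:
        record = json.loads(line)
        scores = classify(record["text"])  # e.g. {label: probability}
        labeled.append({"text": record["text"], "cats": scores})
    return labeled
```

In the real script, `classify` would presumably be `lambda text: nlp(text).cats` with `nlp = spacy.load("./my_trained_model")`, and the result written to `data/thirdStep_file.jsonl`.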
134
- - name: "format-labeled-data"
135
- help: |
136
- Format the labeled data for the final step of the project using the `finalStep-formatLabel.py` script. This script processes the classified data from the third step and transforms it into a specific format, considering a threshold for label acceptance.
137
-
138
- Usage:
139
- ```
140
- spacy project run format-labeled-data
141
- ```
142
-
143
- Explanation:
144
- - The script `finalStep-formatLabel.py` reads classified data from the file specified in the `input_file` variable (`data/thirdStep_file.jsonl` by default).
145
- - For each record, it determines accepted categories based on a specified threshold.
146
- - It constructs an output record containing the text, predicted labels, accepted categories, answer (accept/reject), and options with meta information.
147
- - The transformed data is written to the file specified in the `output_file` variable (`data/train4465.jsonl` by default).
148
- script:
149
- - "python3 python_Code/finalStep-formatLabel.py"
150
-
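The thresholding described above can be sketched per record as follows. The cutoff value and the exact output keys are assumptions; the deleted `finalStep-formatLabel.py` defines the real ones:

```python
THRESHOLD = 0.5  # assumed acceptance cutoff

def format_record(record, threshold=THRESHOLD):
    """Keep categories whose score clears the threshold and mark the
    record accepted only when at least one category survives."""
    accepted = [label for label, score in record["cats"].items() if score >= threshold]
    return {
        "text": record["text"],
        "cats": record["cats"],          # predicted labels with scores
        "accept": accepted,              # accepted categories
        "answer": "accept" if accepted else "reject",
        "options": [{"id": label, "text": label} for label in record["cats"]],
    }
```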
-   - name: "evaluate-model"
-     help: |
-       Evaluate the trained model using the golden evaluation data and print the metrics.
-
-       Usage:
-       ```
-       spacy project run evaluate-model
-       ```
-
-       Explanation:
-       - The script `evaluate_model.py` loads the trained model and evaluates it using the golden evaluation data.
-       - It calculates evaluation metrics such as accuracy, precision, recall, and F1-score.
-       - The metrics are printed to the console.
-     script:
-       - "python3 python_Code/evaluate_model.py"
-
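A minimal sketch of how such metrics could be computed, assuming per-record label sets compared against the golden data with micro-averaging (the actual `evaluate_model.py` may average differently):

```python
def evaluate(pred_labels, gold_labels):
    """Micro-averaged precision, recall, and F1 over parallel lists of
    predicted and gold label sets."""
    tp = fp = fn = 0
    for pred, gold in zip(pred_labels, gold_labels):
        pred, gold = set(pred), set(gold)
        tp += len(pred & gold)   # correctly predicted labels
        fp += len(pred - gold)   # predicted but not in gold
        fn += len(gold - pred)   # in gold but missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}
```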
-   # - name: "review-evaluation-data"
-   #   help: |
-   #     Review the evaluation data in Prodigy and automatically accept annotations.
-
-   #     Usage:
-   #     ```
-   #     spacy project run review-evaluation-data
-   #     ```
-
-   #     Explanation:
-   #     - The command reviews the evaluation data in Prodigy.
-   #     - It automatically accepts annotations made during the review process.
-   #     - Only sessions allowed by the environment variable PRODIGY_ALLOWED_SESSIONS are permitted to review data. In this case, the session 'reviwer' is allowed.
-   #   script:
-   #     - "PRODIGY_ALLOWED_SESSIONS=reviwer python3 -m prodigy review project3eval-review project3eval --auto-accept"
-
-   # - name: "export-reviewed-evaluation-data"
-   #   help: |
-   #     Export the reviewed evaluation data from Prodigy to a JSONL file named 'goldenEval.jsonl'.
-
-   #     Usage:
-   #     ```
-   #     spacy project run export-reviewed-evaluation-data
-   #     ```
-
-   #     Explanation:
-   #     - The command exports the reviewed evaluation data from Prodigy to a JSONL file.
-   #     - The data is exported from the Prodigy database associated with the project named 'project3eval-review'.
-   #     - The exported data is saved to the file 'goldenEval.jsonl'.
-   #     - This command helps preserve the reviewed annotations for further analysis or processing.
-   #   script:
-   #     - "prodigy db-out project3eval-review > goldenEval.jsonl"
-
-   # - name: "import-training-data"
-   #   help: |
-   #     Import the training data into Prodigy from a JSONL file named 'train200.jsonl'.
-
-   #     Usage:
-   #     ```
-   #     spacy project run import-training-data
-   #     ```
-
-   #     Explanation:
-   #     - The command imports the training data into Prodigy from the specified JSONL file.
-   #     - The data is imported into the Prodigy database associated with the project named 'prodigy3train'.
-   #     - This command prepares the training data for annotation and model training in Prodigy.
-   #   script:
-   #     - "prodigy db-in prodigy3train train200.jsonl"
-
-   # - name: "import-golden-evaluation-data"
-   #   help: |
-   #     Import the golden evaluation data into Prodigy from a JSONL file named 'goldeneval.jsonl'.
-
-   #     Usage:
-   #     ```
-   #     spacy project run import-golden-evaluation-data
-   #     ```
-
-   #     Explanation:
-   #     - The command imports the golden evaluation data into Prodigy from the specified JSONL file.
-   #     - The data is imported into the Prodigy database associated with the project named 'golden3'.
-   #     - This command prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
-   #   script:
-   #     - "prodigy db-in golden3 goldeneval.jsonl"
-
-   # - name: "train-model-experiment1"
-   #   help: |
-   #     Train a text classification model using Prodigy with the 'prodigy3train' dataset, evaluating on 'golden3'.
-
-   #     Usage:
-   #     ```
-   #     spacy project run train-model-experiment1
-   #     ```
-
-   #     Explanation:
-   #     - The command trains a text classification model using Prodigy.
-   #     - It uses the 'prodigy3train' dataset for training and evaluates the model on the 'golden3' dataset.
-   #     - The trained model is saved to the './output/experiment1' directory.
-   #   script:
-   #     - "python3 -m prodigy train --textcat-multilabel prodigy3train,eval:golden3 ./output/experiment1"
-
-   # - name: "download-model"
-   #   help: |
-   #     Download the English language model 'en_core_web_lg' from spaCy.
-
-   #     Usage:
-   #     ```
-   #     spacy project run download-model
-   #     ```
-
-   #     Explanation:
-   #     - The command downloads the English language model 'en_core_web_lg' from spaCy.
-   #     - This model is used as the base model for further data processing and training in the project.
-   #   script:
-   #     - "python3 -m spacy download en_core_web_lg"
-
-   # - name: "convert-data-to-spacy-format"
-   #   help: |
-   #     Convert the annotated data from Prodigy to spaCy format using the 'prodigy3train' and 'golden3' datasets.
-
-   #     Usage:
-   #     ```
-   #     spacy project run convert-data-to-spacy-format
-   #     ```
-
-   #     Explanation:
-   #     - The command converts the annotated data from Prodigy to spaCy format.
-   #     - It uses the 'prodigy3train' and 'golden3' datasets for conversion.
-   #     - The converted data is saved to the './corpus' directory with the base model 'en_core_web_lg'.
-   #   script:
-   #     - "python3 -m prodigy data-to-spacy --textcat-multilabel prodigy3train,eval:golden3 ./corpus --base-model en_core_web_lg"
-
-   # - name: "train-custom-model"
-   #   help: |
-   #     Train a custom text classification model using spaCy with the converted data in spaCy format.
-
-   #     Usage:
-   #     ```
-   #     spacy project run train-custom-model
-   #     ```
-
-   #     Explanation:
-   #     - The command trains a custom text classification model using spaCy.
-   #     - It uses the converted data in spaCy format located in the './corpus' directory.
-   #     - The model is trained using the configuration defined in 'corpus/config.cfg'.
-   #   script:
-   #     - "python3 -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
-