ManjinderUNCC committed · commit 6150b70 (verified) · parent: 425382d

Upload project.yml

Files changed (1): project.yml (+311 −0)
# Title and description of the project
title: "Citations of ECFR Banking Regulation in a spaCy pipeline."
description: "Custom text classification project for spaCy v3 adapted from the spaCy v3"

vars:
  lang: "en"
  train: corpus/train.spacy
  dev: corpus/dev.spacy
  version: "0.1.0"
  gpu_id: -1
  vectors_model: "en_core_web_lg"
  name: ecfr_ner
  prodigy:
    ner_labels: ecfr_initial_ner
    ner_manual_labels: ecfr_manual_ner
    senter_labels: ecfr_labeled_sents
    ner_labeled_dataset: ecfr_labeled_ner
  assets:
    ner_labels: assets/ecfr_ner_labels.jsonl
    senter_labels: assets/ecfr_senter_labels.jsonl
    ner_patterns: assets/patterns.jsonl
  corpus_labels: corpus/labels
  data_files: data
  trained_model: my_trained_model
  trained_model_textcat: my_trained_model/textcat_multilabel
  output_models: output
  python_code: python_Code

directories: ["data", "python_Code"]

assets:
  - dest: "data/firstStep_file.jsonl"
    description: "JSONL file containing formatted data from the first step"
  - dest: "data/five_examples_annotated5.jsonl"
    description: "JSONL file containing five annotated examples"
  - dest: "data/goldenEval.jsonl"
    description: "JSONL file containing golden evaluation data"
  - dest: "data/thirdStep_file.jsonl"
    description: "JSONL file containing classified data from the third step"
  - dest: "data/train.jsonl"
    description: "JSONL file containing training data"
  - dest: "data/train200.jsonl"
    description: "JSONL file containing initial training data"
  - dest: "data/train4465.jsonl"
    description: "JSONL file containing formatted and labeled training data"
  - dest: "python_Code/finalStep-formatLabel.py"
    description: "Python script for formatting labeled data in the final step"
  - dest: "python_Code/firstStep-format.py"
    description: "Python script for formatting data in the first step"
  - dest: "python_Code/five_examples_annotated.ipynb"
    description: "Jupyter notebook containing five annotated examples"
  - dest: "python_Code/secondStep-score.py"
    description: "Python script for scoring data in the second step"
  - dest: "python_Code/thirdStep-label.py"
    description: "Python script for labeling data in the third step"
  - dest: "python_Code/train_eval_split.ipynb"
    description: "Jupyter notebook for training and evaluation data splitting"
  - dest: "python_Code/evaluate_model.py"
    description: "Python script for evaluating the trained model"
  - dest: "README.md"
    description: "Markdown file containing project documentation"
+
63
+ workflows:
64
+ train:
65
+ - preprocess
66
+ - train-text-classification-model
67
+ - classify-unlabeled-data
68
+ - format-labeled-data
69
+ # - review-evaluation-data
70
+ # - export-reviewed-evaluation-data
71
+ # - import-training-data
72
+ # - import-golden-evaluation-data
73
+ # - train-model-experiment1
74
+ # - convert-data-to-spacy-format
75
+ evaluate:
76
+ - set-threshold
77
+ - evaluate-model
78
+
commands:
  - name: "preprocess"
    help: |
      Execute the Python script `firstStep-format.py`, which performs the initial formatting of a dataset file for the first step of the project. This script extracts text and labels from a dataset file in JSONL format and writes them to a new JSONL file in a specific format.

      Usage:
      ```
      spacy project run preprocess
      ```

      Explanation:
      - The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
      - It extracts text and labels from each JSON object in the dataset file.
      - If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and label.
      - If either text or label is missing in a JSON object, a warning message is printed.
      - Upon completion, the script prints a message confirming the processing and the path to the output file.
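
      As a rough sketch, the core of the script can be reimplemented as follows (the function names and the `accept` record key are illustrative assumptions, not taken verbatim from `firstStep-format.py`):

      ```python
      import json

      def format_record(obj):
          """Return a formatted record, or None if text or labels are missing."""
          text = obj.get("text")
          labels = obj.get("accept")  # assumed key holding the labels
          if not text or not labels:
              print(f"Warning: skipping record with missing text or label: {obj}")
              return None
          return {"text": text, "label": labels}

      def convert(dataset_file="data/train200.jsonl",
                  output_file="data/firstStep_file.jsonl"):
          with open(dataset_file, encoding="utf-8") as fin, \
               open(output_file, "w", encoding="utf-8") as fout:
              for line in fin:
                  record = format_record(json.loads(line))
                  if record is not None:
                      fout.write(json.dumps(record) + "\n")
          print(f"Done. Output written to {output_file}")
      ```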
    script:
      - "python3 python_Code/firstStep-format.py"

  - name: "train-text-classification-model"
    help: |
      Train the text classification model for the second step of the project using the `secondStep-score.py` script. This script loads a blank English spaCy model and adds a text classification pipeline to it. It then trains the model using the processed data from the first step.

      Usage:
      ```
      spacy project run train-text-classification-model
      ```

      Explanation:
      - The script `secondStep-score.py` loads a blank English spaCy model and adds a text classification pipeline to it.
      - It reads processed data from the file specified in the `processed_data_file` variable (`data/firstStep_file.jsonl` by default).
      - The processed data is converted to spaCy format for training the model.
      - The model is trained using the converted data for a specified number of iterations (`n_iter`).
      - Losses are printed for each iteration during training.
      - Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
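
      The conversion to spaCy's textcat training format can be sketched like this (the label set and record keys are illustrative assumptions; the real script defines its own):

      ```python
      # Illustrative label set; the actual script derives its own labels.
      LABELS = ["BANKING", "CITATION"]

      def to_training_example(record, labels=LABELS):
          """Convert one processed JSONL record into spaCy's (text, annotations) form."""
          cats = {label: 0.0 for label in labels}
          for label in record.get("label", []):
              if label in cats:
                  cats[label] = 1.0
          return record["text"], {"cats": cats}

      # The training loop itself then looks roughly like:
      #   import spacy
      #   nlp = spacy.blank("en")
      #   textcat = nlp.add_pipe("textcat_multilabel")
      #   for label in LABELS:
      #       textcat.add_label(label)
      #   ...nlp.update(...) over n_iter epochs, printing losses,
      #   then nlp.to_disk("./my_trained_model")
      ```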
    script:
      - "python3 python_Code/secondStep-score.py"

  - name: "classify-unlabeled-data"
    help: |
      Classify the unlabeled data for the third step of the project using the `thirdStep-label.py` script. This script loads the trained spaCy model from the previous step and classifies each record in the unlabeled dataset.

      Usage:
      ```
      spacy project run classify-unlabeled-data
      ```

      Explanation:
      - The script `thirdStep-label.py` loads the trained spaCy model from the specified model directory (`./my_trained_model` by default).
      - It reads the unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/train.jsonl` by default).
      - Each record in the unlabeled data is classified using the loaded model.
      - The predicted labels for each record are extracted and stored along with the text.
      - The classified data is optionally saved to a file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
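
      Extracting the predicted labels from a classified document can be sketched as follows (assuming the scores live in a `doc.cats`-style dict; the 0.5 threshold is an illustrative default):

      ```python
      def predicted_labels(cats, threshold=0.5):
          """Return the labels whose score clears the threshold, highest score first."""
          ranked = sorted(cats.items(), key=lambda item: item[1], reverse=True)
          return [label for label, score in ranked if score >= threshold]
      ```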
    script:
      - "python3 python_Code/thirdStep-label.py"

  - name: "format-labeled-data"
    help: |
      Format the labeled data for the final step of the project using the `finalStep-formatLabel.py` script. This script processes the classified data from the third step and transforms it into a specific format, considering a threshold for label acceptance.

      Usage:
      ```
      spacy project run format-labeled-data
      ```

      Explanation:
      - The script `finalStep-formatLabel.py` reads classified data from the file specified in the `input_file` variable (`data/thirdStep_file.jsonl` by default).
      - For each record, it determines accepted categories based on a specified threshold.
      - It constructs an output record containing the text, predicted labels, accepted categories, answer (accept/reject), and options with meta information.
      - The transformed data is written to the file specified in the `output_file` variable (`data/train4465.jsonl` by default).
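
      The record construction can be sketched as below (field names such as `options` and `meta` follow Prodigy's task format, but the exact keys the script emits are an assumption):

      ```python
      def build_output_record(text, scores, threshold=0.5):
          """Build one output record from a text and its predicted label scores."""
          accepted = [label for label, score in scores.items() if score >= threshold]
          return {
              "text": text,
              "cats": scores,
              "accept": accepted,
              # Reject records in which no category clears the threshold.
              "answer": "accept" if accepted else "reject",
              # Prodigy-style options carrying each raw score as meta information.
              "options": [{"id": label, "text": label, "meta": round(score, 2)}
                          for label, score in scores.items()],
          }
      ```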
    script:
      - "python3 python_Code/finalStep-formatLabel.py"

  - name: "evaluate-model"
    help: |
      Evaluate the trained model using the evaluation data and print the metrics.

      Usage:
      ```
      spacy project run evaluate-model
      ```

      Explanation:
      - The script `evaluate_model.py` loads the trained model and evaluates it using the golden evaluation data.
      - It calculates evaluation metrics such as accuracy, precision, recall, and F1-score.
      - The metrics are printed to the console.
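
      The metrics themselves can be computed as in this generic micro-averaged sketch (`evaluate_model.py` may aggregate differently):

      ```python
      def prf(gold, pred):
          """Micro-averaged precision, recall, and F1 over sets of
          (document_id, label) pairs."""
          true_positives = len(gold & pred)
          precision = true_positives / len(pred) if pred else 0.0
          recall = true_positives / len(gold) if gold else 0.0
          f1 = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
          return precision, recall, f1
      ```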
    script:
      - "python python_Code/evaluate_model.py"

  - name: "set-threshold"
    help: |
      Set the threshold for text categorization in a trained model.

      Usage:
      ```
      spacy project run set-threshold <model_path> <threshold>
      ```

      Explanation:
      - The script loads the trained model from the specified path.
      - It sets the threshold for text categorization to the specified value.
    script:
      - "python python_Code/threshold.py"

  # - name: "review-evaluation-data"
  #   help: |
  #     Review the evaluation data in Prodigy and automatically accept annotations.
  #
  #     Usage:
  #     ```
  #     spacy project run review-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command reviews the evaluation data in Prodigy.
  #     - It automatically accepts annotations made during the review process.
  #     - Only sessions allowed by the environment variable PRODIGY_ALLOWED_SESSIONS are permitted to review data. In this case, the session 'reviwer' is allowed.
  #   script:
  #     - "PRODIGY_ALLOWED_SESSIONS=reviwer python3 -m prodigy review project3eval-review project3eval --auto-accept"

  # - name: "export-reviewed-evaluation-data"
  #   help: |
  #     Export the reviewed evaluation data from Prodigy to a JSONL file named 'goldenEval.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run export-reviewed-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command exports the reviewed evaluation data from Prodigy to a JSONL file.
  #     - The data is exported from the Prodigy database associated with the project named 'project3eval-review'.
  #     - The exported data is saved to the file 'goldenEval.jsonl'.
  #     - This command helps in preserving the reviewed annotations for further analysis or processing.
  #   script:
  #     - "prodigy db-out project3eval-review > goldenEval.jsonl"

  # - name: "import-training-data"
  #   help: |
  #     Import the training data into Prodigy from a JSONL file named 'train200.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run import-training-data
  #     ```
  #
  #     Explanation:
  #     - The command imports the training data into Prodigy from the specified JSONL file.
  #     - The data is imported into the Prodigy database associated with the project named 'prodigy3train'.
  #     - This command prepares the training data for annotation and model training in Prodigy.
  #   script:
  #     - "prodigy db-in prodigy3train train200.jsonl"

  # - name: "import-golden-evaluation-data"
  #   help: |
  #     Import the golden evaluation data into Prodigy from a JSONL file named 'goldenEval.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run import-golden-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command imports the golden evaluation data into Prodigy from the specified JSONL file.
  #     - The data is imported into the Prodigy database associated with the project named 'golden3'.
  #     - This command prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
  #   script:
  #     - "prodigy db-in golden3 goldenEval.jsonl"

  # - name: "train-model-experiment1"
  #   help: |
  #     Train a text classification model using Prodigy with the 'prodigy3train' dataset, evaluating it on 'golden3'.
  #
  #     Usage:
  #     ```
  #     spacy project run train-model-experiment1
  #     ```
  #
  #     Explanation:
  #     - The command trains a text classification model using Prodigy.
  #     - It uses the 'prodigy3train' dataset for training and evaluates the model on the 'golden3' dataset.
  #     - The trained model is saved to the './output/experiment1' directory.
  #   script:
  #     - "python3 -m prodigy train --textcat-multilabel prodigy3train,eval:golden3 ./output/experiment1"

  # - name: "download-model"
  #   help: |
  #     Download the English language model 'en_core_web_lg' from spaCy.
  #
  #     Usage:
  #     ```
  #     spacy project run download-model
  #     ```
  #
  #     Explanation:
  #     - The command downloads the English language model 'en_core_web_lg' from spaCy.
  #     - This model is used as the base model for further data processing and training in the project.
  #   script:
  #     - "python3 -m spacy download en_core_web_lg"

  # - name: "convert-data-to-spacy-format"
  #   help: |
  #     Convert the annotated data from Prodigy to spaCy format using the 'prodigy3train' and 'golden3' datasets.
  #
  #     Usage:
  #     ```
  #     spacy project run convert-data-to-spacy-format
  #     ```
  #
  #     Explanation:
  #     - The command converts the annotated data from Prodigy to spaCy format.
  #     - It uses the 'prodigy3train' and 'golden3' datasets for conversion.
  #     - The converted data is saved to the './corpus' directory with the base model 'en_core_web_lg'.
  #   script:
  #     - "python3 -m prodigy data-to-spacy --textcat-multilabel prodigy3train,eval:golden3 ./corpus --base-model en_core_web_lg"

  # - name: "train-custom-model"
  #   help: |
  #     Train a custom text classification model using spaCy with the converted data in spaCy format.
  #
  #     Usage:
  #     ```
  #     spacy project run train-custom-model
  #     ```
  #
  #     Explanation:
  #     - The command trains a custom text classification model using spaCy.
  #     - It uses the converted data in spaCy format located in the './corpus' directory.
  #     - The model is trained using the configuration defined in 'corpus/config.cfg'.
  #   script:
  #     - "python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"