ManjinderUNCC committed on
Commit
c0441e3
1 Parent(s): f0c6f53

Upload project.yml

Files changed (1): project.yml (+294 -0)
project.yml ADDED
@@ -0,0 +1,294 @@
# Title and description of the project
title: "Citations of ECFR Banking Regulation in a spaCy pipeline."
description: "Custom text classification project for spaCy v3, adapted from the spaCy v3 demo projects."

vars:
  lang: "en"
  train: corpus/train.spacy
  dev: corpus/dev.spacy
  version: "0.1.0"
  gpu_id: -1
  vectors_model: "en_core_web_lg"
  name: ecfr_ner
  prodigy:
    ner_labels: ecfr_initial_ner
    ner_manual_labels: ecfr_manual_ner
    senter_labels: ecfr_labeled_sents
    ner_labeled_dataset: ecfr_labeled_ner
  assets:
    ner_labels: assets/ecfr_ner_labels.jsonl
    senter_labels: assets/ecfr_senter_labels.jsonl
    ner_patterns: assets/patterns.jsonl
    corpus_labels: corpus/labels
    data_files: data
    trained_model: my_trained_model
    trained_model_textcat: my_trained_model/textcat_multilabel
    output_models: output
    python_code: python_Code

directories: ["data", "python_Code"]

assets:
  - dest: "data/firstStep_file.jsonl"
    description: "JSONL file containing formatted data from the first step"
  - dest: "data/five_examples_annotated5.jsonl"
    description: "JSONL file containing five annotated examples"
  - dest: "data/goldenEval.jsonl"
    description: "JSONL file containing golden evaluation data"
  - dest: "data/thirdStep_file.jsonl"
    description: "JSONL file containing classified data from the third step"
  - dest: "data/train.jsonl"
    description: "JSONL file containing training data"
  - dest: "data/train200.jsonl"
    description: "JSONL file containing initial training data"
  - dest: "data/train4465.jsonl"
    description: "JSONL file containing formatted and labeled training data"
  - dest: "python_Code/finalStep-formatLabel.py"
    description: "Python script for formatting labeled data in the final step"
  - dest: "python_Code/firstStep-format.py"
    description: "Python script for formatting data in the first step"
  - dest: "python_Code/five_examples_annotated.ipynb"
    description: "Jupyter notebook containing five annotated examples"
  - dest: "python_Code/secondStep-score.py"
    description: "Python script for scoring data in the second step"
  - dest: "python_Code/thirdStep-label.py"
    description: "Python script for labeling data in the third step"
  - dest: "python_Code/train_eval_split.ipynb"
    description: "Jupyter notebook for splitting data into training and evaluation sets"
  - dest: "python_Code/evaluate_model.py"
    description: "Python script for evaluating the trained model"
  - dest: "README.md"
    description: "Markdown file containing project documentation"

workflows:
  train:
    - preprocess
    - train-text-classification-model
    - classify-unlabeled-data
    - format-labeled-data
    # - review-evaluation-data
    # - export-reviewed-evaluation-data
    # - import-training-data
    # - import-golden-evaluation-data
    # - train-model-experiment1
    # - convert-data-to-spacy-format
  evaluate:
    - evaluate-model

commands:
  - name: "preprocess"
    help: |
      Execute the Python script `firstStep-format.py`, which performs the initial formatting of a dataset file for the first step of the project. The script extracts text and labels from a dataset file in JSONL format and writes them to a new JSONL file in a specific format.

      Usage:
      ```
      spacy project run preprocess
      ```

      Explanation:
      - The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
      - It extracts text and labels from each JSON object in the dataset file.
      - If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and label.
      - If either the text or the label is missing from a JSON object, a warning message is printed.
      - Upon completion, the script prints a message confirming the processing and the path to the output file.
    script:
      - "python3 python_Code/firstStep-format.py"

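A minimal sketch of the formatting step described above (the `text` and `label` field names and the exact warning text are assumptions; the real implementation is in `python_Code/firstStep-format.py`):

```python
import json

def format_record(record):
    """Return a {"text", "label"} record, or None if text or labels are missing."""
    text = record.get("text")
    labels = record.get("label") or []
    if text and labels:
        return {"text": text, "label": labels}
    return None

def format_jsonl(input_path, output_path):
    """Read a JSONL dataset, keep well-formed records, and write them back out."""
    with open(input_path, encoding="utf-8") as infile, \
         open(output_path, "w", encoding="utf-8") as outfile:
        for line in infile:
            record = format_record(json.loads(line))
            if record is None:
                print("Warning: skipping record with missing text or label")
            else:
                outfile.write(json.dumps(record) + "\n")
    print(f"Processing complete. Output written to {output_path}")
```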
  - name: "train-text-classification-model"
    help: |
      Train the text classification model for the second step of the project using the `secondStep-score.py` script. The script loads a blank English spaCy model, adds a text classification pipeline to it, and trains the model on the processed data from the first step.

      Usage:
      ```
      spacy project run train-text-classification-model
      ```

      Explanation:
      - The script `secondStep-score.py` loads a blank English spaCy model and adds a text classification pipeline to it.
      - It reads processed data from the file specified in the `processed_data_file` variable (`data/firstStep_file.jsonl` by default).
      - The processed data is converted to spaCy format for training the model.
      - The model is trained on the converted data for a specified number of iterations (`n_iter`).
      - Losses are printed for each iteration during training.
      - Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
    script:
      - "python3 python_Code/secondStep-score.py"

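The conversion-to-spaCy-format step mentioned above might look roughly like this sketch (the label set and field names are hypothetical; the real logic lives in `python_Code/secondStep-score.py`):

```python
def to_textcat_example(record, all_labels):
    """Convert a {"text", "label"} record into spaCy's (text, {"cats": ...}) training format."""
    cats = {label: 0.0 for label in all_labels}
    for label in record["label"]:
        cats[label] = 1.0
    return record["text"], {"cats": cats}

# Hypothetical label set, for illustration only:
LABELS = ["CAPITAL", "LIQUIDITY", "REPORTING"]
text, annotations = to_textcat_example(
    {"text": "12 CFR 217.32 sets standardized risk weights.", "label": ["CAPITAL"]},
    LABELS,
)
print(annotations["cats"])  # each label mapped to 1.0 or 0.0
```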
  - name: "classify-unlabeled-data"
    help: |
      Classify the unlabeled data for the third step of the project using the `thirdStep-label.py` script. The script loads the trained spaCy model from the previous step and classifies each record in the unlabeled dataset.

      Usage:
      ```
      spacy project run classify-unlabeled-data
      ```

      Explanation:
      - The script `thirdStep-label.py` loads the trained spaCy model from the specified model directory (`./my_trained_model` by default).
      - It reads the unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/train.jsonl` by default).
      - Each record in the unlabeled data is classified using the loaded model.
      - The predicted labels for each record are extracted and stored along with the text.
      - The classified data is optionally saved to the file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
    script:
      - "python3 python_Code/thirdStep-label.py"

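A sketch of the classification loop described above. Here `classify` stands in for the trained model; with a loaded spaCy pipeline it would be roughly `lambda text: nlp(text).cats`:

```python
import json

def classify_records(jsonl_lines, classify):
    """Attach predicted category scores to each JSONL record."""
    results = []
    for line in jsonl_lines:
        record = json.loads(line)
        scores = classify(record["text"])  # e.g. nlp(text).cats with a loaded model
        results.append({"text": record["text"], "cats": scores})
    return results

# Stand-in classifier for illustration only:
fake_model = lambda text: {"CITATION": 0.9 if "CFR" in text else 0.1}
out = classify_records(['{"text": "See 12 CFR 3.10."}'], fake_model)
print(out[0]["cats"])  # {'CITATION': 0.9}
```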
  - name: "format-labeled-data"
    help: |
      Format the labeled data for the final step of the project using the `finalStep-formatLabel.py` script. The script processes the classified data from the third step and transforms it into a specific format, applying a threshold for label acceptance.

      Usage:
      ```
      spacy project run format-labeled-data
      ```

      Explanation:
      - The script `finalStep-formatLabel.py` reads classified data from the file specified in the `input_file` variable (`data/thirdStep_file.jsonl` by default).
      - For each record, it determines the accepted categories based on a specified threshold.
      - It constructs an output record containing the text, predicted labels, accepted categories, an answer (accept/reject), and options with meta information.
      - The transformed data is written to the file specified in the `output_file` variable (`data/train4465.jsonl` by default).
    script:
      - "python3 python_Code/finalStep-formatLabel.py"

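The per-record transformation might look like the sketch below (the 0.5 threshold and the exact output field names are assumptions inferred from the description above, not the actual script):

```python
def build_output_record(record, threshold=0.5):
    """Accept categories whose score meets the threshold and build the output record."""
    accepted = [cat for cat, score in record["cats"].items() if score >= threshold]
    return {
        "text": record["text"],
        "cats": record["cats"],            # predicted label scores
        "accept": accepted,                # categories above the threshold
        "answer": "accept" if accepted else "reject",
        "options": [{"id": cat, "meta": score} for cat, score in record["cats"].items()],
    }

record = {"text": "See 12 CFR 3.10.", "cats": {"CITATION": 0.92, "DEFINITION": 0.08}}
print(build_output_record(record)["answer"])  # accept
```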
  - name: "evaluate-model"
    help: |
      Evaluate the trained model using the evaluation data and print the metrics.

      Usage:
      ```
      spacy project run evaluate-model
      ```

      Explanation:
      - The script `evaluate_model.py` loads the trained model and evaluates it on the golden evaluation data.
      - It calculates evaluation metrics such as accuracy, precision, recall, and F1-score.
      - The metrics are printed to the console.
    script:
      - "python3 python_Code/evaluate_model.py"

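For reference, precision, recall, and F1 can be computed from true-positive, false-positive, and false-negative counts as in this sketch (the actual `evaluate_model.py` may compute them differently, e.g. per label):

```python
def precision_recall_f1(tp, fp, fn):
    """Compute precision, recall, and F1 from confusion counts, guarding against division by zero."""
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
    return precision, recall, f1

p, r, f = precision_recall_f1(tp=8, fp=2, fn=2)
print(f"precision={p:.2f} recall={r:.2f} f1={f:.2f}")  # precision=0.80 recall=0.80 f1=0.80
```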
  # - name: "review-evaluation-data"
  #   help: |
  #     Review the evaluation data in Prodigy and automatically accept annotations.
  #
  #     Usage:
  #     ```
  #     spacy project run review-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command reviews the evaluation data in Prodigy.
  #     - It automatically accepts annotations made during the review process.
  #     - Only sessions allowed by the environment variable PRODIGY_ALLOWED_SESSIONS are permitted to review data. In this case, the session 'reviewer' is allowed.
  #   script:
  #     - "PRODIGY_ALLOWED_SESSIONS=reviewer python3 -m prodigy review project3eval-review project3eval --auto-accept"

  # - name: "export-reviewed-evaluation-data"
  #   help: |
  #     Export the reviewed evaluation data from Prodigy to a JSONL file named 'goldenEval.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run export-reviewed-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command exports the reviewed evaluation data from Prodigy to a JSONL file.
  #     - The data is exported from the Prodigy dataset named 'project3eval-review'.
  #     - The exported data is saved to the file 'goldenEval.jsonl'.
  #     - This preserves the reviewed annotations for further analysis or processing.
  #   script:
  #     - "prodigy db-out project3eval-review > goldenEval.jsonl"

  # - name: "import-training-data"
  #   help: |
  #     Import the training data into Prodigy from a JSONL file named 'train200.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run import-training-data
  #     ```
  #
  #     Explanation:
  #     - The command imports the training data into Prodigy from the specified JSONL file.
  #     - The data is imported into the Prodigy dataset named 'prodigy3train'.
  #     - This prepares the training data for annotation and model training in Prodigy.
  #   script:
  #     - "prodigy db-in prodigy3train train200.jsonl"

  # - name: "import-golden-evaluation-data"
  #   help: |
  #     Import the golden evaluation data into Prodigy from a JSONL file named 'goldenEval.jsonl'.
  #
  #     Usage:
  #     ```
  #     spacy project run import-golden-evaluation-data
  #     ```
  #
  #     Explanation:
  #     - The command imports the golden evaluation data into Prodigy from the specified JSONL file.
  #     - The data is imported into the Prodigy dataset named 'golden3'.
  #     - This prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
  #   script:
  #     - "prodigy db-in golden3 goldenEval.jsonl"

  # - name: "train-model-experiment1"
  #   help: |
  #     Train a text classification model with Prodigy, training on the 'prodigy3train' dataset and evaluating on 'golden3'.
  #
  #     Usage:
  #     ```
  #     spacy project run train-model-experiment1
  #     ```
  #
  #     Explanation:
  #     - The command trains a text classification model using Prodigy.
  #     - It uses the 'prodigy3train' dataset for training and evaluates the model on the 'golden3' dataset.
  #     - The trained model is saved to the './output/experiment1' directory.
  #   script:
  #     - "python3 -m prodigy train --textcat-multilabel prodigy3train,eval:golden3 ./output/experiment1"

  # - name: "download-model"
  #   help: |
  #     Download the English language model 'en_core_web_lg' from spaCy.
  #
  #     Usage:
  #     ```
  #     spacy project run download-model
  #     ```
  #
  #     Explanation:
  #     - The command downloads the English language model 'en_core_web_lg' from spaCy.
  #     - This model serves as the base model for further data processing and training in the project.
  #   script:
  #     - "python3 -m spacy download en_core_web_lg"

  # - name: "convert-data-to-spacy-format"
  #   help: |
  #     Convert the annotated data from Prodigy to spaCy format using the 'prodigy3train' and 'golden3' datasets.
  #
  #     Usage:
  #     ```
  #     spacy project run convert-data-to-spacy-format
  #     ```
  #
  #     Explanation:
  #     - The command converts the annotated data from Prodigy to spaCy format.
  #     - It uses the 'prodigy3train' dataset for training data and 'golden3' for evaluation data.
  #     - The converted data is saved to the './corpus' directory, using 'en_core_web_lg' as the base model.
  #   script:
  #     - "python3 -m prodigy data-to-spacy --textcat-multilabel prodigy3train,eval:golden3 ./corpus --base-model en_core_web_lg"

  # - name: "train-custom-model"
  #   help: |
  #     Train a custom text classification model with spaCy using the converted data in spaCy format.
  #
  #     Usage:
  #     ```
  #     spacy project run train-custom-model
  #     ```
  #
  #     Explanation:
  #     - The command trains a custom text classification model using spaCy.
  #     - It uses the converted data in spaCy format located in the './corpus' directory.
  #     - The model is trained using the configuration defined in 'corpus/config.cfg'.
  #   script:
  #     - "python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"