Commit 037ab80
Parent(s): 090684c
Update project.yml (#3)
- Update project.yml (d045be30ee549d9cd2dbb9596122e69b34a375a7)
Co-authored-by: Manjinder <[email protected]>
project.yml CHANGED (+63 -94)
@@ -7,7 +7,7 @@ tags:
|
|
7 |
- machine learning
|
8 |
- natural language processing
|
9 |
- huggingface
|
10 |
-
|
11 |
vars:
|
12 |
lang: "en"
|
13 |
train: corpus/train.spacy
|
@@ -21,18 +21,21 @@ vars:
|
|
21 |
ner_manual_labels: ecfr_manual_ner
|
22 |
senter_labels: ecfr_labeled_sents
|
23 |
ner_labeled_dataset: ecfr_labeled_ner
|
24 |
-
assets:
|
25 |
-
ner_labels: assets/ecfr_ner_labels.jsonl
|
26 |
-
senter_labels: assets/ecfr_senter_labels.jsonl
|
27 |
-
ner_patterns: assets/patterns.jsonl
|
28 |
-
corpus_labels: corpus/labels
|
29 |
-
data_files: data
|
30 |
-
trained_model: my_trained_model
|
31 |
-
trained_model_textcat: my_trained_model/textcat_multilabel
|
32 |
-
output_models: output
|
33 |
-
python_code: python_Code
|
34 |
|
35 |
-
directories:
|
36 |
|
37 |
assets:
|
38 |
- dest: "corpus/labels/ner.json"
|
@@ -207,15 +210,11 @@ commands:
|
|
207 |
Explanation:
|
208 |
- The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
|
209 |
- It extracts text and labels from each JSON object in the dataset file.
|
210 |
-
- If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and
|
211 |
-
- If either text or label is missing in a JSON object, a warning message is printed.
|
212 |
-
- Upon completion, the script prints a message confirming the processing and the path to the output file.
|
213 |
-
script:
|
214 |
-
- "python3 python_Code/firstStep-format.py"
|
215 |
|
216 |
- name: "train-text-classification-model"
|
217 |
help: |
|
218 |
-
Train
|
219 |
|
220 |
Usage:
|
221 |
```
|
@@ -223,18 +222,13 @@ commands:
|
|
223 |
```
|
224 |
|
225 |
Explanation:
|
226 |
-
-
|
227 |
-
-
|
228 |
-
- The
|
229 |
-
- The model is trained using the converted data for a specified number of iterations (`n_iter`).
|
230 |
-
- Losses are printed for each iteration during training.
|
231 |
-
- Upon completion, the trained model is saved to the specified output directory (`./my_trained_model` by default).
|
232 |
-
script:
|
233 |
-
- "python3 python_Code/secondStep-score.py"
|
234 |
|
235 |
- name: "classify-unlabeled-data"
|
236 |
help: |
|
237 |
-
Classify
|
238 |
|
239 |
Usage:
|
240 |
```
|
@@ -242,17 +236,13 @@ commands:
|
|
242 |
```
|
243 |
|
244 |
Explanation:
|
245 |
-
-
|
246 |
-
- It
|
247 |
-
-
|
248 |
-
- The predicted labels for each record are extracted and stored along with the text.
|
249 |
-
- The classified data is optionally saved to a file specified in the `output_file` variable (`data/thirdStep_file.jsonl` by default).
|
250 |
-
script:
|
251 |
-
- "python3 python_Code/thirdStep-label.py"
|
252 |
|
253 |
- name: "format-labeled-data"
|
254 |
help: |
|
255 |
-
|
256 |
|
257 |
Usage:
|
258 |
```
|
@@ -260,23 +250,25 @@ commands:
|
|
260 |
```
|
261 |
|
262 |
Explanation:
|
263 |
-
- The script `finalStep-formatLabel.py` reads
|
264 |
-
-
|
265 |
-
-
|
266 |
-
|
267 |
-
script:
|
268 |
-
- "python3 python_Code/finalStep-formatLabel.py"
|
269 |
-
|
270 |
- name: "setup-environment"
|
271 |
help: |
|
272 |
-
Set up the Python
|
273 |
-
script:
|
274 |
-
- "python3 -m virtualenv venv"
|
275 |
-
- "source venv/bin/activate"
|
276 |
|
277 |
- name: "review-evaluation-data"
|
278 |
help: |
|
279 |
-
Review the evaluation data
|
280 |
|
281 |
Usage:
|
282 |
```
|
@@ -284,15 +276,13 @@ commands:
|
|
284 |
```
|
285 |
|
286 |
Explanation:
|
287 |
-
-
|
288 |
-
-
|
289 |
-
-
|
290 |
-
script:
|
291 |
-
- "PRODIGY_ALLOWED_SESSIONS=reviwer python3 -m prodigy review project3eval-review project3eval --auto-accept"
|
292 |
|
293 |
- name: "export-reviewed-evaluation-data"
|
294 |
help: |
|
295 |
-
Export the reviewed evaluation data from Prodigy
|
296 |
|
297 |
Usage:
|
298 |
```
|
@@ -300,16 +290,12 @@ commands:
|
|
300 |
```
|
301 |
|
302 |
Explanation:
|
303 |
-
-
|
304 |
-
-
|
305 |
-
- The exported data is saved to the file 'goldenEval.jsonl'.
|
306 |
-
- This command helps in preserving the reviewed annotations for further analysis or processing.
|
307 |
-
script:
|
308 |
-
- "prodigy db-out project3eval-review > goldenEval.jsonl"
|
309 |
|
310 |
- name: "import-training-data"
|
311 |
help: |
|
312 |
-
Import
|
313 |
|
314 |
Usage:
|
315 |
```
|
@@ -317,15 +303,11 @@ commands:
|
|
317 |
```
|
318 |
|
319 |
Explanation:
|
320 |
-
-
|
321 |
-
- The data is imported into the Prodigy database associated with the project named 'prodigy3train'.
|
322 |
-
- This command prepares the training data for annotation and model training in Prodigy.
|
323 |
-
script:
|
324 |
-
- "prodigy db-in prodigy3train train200.jsonl"
|
325 |
|
326 |
- name: "import-golden-evaluation-data"
|
327 |
help: |
|
328 |
-
Import
|
329 |
|
330 |
Usage:
|
331 |
```
|
@@ -333,15 +315,11 @@ commands:
|
|
333 |
```
|
334 |
|
335 |
Explanation:
|
336 |
-
-
|
337 |
-
- The data is imported into the Prodigy database associated with the project named 'golden3'.
|
338 |
-
- This command prepares the golden evaluation data for further analysis and model evaluation in Prodigy.
|
339 |
-
script:
|
340 |
-
- "prodigy db-in golden3 goldeneval.jsonl"
|
341 |
|
342 |
- name: "train-model-experiment1"
|
343 |
help: |
|
344 |
-
Train a text classification model
|
345 |
|
346 |
Usage:
|
347 |
```
|
@@ -349,15 +327,13 @@ commands:
|
|
349 |
```
|
350 |
|
351 |
Explanation:
|
352 |
-
-
|
353 |
-
-
|
354 |
-
- The trained
|
355 |
-
script:
|
356 |
-
- "python3 -m prodigy train --textcat-multilabel prodigy3train,eval:golden3 ./output/experiment1"
|
357 |
|
358 |
- name: "download-model"
|
359 |
help: |
|
360 |
-
Download
|
361 |
|
362 |
Usage:
|
363 |
```
|
@@ -365,14 +341,12 @@ commands:
|
|
365 |
```
|
366 |
|
367 |
Explanation:
|
368 |
-
-
|
369 |
-
-
|
370 |
-
script:
|
371 |
-
- "python3 -m spacy download en_core_web_lg"
|
372 |
|
373 |
- name: "convert-data-to-spacy-format"
|
374 |
help: |
|
375 |
-
Convert
|
376 |
|
377 |
Usage:
|
378 |
```
|
@@ -380,15 +354,12 @@ commands:
|
|
380 |
```
|
381 |
|
382 |
Explanation:
|
383 |
-
-
|
384 |
-
- It
|
385 |
-
- The converted data is saved to the './corpus' directory with the base model 'en_core_web_lg'.
|
386 |
-
script:
|
387 |
-
- "python3 -m prodigy data-to-spacy --textcat-multilabel prodigy3train,eval:golden3 ./corpus --base-model en_core_web_lg"
|
388 |
|
389 |
- name: "train-custom-model"
|
390 |
help: |
|
391 |
-
Train a custom
|
392 |
|
393 |
Usage:
|
394 |
```
|
@@ -396,8 +367,6 @@ commands:
|
|
396 |
```
|
397 |
|
398 |
Explanation:
|
399 |
-
-
|
400 |
-
-
|
401 |
-
- The model is
|
402 |
-
script:
|
403 |
-
- "python -m spacy train corpus/config.cfg --paths.train corpus/train.spacy --paths.dev corpus/dev.spacy"
|
|
|
7 |
- machine learning
|
8 |
- natural language processing
|
9 |
- huggingface
|
10 |
+
|
11 |
vars:
|
12 |
lang: "en"
|
13 |
train: corpus/train.spacy
|
|
|
21 |
ner_manual_labels: ecfr_manual_ner
|
22 |
senter_labels: ecfr_labeled_sents
|
23 |
ner_labeled_dataset: ecfr_labeled_ner
|
24 |
|
25 |
+
directories:
|
26 |
+
- corpus/labels
|
27 |
+
- data
|
28 |
+
- my_trained_model/textcat_multilabel
|
29 |
+
- my_trained_model/vocab
|
30 |
+
- output/experiment1/model-best/textcat_multilabel
|
31 |
+
- output/experiment1/model-best/vocab
|
32 |
+
- output/experiment1/model-last/textcat_multilabel
|
33 |
+
- output/experiment1/model-last/vocab
|
34 |
+
- output/experiment3/model-best/textcat_multilabel
|
35 |
+
- output/experiment3/model-best/vocab
|
36 |
+
- output/experiment3/model-last/textcat_multilabel
|
37 |
+
- output/experiment3/model-last/vocab
|
38 |
+
- python_Code
|
39 |
|
40 |
assets:
|
41 |
- dest: "corpus/labels/ner.json"
|
|
|
210 |
Explanation:
|
211 |
- The script `firstStep-format.py` reads data from the file specified in the `dataset_file` variable (`data/train200.jsonl` by default).
|
212 |
- It extracts text and labels from each JSON object in the dataset file.
|
213 |
+
- If both text and at least one label are available, it writes a new JSON object to the output file specified in the `output_file` variable (`data/firstStep_file.jsonl` by default) with the extracted text and labels.
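The filtering step described for `firstStep-format.py` can be sketched roughly as follows. This is not the project's script: the function name `format_records` and the JSON keys `text` and `labels` are assumptions made for illustration, since the diff does not show the actual field names.

```python
import json
from pathlib import Path


def format_records(dataset_file, output_file):
    """Copy records that have both text and at least one label.

    Returns the number of records written. The keys "text" and "labels"
    are assumed here; the real script may use different field names.
    """
    written = 0
    with Path(dataset_file).open(encoding="utf-8") as src, \
            Path(output_file).open("w", encoding="utf-8") as dst:
        for line in src:
            line = line.strip()
            if not line:
                continue  # skip blank lines between JSON objects
            record = json.loads(line)
            text = record.get("text")
            labels = record.get("labels")
            # Keep the record only if text is non-empty and labels is a non-empty list.
            if text and labels:
                dst.write(json.dumps({"text": text, "labels": labels}) + "\n")
                written += 1
    return written
```

With the defaults named in the help text, a call might look like `format_records("data/train200.jsonl", "data/firstStep_file.jsonl")`.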
|
214 |
|
215 |
- name: "train-text-classification-model"
|
216 |
help: |
|
217 |
+
Train a text classification model using spaCy.
|
218 |
|
219 |
Usage:
|
220 |
```
|
|
|
222 |
```
|
223 |
|
224 |
Explanation:
|
225 |
+
- This command trains a text classification model using the spaCy library based on the configuration provided in the `textcat_multilabel.cfg` file.
|
226 |
+
- The model is trained on the data specified in the `train` and `dev` variables (`corpus/train.spacy` and `corpus/dev.spacy` by default).
|
227 |
+
- The trained model is saved to the directory specified in the `output_model_dir` variable (`my_trained_model/textcat_multilabel/model` by default).
|
228 |
|
229 |
- name: "classify-unlabeled-data"
|
230 |
help: |
|
231 |
+
Classify unlabeled data using a trained text classification model.
|
232 |
|
233 |
Usage:
|
234 |
```
|
|
|
236 |
```
|
237 |
|
238 |
Explanation:
|
239 |
+
- This command loads the trained text classification model from the directory specified in the `model_dir` variable (`my_trained_model/textcat_multilabel/model` by default).
|
240 |
+
- It classifies unlabeled data from the file specified in the `unlabeled_data_file` variable (`data/thirdStep_file.jsonl` by default).
|
241 |
+
- The classified data is saved to the file specified in the `classified_data_file` variable (`data/classified_data.jsonl` by default).
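One possible shape for the classification step is sketched below. The helper names `classify_records` and `write_jsonl`, the output key `labels`, and the 0.5 threshold are all assumptions, not taken from the project. The predictor is passed in as a callable so the sketch runs without a trained model; with spaCy it could be `lambda text: nlp(text).cats` after `nlp = spacy.load(model_dir)`, since `doc.cats` maps each label to a score.

```python
import json


def classify_records(records, predict, threshold=0.5):
    """Attach predicted labels to each record.

    `predict` is any callable mapping a text to {label: score}.
    Labels whose score reaches `threshold` (an assumed cutoff) are kept.
    """
    classified = []
    for record in records:
        scores = predict(record["text"])
        labels = sorted(label for label, score in scores.items() if score >= threshold)
        classified.append({"text": record["text"], "labels": labels})
    return classified


def write_jsonl(path, rows):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for row in rows:
            handle.write(json.dumps(row) + "\n")
```

Keeping the predictor separate from the file handling makes the thresholding logic easy to check in isolation before pointing it at a real model.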
|
242 |
|
243 |
- name: "format-labeled-data"
|
244 |
help: |
|
245 |
+
Execute the Python script `finalStep-formatLabel.py`, which performs the final formatting of labeled data for the last step of the project. This script converts labeled data from the JSONL format used by Prodigy to the JSONL format used by spaCy.
|
246 |
|
247 |
Usage:
|
248 |
```
|
|
|
250 |
```
|
251 |
|
252 |
Explanation:
|
253 |
+
- The script `finalStep-formatLabel.py` reads labeled data from the file specified in the `labeled_data_file` variable (`data/thirdStep_file.jsonl` by default).
|
254 |
+
- It converts the labeled data from Prodigy's JSONL format to spaCy's JSONL format.
|
255 |
+
- The converted data is saved to the file specified in the `formatted_data_file` variable (`data/fourthStep_file.jsonl` by default).
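The conversion described for `finalStep-formatLabel.py` might look something like the sketch below. It assumes the labeled data uses Prodigy's choice-style task shape (`options` with an `id` each, an `accept` list, and an `answer` field) and that the target is a `text` plus a `cats` dictionary, which is the structure spaCy's text classifiers work with. Whether the project's files follow exactly this shape is an assumption, and `prodigy_choice_to_cats` is a hypothetical name.

```python
def prodigy_choice_to_cats(task):
    """Convert one accepted choice task to {"text", "cats"}.

    Returns None for tasks that were rejected or ignored.
    """
    if task.get("answer") != "accept":
        return None
    accepted = set(task.get("accept", []))
    option_ids = [option["id"] for option in task.get("options", [])]
    # Every offered label gets an explicit score: 1.0 if selected, else 0.0.
    cats = {label: (1.0 if label in accepted else 0.0) for label in option_ids}
    return {"text": task["text"], "cats": cats}
```

Emitting an explicit 0.0 for unselected labels matters for multilabel training, where a missing label and a negative label are not the same thing.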
|
256 |
+
|
257 |
- name: "setup-environment"
|
258 |
help: |
|
259 |
+
Set up the Python environment for the project using pip and the provided requirements.txt file.
|
260 |
|
261 |
+
Usage:
|
262 |
+
```
|
263 |
+
spacy project run setup-environment
|
264 |
+
```
|
265 |
+
|
266 |
+
Explanation:
|
267 |
+
- This command installs the required Python packages listed in the `requirements.txt` file using pip.
|
268 |
+
|
269 |
- name: "review-evaluation-data"
|
270 |
help: |
|
271 |
+
Review the evaluation data using Prodigy.
|
272 |
|
273 |
Usage:
|
274 |
```
|
|
|
276 |
```
|
277 |
|
278 |
Explanation:
|
279 |
+
- This command launches Prodigy to review the evaluation data.
|
280 |
+
- Prodigy loads the evaluation data from the file specified in the `eval_data_file` variable (`data/eval.jsonl` by default).
|
281 |
+
- You can review the data and annotate it as needed using Prodigy's user interface.
|
282 |
|
283 |
- name: "export-reviewed-evaluation-data"
|
284 |
help: |
|
285 |
+
Export the reviewed evaluation data from Prodigy.
|
286 |
|
287 |
Usage:
|
288 |
```
|
|
|
290 |
```
|
291 |
|
292 |
Explanation:
|
293 |
+
- This command exports the reviewed evaluation data from Prodigy to a JSONL file.
|
294 |
+
- Prodigy exports the reviewed data to the file specified in the `exported_eval_data_file` variable (`data/goldenEval.jsonl` by default).
|
295 |
|
296 |
- name: "import-training-data"
|
297 |
help: |
|
298 |
+
Import training data into Prodigy.
|
299 |
|
300 |
Usage:
|
301 |
```
|
|
|
303 |
```
|
304 |
|
305 |
Explanation:
|
306 |
+
- This command imports training data into Prodigy from the file specified in the `training_data_file` variable (`data/fourthStep_file.jsonl` by default).
|
307 |
|
308 |
- name: "import-golden-evaluation-data"
|
309 |
help: |
|
310 |
+
Import golden evaluation data into Prodigy.
|
311 |
|
312 |
Usage:
|
313 |
```
|
|
|
315 |
```
|
316 |
|
317 |
Explanation:
|
318 |
+
- This command imports golden evaluation data into Prodigy from the file specified in the `golden_evaluation_data_file` variable (`data/goldenEval.jsonl` by default).
|
319 |
|
320 |
- name: "train-model-experiment1"
|
321 |
help: |
|
322 |
+
Train a text classification model with different configurations for experiment 1.
|
323 |
|
324 |
Usage:
|
325 |
```
|
|
|
327 |
```
|
328 |
|
329 |
Explanation:
|
330 |
+
- This command trains a text classification model using different configurations specified in the `experiment1_configs` list in the `config.cfg` file.
|
331 |
+
- The model is trained on the data specified in the `train` and `dev` variables (`corpus/train.spacy` and `corpus/dev.spacy` by default).
|
332 |
+
- The trained models are saved to the directories specified in the `output_model_dir` variable (`output/experiment1/model-last/textcat_multilabel/model` and `output/experiment1/model-best/textcat_multilabel/model` by default).
|
333 |
|
334 |
- name: "download-model"
|
335 |
help: |
|
336 |
+
Download a trained text classification model.
|
337 |
|
338 |
Usage:
|
339 |
```
|
|
|
341 |
```
|
342 |
|
343 |
Explanation:
|
344 |
+
- This command downloads a trained text classification model from the URL specified in the `model_url` variable (`https://example.com/model.tar.gz` by default).
|
345 |
+
- The downloaded model is saved to the directory specified in the `output_model_dir` variable (`models` by default).
|
346 |
|
347 |
- name: "convert-data-to-spacy-format"
|
348 |
help: |
|
349 |
+
Convert data to spaCy's JSONL format.
|
350 |
|
351 |
Usage:
|
352 |
```
|
|
|
354 |
```
|
355 |
|
356 |
Explanation:
|
357 |
+
- This command converts data from Prodigy's JSONL format to spaCy's JSONL format.
|
358 |
+
- It reads data from the file specified in the `prodigy_data_file` variable (`data/ner_dataset.jsonl` by default) and writes the converted data to the file specified in the `spacy_data_file` variable (`data/ner_dataset_spacy.jsonl` by default).
|
359 |
|
360 |
- name: "train-custom-model"
|
361 |
help: |
|
362 |
+
Train a custom NER model using spaCy.
|
363 |
|
364 |
Usage:
|
365 |
```
|
|
|
367 |
```
|
368 |
|
369 |
Explanation:
|
370 |
+
- This command trains a custom NER model using spaCy based on the configuration provided in the `config.cfg` file.
|
371 |
+
- The model is trained on the data specified in the `train` and `dev` variables (`corpus/train.spacy` and `corpus/dev.spacy` by default).
|
372 |
+
- The trained model is saved to the directory specified in the `output_model_dir` variable (`my_trained_model` by default).