transiteration committed
Commit 2f10f14 · verified · 1 Parent(s): 139ea24

Update README.md

Files changed (1): README.md (+195, -0)
README.md CHANGED

---
license: mit
language:
- en
pipeline_tag: token-classification
tags:
- pytorch
- mlflow
- ray
- fastapi
- nlp
---
## Scaling-ML
Scaling-ML is a project that classifies news headlines into 10 categories.
The core of the project is fine-tuning the [BERT](https://huggingface.co/allenai/scibert_scivocab_uncased)[1] model, combined with tools such as MLflow for experiment tracking, Ray for scaling and distributed computing, and MLOps components for seamless management of machine learning workflows.

### Set Up

1. Clone the repository:
```bash
git clone https://github.com/your-username/scaling-ml.git
cd scaling-ml
```
2. Create a virtual environment and install the dependencies:
```bash
python3 -m venv venv && source venv/bin/activate
export PYTHONPATH=$PYTHONPATH:$PWD
pip install -r requirements.txt
```
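To confirm the environment is ready, a quick import check can help (a minimal sketch; it assumes the packages implied by the tags above are pinned in `requirements.txt`):
```python
# Sanity check: the core dependencies should import cleanly.
import fastapi
import mlflow
import ray
import torch

print(torch.__version__, ray.__version__, mlflow.__version__, fastapi.__version__)
```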
### Scripts Overview
```bash
scripts
├── app.py
├── config.py
├── data.py
├── evaluate.py
├── model.py
├── predict.py
├── train.py
├── tune.py
└── utils.py
```
- `app.py` - Implementation of a FastAPI web service for serving the model.
- `config.py` - Configuration of logging settings, directory structures, and the MLflow registry.
- `data.py` - Functions and a class for data preprocessing tasks in a scalable machine learning project.
- `evaluate.py` - Evaluates the performance of a model, calculating precision, recall, and F1 score.
- `model.py` - Fine-tunes the language model by adding a fully connected layer for classification tasks.
- `predict.py` - TorchPredictor class for making predictions using a PyTorch-based model.
- `train.py` - Training process using Ray for distributed training.
- `tune.py` - Hyperparameter tuning for the language model using Ray Tune.
- `utils.py` - Various utility functions for handling data, setting random seeds, saving and loading dictionaries, etc.
#### Dataset
For training, a small portion of the [News Category Dataset](https://www.kaggle.com/datasets/setseries/news-category-dataset) was used, which contains numerous headlines and descriptions of various articles.

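Before training, it is worth sanity-checking the data (a minimal sketch; the CSV format and the `headline`, `keywords`, and `category` column names are assumptions based on the inputs used elsewhere in this README):
```python
import pandas as pd

# Load the dataset; the path and column names here are assumptions.
df = pd.read_csv("path/to/dataset")
print(df.head())
print(df["category"].value_counts())  # should show the 10 target categories
```
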
### How to Train
```bash
export DATASET_LOC="path/to/dataset"
export TRAIN_LOOP_CONFIG='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/train.py \
    --experiment_name "llm_train" \
    --dataset_loc $DATASET_LOC \
    --train_loop_config "$TRAIN_LOOP_CONFIG" \
    --num_workers 1 \
    --cpu_per_worker 1 \
    --gpu_per_worker 0 \
    --num_epochs 1 \
    --batch_size 128 \
    --results_fp results.json
```
- experiment_name: A name for the experiment or run, in this case, "llm_train".
- dataset_loc: The location of the training dataset; replace with the actual path.
- train_loop_config: The configuration for the training loop; replace with the actual configuration.
- num_workers: The number of workers used for parallel processing. Adjust based on available CPU resources.
- cpu_per_worker: The number of CPU cores assigned to each worker. Adjust based on available CPU resources.
- gpu_per_worker: The number of GPUs assigned to each worker. Adjust based on available GPU resources.
- num_epochs: The number of training epochs.
- batch_size: The batch size used during training.
- results_fp: The file path to save the results (see the sketch after this list).

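Once training finishes, the metrics written via `--results_fp` are plain JSON and can be inspected directly (a minimal sketch; the exact keys depend on what `train.py` writes):
```python
import json

# Load and pretty-print the metrics saved via --results_fp.
with open("results.json") as f:
    results = json.load(f)
print(json.dumps(results, indent=2))
```
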
79
+ ### How to Tune
80
+ ```bash
81
+ export DATASET_LOC="path/to/dataset"
82
+ export INITIAL_PARAMS='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
83
+ python3 scripts/tune.py \
84
+ --experiment_name "llm_tune" \
85
+ --dataset_loc "$DATASET_LOC" \
86
+ --initial_params "$INITIAL_PARAMS" \
87
+ --num_workers 1 \
88
+ --cpu_per_worker 1 \
89
+ --gpu_per_worker 0 \
90
+ --num_runs 1 \
91
+ --grace_period 1 \
92
+ --num_epochs 1 \
93
+ --batch_size 128 \
94
+ --results_fp results.json
95
+ ```
- num_runs: The number of tuning runs to perform.
- grace_period: The grace period for early stopping during hyperparameter tuning.

**Note**: Modify the values of the `--num_workers`, `--cpu_per_worker`, and `--gpu_per_worker` input parameters above according to the resources available on your system.

### Experiment Tracking with MLflow
Start a local MLflow tracking server (the UI will then be available on port 8080):
```bash
mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri /path/to/mlflow/folder
```
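Runs can then be read back from this server; a minimal sketch using the MLflow 2.x client API (the tracking URI below assumes the server command above):
```python
import mlflow

# Point the client at the tracking server started above.
mlflow.set_tracking_uri("http://127.0.0.1:8080")

# List the experiments recorded so far (e.g., "llm_train", "llm_tune").
for exp in mlflow.search_experiments():
    print(exp.experiment_id, exp.name)
```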

### Evaluation
```bash
export RUN_ID=YOUR_MLFLOW_EXPERIMENT_RUN_ID
python3 scripts/evaluate.py --run_id $RUN_ID --dataset_loc "path/to/dataset" --results_fp results.json
```
```json
{
  "timestamp": "January 22, 2024 09:57:12 AM",
  "precision": 0.9163323229539818,
  "recall": 0.9124083769633508,
  "f1": 0.9137224104301406,
  "num_samples": 1000.0
}
```
- run_id: ID of the specific MLflow run to load from; it can be retrieved programmatically as sketched below.
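A sketch for looking up the run ID with the MLflow API (the experiment name "llm_train" matches the training command above):
```python
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:8080")

# Fetch the most recent run of the training experiment as a DataFrame.
runs = mlflow.search_runs(experiment_names=["llm_train"], order_by=["start_time DESC"])
print(runs.iloc[0]["run_id"])  # use this value as $RUN_ID
```
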
### Inference
```bash
python3 scripts/predict.py --run_id $RUN_ID --headline "Airport Guide: Chicago O'Hare" --keyword "destination"
```
```json
[
  {
    "prediction": "TRAVEL",
    "probabilities": {
      "BUSINESS": 0.0024151806719601154,
      "ENTERTAINMENT": 0.002721842611208558,
      "FOOD & DRINK": 0.001193400239571929,
      "PARENTING": 0.0015436559915542603,
      "POLITICS": 0.0012392215430736542,
      "SPORTS": 0.0020724297501146793,
      "STYLE & BEAUTY": 0.0018642042996361852,
      "TRAVEL": 0.9841892123222351,
      "WELLNESS": 0.0013303911546245217,
      "WORLD NEWS": 0.0014305398799479008
    }
  }
]
```
### Application
```bash
python3 scripts/app.py --run_id $RUN_ID --num_cpus 2
```
Now we can send requests to our application:
```python
import json
import requests

headline = "Reboot Your Skin For Spring With These Facial Treatments"
keywords = "skin-facial-treatments"
json_data = json.dumps({"headline": headline, "keywords": keywords})
out = requests.post("http://127.0.0.1:8010/predict", data=json_data).json()
print(out["results"][0])
```
```json
{
  "prediction": "STYLE & BEAUTY",
  "probabilities": {
    "BUSINESS": 0.002265132963657379,
    "ENTERTAINMENT": 0.008689943701028824,
    "FOOD & DRINK": 0.0011296054581180215,
    "PARENTING": 0.002621663035824895,
    "POLITICS": 0.002141285454854369,
    "SPORTS": 0.0017548275645822287,
    "STYLE & BEAUTY": 0.9760453104972839,
    "TRAVEL": 0.0024237297475337982,
    "WELLNESS": 0.001382972695864737,
    "WORLD NEWS": 0.0015455639222636819
  }
}
```
### Testing the Code
To test the code for asserted inputs and outputs:
```bash
python3 -m pytest tests/code --verbose --disable-warnings
```
To test the model's behaviour:
```bash
python3 -m pytest --run-id $RUN_ID tests/model --verbose --disable-warnings
```
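For context, a minimal illustrative sketch of a behavioural check in this spirit, reusing the served endpoint from the Application section above (hypothetical; not the repository's actual test code):
```python
import json

import requests


def test_travel_headline_prediction():
    # Hypothetical behavioural check; assumes app.py is serving on port 8010,
    # as in the Application section above.
    payload = json.dumps(
        {"headline": "Airport Guide: Chicago O'Hare", "keywords": "destination"}
    )
    out = requests.post("http://127.0.0.1:8010/predict", data=payload).json()
    assert out["results"][0]["prediction"] == "TRAVEL"
```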

### Workload
To execute all stages of this project with a single command, the `workload.sh` script is provided; change the resource parameters (cpu_nums, gpu_nums, etc.) to suit your needs.
```bash
bash workload.sh
```

### Extras
A Makefile is provided to format scripts and clean directories:
```bash
make style && make clean
```
To serve the documentation for functions and classes:
```bash
python3 -m mkdocs serve
```