---
license: mit
language:
- en
pipeline_tag: text-classification
tags:
- pytorch
- mlflow
- ray
- fastapi
- nlp
---
## Scaling-ML
Scaling-ML is a project that classifies news headlines into 10 categories.
The core of the project is fine-tuning the [BERT](https://huggingface.co/allenai/scibert_scivocab_uncased)[1] model, supported by tools such as MLflow for experiment tracking, Ray for scaling and distributed computing, and MLOps components for seamless management of machine learning workflows.

### Set Up

1. Clone the repository:
```bash
git clone https://github.com/your-username/scaling-ml.git
cd scaling-ml
```
2. Set up a virtual environment and install the dependencies:
```bash
python3 -m venv .venv            # or your preferred environment manager
source .venv/bin/activate
export PYTHONPATH=$PYTHONPATH:$PWD
pip install -r requirements.txt
```
### Scripts Overview
```bash
scripts
β”œβ”€β”€ app.py
β”œβ”€β”€ config.py
β”œβ”€β”€ data.py
β”œβ”€β”€ evaluate.py
β”œβ”€β”€ model.py
β”œβ”€β”€ predict.py
β”œβ”€β”€ train.py
β”œβ”€β”€ tune.py
└── utils.py
```
- `app.py` - FastAPI web service for serving the model.
- `config.py` - Configuration of logging settings, directory structure, and the MLflow registry.
- `data.py` - Functions and a class for scalable data preprocessing.
- `evaluate.py` - Evaluates model performance, computing precision, recall, and F1 score.
- `model.py` - Fine-tunes the language model by adding a fully connected layer for classification.
- `predict.py` - `TorchPredictor` class for making predictions with a PyTorch-based model.
- `train.py` - Training process using Ray for distributed training.
- `tune.py` - Hyperparameter tuning for the language model using Ray Tune.
- `utils.py` - Utility functions for handling data, setting random seeds, saving and loading dictionaries, etc.
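To make the description of `model.py` concrete, here is a minimal, hypothetical sketch of the architecture it describes: an encoder output passed through dropout and a fully connected classification head. The class name, the 768-dimensional encoder output, and the default values are illustrative assumptions, not the project's actual code:

```python
import torch
import torch.nn as nn

class NewsClassifier(nn.Module):
    """Hypothetical sketch: dropout + fully connected head on top of a
    pretrained encoder's pooled output (the encoder itself is omitted)."""
    def __init__(self, encoder_dim=768, dropout_p=0.5, num_classes=10):
        super().__init__()
        self.dropout = nn.Dropout(dropout_p)
        self.fc = nn.Linear(encoder_dim, num_classes)

    def forward(self, pooled):
        # pooled: (batch, encoder_dim) output of the fine-tuned encoder
        return self.fc(self.dropout(pooled))

logits = NewsClassifier()(torch.randn(2, 768))  # shape: (2, 10)
```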
#### Dataset
For training, a small portion of the [News Category Dataset](https://www.kaggle.com/datasets/setseries/news-category-dataset) was used, which contains numerous headlines and descriptions of various articles.

### How to Train
```bash
export DATASET_LOC="path/to/dataset"
export TRAIN_LOOP_CONFIG='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/train.py \
--experiment_name "llm_train" \
--dataset_loc $DATASET_LOC \
--train_loop_config "$TRAIN_LOOP_CONFIG" \
--num_workers 1 \
--cpu_per_worker 1 \
--gpu_per_worker 0 \
--num_epochs 1 \
--batch_size 128 \
--results_fp results.json 
```
- experiment_name: A name for the experiment or run, in this case, "llm_train".
- dataset_loc: The location of the training dataset, replace with the actual path.
- train_loop_config: The configuration for the training loop, replace with the actual configuration.
- num_workers: The number of workers used for parallel processing. Adjust based on available CPU resources.
- cpu_per_worker: The number of CPU cores assigned to each worker. Adjust based on available CPU resources.
- gpu_per_worker: The number of GPUs assigned to each worker. Adjust based on available GPU resources.
- num_epochs: The number of training epochs.
- batch_size: The batch size used during training.
- results_fp: The file path to save the results.
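Since `--train_loop_config` is passed as a JSON string, the script presumably decodes it with `json.loads`; here is a stdlib sketch of that round trip (the actual parsing logic inside `train.py` is an assumption):

```python
import json
import os

# Mirror of the export above; fall back to the same literal when the
# environment variable is not set.
default = '{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
train_loop_config = json.loads(os.environ.get("TRAIN_LOOP_CONFIG", default))
# train_loop_config is now a plain dict, e.g. train_loop_config["lr"] == 0.0001
```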

### How to Tune
```bash
export DATASET_LOC="path/to/dataset"
export INITIAL_PARAMS='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}'
python3 scripts/tune.py \
--experiment_name "llm_tune" \
--dataset_loc "$DATASET_LOC" \
--initial_params "$INITIAL_PARAMS" \
--num_workers 1 \
--cpu_per_worker 1 \
--gpu_per_worker 0 \
--num_runs 1 \
--grace_period 1 \
--num_epochs 1 \
--batch_size 128 \
--results_fp results.json 
```
- num_runs: The number of tuning runs to perform.
- grace_period: The grace period for early stopping during hyperparameter tuning.

**Note**: adjust the values of the `--num_workers`, `--cpu_per_worker`, and `--gpu_per_worker` parameters above according to the resources available on your system.

### Experiment Tracking with MLflow
```bash
mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri /path/to/mlflow/folder
```
The tracking UI is then available at `http://localhost:8080`.

### Evaluation
```bash
export RUN_ID=YOUR_MLFLOW_EXPERIMENT_RUN_ID
python3 scripts/evaluate.py --run_id $RUN_ID --dataset_loc "path/to/dataset" --results_fp results.json
```
```json
{
  "timestamp": "January 22, 2024 09:57:12 AM",
  "precision": 0.9163323229539818,
  "recall": 0.9124083769633508,
  "f1": 0.9137224104301406,
  "num_samples": 1000.0
}
```
- run_id: ID of the specific MLflow run to load from.
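The weighted precision, recall, and F1 reported above can be reproduced in plain Python; the following is an illustrative per-class aggregation, not the actual `evaluate.py` implementation:

```python
from collections import Counter

def weighted_prf(y_true, y_pred):
    """Support-weighted precision/recall/F1 over the true classes."""
    support = Counter(y_true)
    n = len(y_true)
    precision = recall = f1 = 0.0
    for c in support:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        prec = tp / predicted if predicted else 0.0
        rec = tp / support[c]
        f = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        w = support[c] / n  # weight each class by its share of the samples
        precision += w * prec
        recall += w * rec
        f1 += w * f
    return precision, recall, f1

p, r, f = weighted_prf(["a", "a", "b", "b"], ["a", "b", "b", "b"])
```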
### Inference
```bash
python3 scripts/predict.py --run_id $RUN_ID --headline "Airport Guide: Chicago O'Hare" --keyword "destination"
```
```json
[
  {
    "prediction": "TRAVEL",
    "probabilities": {
      "BUSINESS": 0.0024151806719601154,
      "ENTERTAINMENT": 0.002721842611208558,
      "FOOD & DRINK": 0.001193400239571929,
      "PARENTING": 0.0015436559915542603,
      "POLITICS": 0.0012392215430736542,
      "SPORTS": 0.0020724297501146793,
      "STYLE & BEAUTY": 0.0018642042996361852,
      "TRAVEL": 0.9841892123222351,
      "WELLNESS": 0.0013303911546245217,
      "WORLD NEWS": 0.0014305398799479008
    }
  }
]
```
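The `prediction` field is simply the class with the highest probability; given a probabilities dictionary like the one above (values abbreviated by hand here), it can be recovered in one line:

```python
probabilities = {"TRAVEL": 0.9842, "STYLE & BEAUTY": 0.0019, "POLITICS": 0.0012}
prediction = max(probabilities, key=probabilities.get)  # class with highest score
```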
### Application
```bash
python3 scripts/app.py --run_id $RUN_ID --num_cpus 2
```
Now, we can send requests to our application:
```python
import json
import requests
headline = "Reboot Your Skin For Spring With These Facial Treatments"
keywords = "skin-facial-treatments"
json_data = json.dumps({"headline": headline, "keywords": keywords})
out = requests.post("http://127.0.0.1:8010/predict", data=json_data).json()
print(out["results"][0])
```
```json
{
  "prediction": "STYLE & BEAUTY",
  "probabilities": {
      "BUSINESS": 0.002265132963657379,
      "ENTERTAINMENT": 0.008689943701028824,
      "FOOD & DRINK": 0.0011296054581180215,
      "PARENTING": 0.002621663035824895,
      "POLITICS": 0.002141285454854369,
      "SPORTS": 0.0017548275645822287,
      "STYLE & BEAUTY": 0.9760453104972839,
      "TRAVEL": 0.0024237297475337982,
      "WELLNESS": 0.001382972695864737,
      "WORLD NEWS": 0.0015455639222636819
  }
}
```
### Testing the Code
To test the code for expected inputs and outputs:
```bash
python3 -m pytest tests/code --verbose --disable-warnings
```
To test the model behaviour:
```bash
python3 -m pytest --run-id $RUN_ID tests/model --verbose --disable-warnings
```

### Workload
To execute all stages of this project with a single command, the `workload.sh` script is provided; adjust the resource parameters (`cpu_nums`, `gpu_nums`, etc.) to suit your needs.
```bash
bash workload.sh
```

### Extras
Use the Makefile to format the scripts and clean the directories:
```bash
make style && make clean
```
To serve the documentation for functions and classes:
```bash
python3 -m mkdocs serve
```