|
--- |
|
license: mit |
|
language: |
|
- en |
|
pipeline_tag: text-classification |
|
tags: |
|
- pytorch |
|
- mlflow |
|
- ray |
|
- fastapi |
|
- nlp |
|
--- |
|
## Scaling-ML |
|
Scaling-ML is a project that classifies news headlines into 10 categories.

At its core, the project fine-tunes the [BERT](https://huggingface.co/allenai/scibert_scivocab_uncased) model and brings in tools like MLflow for experiment tracking, Ray for scaling and distributed computing, and MLOps components for seamless management of machine learning workflows.
|
|
|
### Set Up |
|
|
|
1. Clone the repository: |
|
```bash |
|
git clone https://github.com/your-username/scaling-ml.git |
|
cd scaling-ml |
|
``` |
|
2. Set up your virtual environment and install dependencies: |
|
```bash |
|
export PYTHONPATH=$PYTHONPATH:$PWD |
|
pip install -r requirements.txt |
|
``` |
|
### Scripts Overview |
|
```bash |
|
scripts
├── app.py
├── config.py
├── data.py
├── evaluate.py
├── model.py
├── predict.py
├── train.py
├── tune.py
└── utils.py
|
``` |
|
- `app.py` - FastAPI web service for serving the model.

- `config.py` - Configuration of logging settings, directory structures, and the MLflow registry.

- `data.py` - Functions and a class for data preprocessing tasks.

- `evaluate.py` - Evaluates model performance, computing precision, recall, and F1 score.

- `model.py` - Fine-tunes the language model by adding a fully connected classification head.

- `predict.py` - `TorchPredictor` class for making predictions with the PyTorch-based model.

- `train.py` - Training loop using Ray for distributed training.

- `tune.py` - Hyperparameter tuning for the language model using Ray Tune.

- `utils.py` - Utility functions for handling data, setting random seeds, saving and loading dictionaries, etc.
|
#### Dataset |
|
For training, a small portion of the [News Category Dataset](https://www.kaggle.com/datasets/setseries/news-category-dataset) was used, which contains numerous headlines and descriptions of various articles.
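A minimal sketch of what a slice of the dataset looks like, using only the standard library (the column names `headline`, `keywords`, and `category` are assumptions; check them against your copy of the dataset):

```python
import csv
import io

# Hypothetical sample in the assumed schema; the real file lives at DATASET_LOC.
sample = """headline,keywords,category
Airport Guide: Chicago O'Hare,destination,TRAVEL
Reboot Your Skin For Spring,skin-facial-treatments,STYLE & BEAUTY
"""

rows = list(csv.DictReader(io.StringIO(sample)))
labels = sorted({row["category"] for row in rows})
print(labels)  # ['STYLE & BEAUTY', 'TRAVEL']
```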
|
|
|
### How to Train |
|
```bash |
|
export DATASET_LOC="path/to/dataset" |
|
export TRAIN_LOOP_CONFIG='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}' |
|
python3 scripts/train.py \ |
|
--experiment_name "llm_train" \ |
|
    --dataset_loc "$DATASET_LOC" \
|
--train_loop_config "$TRAIN_LOOP_CONFIG" \ |
|
--num_workers 1 \ |
|
--cpu_per_worker 1 \ |
|
--gpu_per_worker 0 \ |
|
--num_epochs 1 \ |
|
--batch_size 128 \ |
|
--results_fp results.json |
|
``` |
|
- experiment_name: A name for the experiment or run, in this case "llm_train".

- dataset_loc: The location of the training dataset; replace with the actual path.

- train_loop_config: The configuration for the training loop; replace with the actual configuration.
|
- num_workers: The number of workers used for parallel processing. Adjust based on available CPU resources. |
|
- cpu_per_worker: The number of CPU cores assigned to each worker. Adjust based on available CPU resources. |
|
- gpu_per_worker: The number of GPUs assigned to each worker. Adjust based on available GPU resources. |
|
- num_epochs: The number of training epochs. |
|
- batch_size: The batch size used during training. |
|
- results_fp: The file path to save the results. |
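The `--train_loop_config` flag takes a JSON string (exported above as `TRAIN_LOOP_CONFIG`). Inside the training script it can be parsed into a plain dict, roughly like this:

```python
import json
import os

# Fall back to the example config from above when the env var is unset.
raw = os.environ.get(
    "TRAIN_LOOP_CONFIG",
    '{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}',
)
config = json.loads(raw)
print(config["lr"])  # 0.0001 (with the fallback config)
```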
|
|
|
### How to Tune |
|
```bash |
|
export DATASET_LOC="path/to/dataset" |
|
export INITIAL_PARAMS='{"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}' |
|
python3 scripts/tune.py \ |
|
--experiment_name "llm_tune" \ |
|
--dataset_loc "$DATASET_LOC" \ |
|
--initial_params "$INITIAL_PARAMS" \ |
|
--num_workers 1 \ |
|
--cpu_per_worker 1 \ |
|
--gpu_per_worker 0 \ |
|
--num_runs 1 \ |
|
--grace_period 1 \ |
|
--num_epochs 1 \ |
|
--batch_size 128 \ |
|
--results_fp results.json |
|
``` |
|
- num_runs: The number of tuning runs to perform. |
|
- grace_period: The grace period for early stopping during hyperparameter tuning. |
|
|
|
**Note**: modify the values of the `--num_workers`, `--cpu_per_worker`, and `--gpu_per_worker` input parameters above according to the resources available on your system.
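Ray Tune performs the actual search; purely as an illustration of what each tuning run samples, here is a stdlib sketch that draws candidate hyperparameters around the initial values (the ranges and perturbations are made up for this example):

```python
import random

random.seed(0)  # reproducible illustration

initial = {"dropout_p": 0.5, "lr": 1e-4, "lr_factor": 0.8, "lr_patience": 5}

def sample_candidate(base):
    """Draw one hypothetical candidate near the initial parameters."""
    return {
        "dropout_p": random.uniform(0.3, 0.7),
        "lr": base["lr"] * random.choice([0.5, 1.0, 2.0]),
        "lr_factor": base["lr_factor"],
        "lr_patience": random.randint(3, 8),
    }

candidates = [sample_candidate(initial) for _ in range(3)]  # cf. --num_runs
```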
|
|
|
### Experiment Tracking with MLflow |
|
```bash |
|
mlflow server -h 0.0.0.0 -p 8080 --backend-store-uri /path/to/mlflow/folder |
|
``` |
|
|
|
### Evaluation |
|
```bash |
|
export RUN_ID=YOUR_MLFLOW_EXPERIMENT_RUN_ID |
|
python3 scripts/evaluate.py --run_id $RUN_ID --dataset_loc "path/to/dataset" --results_fp results.json
|
``` |
|
```json |
|
{ |
|
"timestamp": "January 22, 2024 09:57:12 AM", |
|
"precision": 0.9163323229539818, |
|
"recall": 0.9124083769633508, |
|
"f1": 0.9137224104301406, |
|
"num_samples": 1000.0 |
|
} |
|
``` |
|
- run_id: ID of the specific MLflow run to load from. |
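The reported `f1` is typically an average of per-class F1 scores, so it need not exactly equal the harmonic mean of the aggregate precision and recall, but the harmonic mean makes a quick sanity check on numbers like those above:

```python
precision = 0.9163323229539818
recall = 0.9124083769633508

# Harmonic mean of precision and recall.
f1 = 2 * precision * recall / (precision + recall)
print(round(f1, 4))  # 0.9144, close to the reported 0.9137
```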
|
### Inference |
|
```bash
|
python3 scripts/predict.py --run_id $RUN_ID --headline "Airport Guide: Chicago O'Hare" --keyword "destination"
|
``` |
|
```json |
|
[ |
|
{ |
|
"prediction": "TRAVEL", |
|
"probabilities": { |
|
"BUSINESS": 0.0024151806719601154, |
|
"ENTERTAINMENT": 0.002721842611208558, |
|
"FOOD & DRINK": 0.001193400239571929, |
|
"PARENTING": 0.0015436559915542603, |
|
"POLITICS": 0.0012392215430736542, |
|
"SPORTS": 0.0020724297501146793, |
|
"STYLE & BEAUTY": 0.0018642042996361852, |
|
"TRAVEL": 0.9841892123222351, |
|
"WELLNESS": 0.0013303911546245217, |
|
"WORLD NEWS": 0.0014305398799479008 |
|
} |
|
} |
|
] |
|
``` |
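The `prediction` field is simply the class with the highest probability; given the `probabilities` dict from the response above, it can be recovered like this:

```python
probabilities = {
    "BUSINESS": 0.0024151806719601154,
    "ENTERTAINMENT": 0.002721842611208558,
    "FOOD & DRINK": 0.001193400239571929,
    "PARENTING": 0.0015436559915542603,
    "POLITICS": 0.0012392215430736542,
    "SPORTS": 0.0020724297501146793,
    "STYLE & BEAUTY": 0.0018642042996361852,
    "TRAVEL": 0.9841892123222351,
    "WELLNESS": 0.0013303911546245217,
    "WORLD NEWS": 0.0014305398799479008,
}

# argmax over the class probabilities
prediction = max(probabilities, key=probabilities.get)
print(prediction)  # TRAVEL
```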
|
### Application |
|
```bash |
|
python3 scripts/app.py --run_id $RUN_ID --num_cpus 2
|
``` |
|
Now, we can send requests to our application: |
|
```python |
|
import json |
|
import requests |
|
headline = "Reboot Your Skin For Spring With These Facial Treatments" |
|
keywords = "skin-facial-treatments" |
|
json_data = json.dumps({"headline": headline, "keywords": keywords}) |
|
out = requests.post("http://127.0.0.1:8010/predict", data=json_data).json() |
|
print(out["results"][0]) |
|
``` |
|
```json |
|
{ |
|
"prediction": "STYLE & BEAUTY", |
|
"probabilities": { |
|
"BUSINESS": 0.002265132963657379, |
|
"ENTERTAINMENT": 0.008689943701028824, |
|
"FOOD & DRINK": 0.0011296054581180215, |
|
"PARENTING": 0.002621663035824895, |
|
"POLITICS": 0.002141285454854369, |
|
"SPORTS": 0.0017548275645822287, |
|
"STYLE & BEAUTY": 0.9760453104972839, |
|
"TRAVEL": 0.0024237297475337982, |
|
"WELLNESS": 0.001382972695864737, |
|
"WORLD NEWS": 0.0015455639222636819 |
|
}
}
|
``` |
|
### Testing the Code |
|
How to test the code for asserted inputs and outputs:
|
```bash |
|
python3 -m pytest tests/code --verbose --disable-warnings |
|
``` |
|
How to test the model's behaviour:
|
```bash |
|
python3 -m pytest --run-id $RUN_ID tests/model --verbose --disable-warnings |
|
``` |
|
|
|
### Workload |
|
To execute all stages of this project with a single command, a `workload.sh` script is provided; adjust the resource parameters (`cpu_nums`, `gpu_nums`, etc.) to suit your system.
|
```bash |
|
bash workload.sh |
|
``` |
|
|
|
### Extras |
|
Use the Makefile to format scripts and clean directories:
|
```bash |
|
make style && make clean |
|
``` |
|
Serve the documentation for functions and classes:
|
```bash |
|
python3 -m mkdocs serve |
|
``` |