# %% [markdown]
# ## 4. Pre-train BERT on processed dataset

# %%
import os

# hyperparameters for the pre-training run
hyperparameters = {
    "model_config_id": "bert-base-uncased",
    "dataset_id": "chaoyan/processed_bert_dataset",
    "tokenizer_id": "cat_tokenizer",
    "repository_id": "bert-base-uncased-cat",
    "max_steps": 100_000,
    "per_device_train_batch_size": 16,
    "learning_rate": 5e-5,
}

# flatten the hyperparameters into CLI flags and build the launch command
hyperparameters_string = " ".join(f"--{key} {value}" for key, value in hyperparameters.items())
cmd_str = f"python3 run_mlm_local.py {hyperparameters_string}"
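# Optional sanity check (not part of the original script): print the assembled
# command before launching it. With the hyperparameters above it expands to
#   python3 run_mlm_local.py --model_config_id bert-base-uncased \
#     --dataset_id chaoyan/processed_bert_dataset --tokenizer_id cat_tokenizer \
#     --repository_id bert-base-uncased-cat --max_steps 100000 \
#     --per_device_train_batch_size 16 --learning_rate 5e-05
print(cmd_str)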
os.system(cmd_str)
# %% [markdown]
#
# _This [experiment](https://huggingface.co/philschmid/bert-base-uncased-2022-habana-test-6) ran for 60k steps_
#
# In our `hyperparameters` we defined a `max_steps` property, which limits the pre-training to `100_000` steps. With a global batch size of `256`, those `100_000` steps took around 12.5 hours.
#
# BERT was originally pre-trained on [1 million steps](https://arxiv.org/pdf/1810.04805.pdf) with a global batch size of `256`:
# > We train with batch size of 256 sequences (256 sequences * 512 tokens = 128,000 tokens/batch) for 1,000,000 steps, which is approximately 40 epochs over the 3.3 billion word corpus.
#
# This means a full pre-training would take around 125 hours (12.5 hours * 10) and cost us around $1,650 using Habana Gaudi on AWS, which is extremely cheap.
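#
# _As a quick back-of-the-envelope check (a sketch, not part of the original experiment), the estimate above can be reproduced from the numbers in this section. The hourly price used below (~$13.11 for a `dl1.24xlarge` instance) is an assumption implied by the ~$1,650 figure, not a value reported in this notebook._

# %%
# Rough cost estimate for a full 1M-step pre-training on Habana Gaudi.
# All inputs come from the text above except the hourly price, which is an
# assumed on-demand price for a dl1.24xlarge instance (~$13.11/hour).
hours_per_100k_steps = 12.5      # measured: 100k steps at global batch size 256
total_steps = 1_000_000          # steps used for the original BERT pre-training
gaudi_price_per_hour = 13.11     # assumption: dl1.24xlarge on-demand price in USD

total_hours = total_steps / 100_000 * hours_per_100k_steps
gaudi_cost = total_hours * gaudi_price_per_hour
print(f"estimated time: {total_hours:.0f} hours, estimated cost: ${gaudi_cost:,.0f}")
# -> estimated time: 125 hours, estimated cost: $1,639 (close to the ~$1,650 quoted above)

# %% [markdown]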
#
# For comparison, the DeepSpeed team, which holds the record for the [fastest BERT pre-training](https://www.deepspeed.ai/tutorials/bert-pretraining/), [reported](https://www.deepspeed.ai/tutorials/bert-pretraining/) that pre-training BERT on one [DGX-2](https://www.nvidia.com/en-us/data-center/dgx-2/) (powered by 16 NVIDIA V100 GPUs with 32GB of memory each) takes around 33.25 hours.
#
# To compare the cost we can use the [p3dn.24xlarge](https://aws.amazon.com/de/ec2/instance-types/p3/) as a reference, which comes with 8x NVIDIA V100 32GB GPUs and costs ~$31.22/hour. We would need two of these instances to match the setup DeepSpeed reported; for now we are ignoring any overhead created by the multi-node setup (I/O, network, etc.).
# This would bring the cost of the DeepSpeed GPU-based training on AWS to around ~$2,075, which is 25% more than what Habana Gaudi currently costs.
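#
# _The same back-of-the-envelope arithmetic (again only a sketch, using the prices and durations quoted above) makes the comparison explicit:_

# %%
# Rough cost comparison between the reported GPU setup and the Gaudi estimate.
gpu_hours = 33.25               # reported DeepSpeed pre-training time on one DGX-2
p3dn_price_per_hour = 31.22     # approximate on-demand price of a p3dn.24xlarge in USD
num_instances = 2               # two p3dn.24xlarge ~ one DGX-2 (16x V100 32GB)
gaudi_cost = 1_650              # Habana Gaudi estimate from the section above

gpu_cost = gpu_hours * p3dn_price_per_hour * num_instances
print(f"GPU cost: ${gpu_cost:,.0f}, Gaudi cost: ${gaudi_cost:,.0f}, "
      f"GPU setup is {gpu_cost / gaudi_cost - 1:.0%} more expensive")
# -> GPU cost: $2,076, Gaudi cost: $1,650, GPU setup is 26% more expensive
# (roughly the ~25% difference quoted above)

# %% [markdown]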
# _Something to note here is that using [DeepSpeed](https://www.deepspeed.ai/tutorials/bert-pretraining/#deepspeed-single-gpu-throughput-results) generally improves performance by a factor of ~2._
#
# We are looking forward to re-running the experiment once the [Gaudi DeepSpeed integration](https://docs.habana.ai/en/latest/PyTorch/DeepSpeed/DeepSpeed_User_Guide.html#deepspeed-configs) is more widely available.
#
#
# ## Conclusion
#
# That's it for this tutorial. Now you know the basics of how to pre-train BERT from scratch using Hugging Face Transformers and Habana Gaudi. You also saw how easy it is to migrate from the `Trainer` to the `GaudiTrainer`.
#
# We compared our implementation with the [fastest BERT-pretraining](https://www.deepspeed.ai/tutorials/bert-pretraining/) results and saw that the GPU-based setup costs around 25% more than Habana Gaudi, which lets us pre-train BERT for ~$1,650.
#
# These results are incredible, since they will allow companies to adapt their pre-trained models to their language and domain to [improve accuracy by up to 10%](https://huggingface.co/pile-of-law/legalbert-large-1.7M-1#evaluation-results) compared to the general BERT models.
#