File size: 4,046 Bytes
387585b |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 |
---
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-large-uncased
results: []
---
# Graphcore/bert-large-uncased
This model is a pre-trained BERT-Large trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.
## Model description
Pre-trained BERT Large model trained on Wikipedia data.
## Training and evaluation data
Trained on wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)
## Training procedure
Trained MLM and NSP pre-training scheme from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).
Trained on 64 Graphcore Mk2 IPUs using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore)
Command lines:
Phase 1:
```
python examples/language-modeling/run_pretraining.py \
--config_name bert-large-uncased \
--tokenizer_name bert-large-uncased \
--ipu_config_name Graphcore/bert-large-ipu \
--dataset_name Graphcore/wikipedia-bert-128 \
--do_train \
--logging_steps 5 \
--max_seq_length 128 \
--max_steps 10550 \
--is_already_preprocessed \
--dataloader_num_workers 64 \
--dataloader_mode async_rebatched \
--lamb \
--lamb_no_bias_correction \
--per_device_train_batch_size 8 \
--gradient_accumulation_steps 512 \
--pod_type pod64 \
--learning_rate 0.006 \
--lr_scheduler_type linear \
--loss_scaling 32768 \
--weight_decay 0.01 \
--warmup_ratio 0.28 \
--config_overrides "layer_norm_eps=0.001" \
--ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
--output_dir output-pretrain-bert-large-phase1
```
Phase 2:
```
python examples/language-modeling/run_pretraining.py \
--config_name bert-large-uncased \
--tokenizer_name bert-large-uncased \
--model_name_or_path ./output-pretrain-bert-large-phase1 \
--ipu_config_name Graphcore/bert-large-ipu \
--dataset_name Graphcore/wikipedia-bert-512 \
--do_train \
--logging_steps 5 \
--max_seq_length 512 \
--max_steps 2038 \
--is_already_preprocessed \
--dataloader_num_workers 96 \
--dataloader_mode async_rebatched \
--lamb \
--lamb_no_bias_correction \
--per_device_train_batch_size 2 \
--gradient_accumulation_steps 512 \
--pod_type pod64 \
--learning_rate 0.002828 \
--lr_scheduler_type linear \
--loss_scaling 16384 \
--weight_decay 0.01 \
--warmup_ratio 0.128 \
--config_overrides "layer_norm_eps=0.001" \
--ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
--output_dir output-pretrain-bert-large-phase2
```
### Training hyperparameters
The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10550
- training precision: Mixed Precision
The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
### Training results
```
train/epoch: 2.04
train/global_step: 2038
train/loss: 1.2002
train/train_runtime: 12022.3897
train/train_steps_per_second: 0.17
train/train_samples_per_second: 2777.367
```
### Framework versions
- Transformers 4.17.0
- Pytorch 1.10.0+cpu
- Datasets 2.0.0
- Tokenizers 0.11.6
|