---
tags:
- generated_from_trainer
datasets:
- Graphcore/wikipedia-bert-128
- Graphcore/wikipedia-bert-512
model-index:
- name: Graphcore/bert-large-uncased
  results: []
---

# Graphcore/bert-large-uncased

This model is a BERT-Large checkpoint pre-trained in two phases on the [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128) and [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512) datasets.

## Model description

A BERT-Large (uncased) model pre-trained on Wikipedia data with the masked language modelling (MLM) and next sentence prediction (NSP) objectives.
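
Although it was pre-trained on IPUs, the resulting checkpoint is a standard `transformers` BERT model and can be loaded with the usual API for inspection or downstream fine-tuning. A minimal sketch (not part of the original card; loading via `AutoModelForMaskedLM` keeps only the MLM head and may warn about the unused NSP weights):

```python
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the pre-trained checkpoint and its tokenizer from the Hugging Face Hub.
tokenizer = AutoTokenizer.from_pretrained("Graphcore/bert-large-uncased")
model = AutoModelForMaskedLM.from_pretrained("Graphcore/bert-large-uncased")

# Probe the masked-language-modelling head on a toy example.
inputs = tokenizer("The capital of France is [MASK].", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits

# Find the [MASK] position and take the highest-scoring token.
mask_pos = (inputs.input_ids == tokenizer.mask_token_id).nonzero()[0, 1]
predicted_id = logits[0, mask_pos].argmax().item()
print(tokenizer.decode([predicted_id]))  # e.g. "paris" if the MLM head loaded correctly
```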


## Training and evaluation data

Trained on the following Wikipedia datasets:
- [Graphcore/wikipedia-bert-128](https://huggingface.co/datasets/Graphcore/wikipedia-bert-128)
- [Graphcore/wikipedia-bert-512](https://huggingface.co/datasets/Graphcore/wikipedia-bert-512)
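
Both datasets are hosted on the Hugging Face Hub and, judging by the `--is_already_preprocessed` flag in the commands below, are already preprocessed to their respective sequence lengths (128 and 512 tokens). A quick way to peek at them with the `datasets` library (a sketch; the split name is an assumption, check the dataset cards):

```python
from datasets import load_dataset

# Stream a few examples from the phase-1 (sequence length 128) dataset
# without downloading the full corpus; the "train" split name is assumed.
ds = load_dataset("Graphcore/wikipedia-bert-128", split="train", streaming=True)
print(next(iter(ds)).keys())
```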

## Training procedure

Pre-trained with the MLM and NSP objectives and the large-batch LAMB training recipe from [Large Batch Optimization for Deep Learning: Training BERT in 76 minutes](https://arxiv.org/abs/1904.00962).
Trained on 64 Graphcore Mk2 IPUs using [`optimum-graphcore`](https://github.com/huggingface/optimum-graphcore).
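
For reference, the per-layer LAMB update from the cited paper has roughly the following form (a sketch for orientation, not the exact `optimum-graphcore` implementation; note that `--lamb_no_bias_correction` in the commands below skips the Adam-style bias correction of the moments):

$$
m_t = \beta_1 m_{t-1} + (1-\beta_1)\,g_t,\qquad
v_t = \beta_2 v_{t-1} + (1-\beta_2)\,g_t^2,\qquad
r_t = \frac{m_t}{\sqrt{v_t}+\epsilon}
$$

$$
w_{t+1} = w_t - \eta\,\frac{\phi(\lVert w_t\rVert)}{\lVert r_t + \lambda w_t\rVert}\,(r_t + \lambda w_t)
$$

where λ is the weight decay (0.01 here) and φ is a scaling function of the layer-wise weight norm, so each layer takes a trust-ratio-scaled step.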

Command lines:

Phase 1:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-large-uncased \
  --tokenizer_name bert-large-uncased \
  --ipu_config_name Graphcore/bert-large-ipu \
  --dataset_name Graphcore/wikipedia-bert-128 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 128 \
  --max_steps 10550 \
  --is_already_preprocessed \
  --dataloader_num_workers 64 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 8 \
  --gradient_accumulation_steps 512 \
  --pod_type pod64 \
  --learning_rate 0.006 \
  --lr_scheduler_type linear \
  --loss_scaling 32768 \
  --weight_decay 0.01 \
  --warmup_ratio 0.28 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
  --output_dir output-pretrain-bert-large-phase1
```

Phase 2:
```
python examples/language-modeling/run_pretraining.py \
  --config_name bert-large-uncased \
  --tokenizer_name bert-large-uncased \
  --model_name_or_path ./output-pretrain-bert-large-phase1 \
  --ipu_config_name Graphcore/bert-large-ipu \
  --dataset_name Graphcore/wikipedia-bert-512 \
  --do_train \
  --logging_steps 5 \
  --max_seq_length 512 \
  --max_steps 2038 \
  --is_already_preprocessed \
  --dataloader_num_workers 96 \
  --dataloader_mode async_rebatched \
  --lamb \
  --lamb_no_bias_correction \
  --per_device_train_batch_size 2 \
  --gradient_accumulation_steps 512 \
  --pod_type pod64 \
  --learning_rate 0.002828 \
  --lr_scheduler_type linear \
  --loss_scaling 16384 \
  --weight_decay 0.01 \
  --warmup_ratio 0.128 \
  --config_overrides "layer_norm_eps=0.001" \
  --ipu_config_overrides "matmul_proportion=[0.14 0.19 0.19 0.19]" \
  --output_dir output-pretrain-bert-large-phase2
```

### Training hyperparameters

The following hyperparameters were used during phase 1 training:
- learning_rate: 0.006
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 65536
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.28
- training_steps: 10550
- training precision: Mixed Precision

The following hyperparameters were used during phase 2 training:
- learning_rate: 0.002828
- train_batch_size: 2
- eval_batch_size: 8
- seed: 42
- distributed_type: IPU
- gradient_accumulation_steps: 512
- total_train_batch_size: 16384
- total_eval_batch_size: 512
- optimizer: LAMB
- lr_scheduler_type: linear
- lr_scheduler_warmup_ratio: 0.128
- training_steps: 2038
- training precision: Mixed Precision
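
The `total_train_batch_size` values follow from the per-device batch size, gradient accumulation, and data-parallel replication. A small sanity check, assuming 16 replicas on the pod64 (an assumption about the `Graphcore/bert-large-ipu` configuration, not something stated in this card):

```python
# Effective batch size = per_device_train_batch_size
#                        * gradient_accumulation_steps
#                        * data-parallel replicas (assumed: 16).
replicas = 16

phase1 = 8 * 512 * replicas  # -> 65536, matches total_train_batch_size above
phase2 = 2 * 512 * replicas  # -> 16384, matches total_train_batch_size above
print(phase1, phase2)
```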

### Training results

```
train/epoch: 2.04
train/global_step: 2038
train/loss: 1.2002
train/train_runtime: 12022.3897
train/train_steps_per_second: 0.17
train/train_samples_per_second: 2777.367
```
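
These figures correspond to phase 2 (`train/global_step` matches the phase-2 `max_steps` of 2038), and the reported throughput is internally consistent with the phase-2 effective batch size, as the rough check below illustrates:

```python
# Rough consistency check of the reported phase-2 throughput.
steps = 2038
runtime_s = 12022.3897
total_train_batch_size = 16384

steps_per_second = steps / runtime_s                            # ~0.17, as reported
samples_per_second = steps_per_second * total_train_batch_size  # ~2777, as reported
print(round(steps_per_second, 2), round(samples_per_second, 1))
```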

### Framework versions

- Transformers 4.17.0
- PyTorch 1.10.0+cpu
- Datasets 2.0.0
- Tokenizers 0.11.6