Update README.md
README.md
Fine-tuning is done using the `train` split of the GLUE MNLI dataset and the performance is evaluated on the `validation_mismatched` split.
`validation_mismatched` means validation examples are not derived from the same sources as those in the training set and therefore do not closely resemble any of the examples seen at training time.

Data splits for the MNLI dataset are the following:

|train |validation_matched|validation_mismatched|
|-----:|-----------------:|--------------------:|
|392702|              9815|                 9832|
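
These counts can be reproduced with the Hugging Face `datasets` library. A minimal check (not part of the fine-tuning code itself):

```python
from datasets import load_dataset

# Load GLUE MNLI; `train` is used for fine-tuning and
# `validation_mismatched` for evaluation.
mnli = load_dataset("glue", "mnli")

for split in ("train", "validation_matched", "validation_mismatched"):
    print(split, len(mnli[split]))
# train 392702
# validation_matched 9815
# validation_mismatched 9832
```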
## Fine-tuning procedure
The model was fine-tuned on a Graphcore IPU-POD64 using `popxl`.

Prompt sentences are tokenized and packed together to form 1024-token sequences, following the [HF packing algorithm](https://github.com/huggingface/transformers/blob/v4.20.1/examples/pytorch/language-modeling/run_clm.py). No padding is used.
The packing process works in groups of 1000 examples and discards any remainder from each group that isn't a whole sequence.
For the 392,702 training examples this gives a total of 17,762 sequences per epoch.
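
As a rough illustration of the packing step, the grouping logic of the linked `run_clm.py` example (its `group_texts` function) boils down to the sketch below, with `block_size` set to 1024. It is applied over batches of examples (1000 by default in `Dataset.map`), which is where the per-group remainder that gets discarded comes from:

```python
from itertools import chain

block_size = 1024  # length of each packed sequence

def group_texts(examples):
    # Concatenate all tokenized examples in this batch into one long stream,
    # then split that stream into blocks of exactly `block_size` tokens.
    concatenated = {k: list(chain(*examples[k])) for k in examples.keys()}
    total_length = len(concatenated[list(examples.keys())[0]])
    # Drop the tail that does not fill a whole block; this is the per-group
    # remainder mentioned above, and no padding is added.
    total_length = (total_length // block_size) * block_size
    return {
        k: [t[i : i + block_size] for i in range(0, total_length, block_size)]
        for k, t in concatenated.items()
    }

# Typically applied as:
# packed = tokenized_dataset.map(group_texts, batched=True, batch_size=1000)
```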
Since the model is trained to predict the next token, labels are simply the input sequence shifted by one token.
Given the training format, no extra care is needed to account for different sequences: the model does not need to know which sentence a token belongs to.
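
To make the label construction concrete, here is a framework-agnostic NumPy sketch rather than the actual `popxl` code; masking the final position with -100 is a common convention and an assumption here:

```python
import numpy as np

def shift_labels(input_ids: np.ndarray) -> np.ndarray:
    # Next-token prediction: the label at position i is the token at i + 1.
    labels = np.empty_like(input_ids)
    labels[:-1] = input_ids[1:]
    # The last position has no next token, so mask it out of the loss.
    labels[-1] = -100
    return labels

packed_sequence = np.array([53, 1049, 764, 318, 257, 1332])  # toy token ids
print(shift_labels(packed_sequence))  # -> [1049 764 318 257 1332 -100]
```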
### Hyperparameters

- optimiser: AdamW (beta1: 0.9, beta2: 0.999, eps: 1e-6, weight decay: 0.0, learning rate: 5e-6)
- learning rate schedule: warmup schedule (min: 1e-7, max: 5e-6, warmup proportion: 0.005995)
- batch size: 128
- training steps: 300. Each epoch consists of ceil(17,762/128) = 139 steps, so 300 steps correspond to roughly 2 epochs.
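
For readers more familiar with PyTorch, the settings above could be written roughly as follows. This is only an illustrative sketch: the actual run used `popxl` on IPUs, `model` is a stand-in module, and the exact shape of the warmup (linear from the minimum to the maximum learning rate, then constant) is an assumption, since only the min, max and warmup proportion are stated:

```python
import math
import torch

min_lr, max_lr = 1e-7, 5e-6
total_steps = 300
warmup_steps = max(1, math.ceil(0.005995 * total_steps))  # 2 steps here

model = torch.nn.Linear(8, 8)  # stand-in for the actual fine-tuned model

optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=max_lr,
    betas=(0.9, 0.999),
    eps=1e-6,
    weight_decay=0.0,
)

def lr_factor(step: int) -> float:
    # Linear warmup from min_lr to max_lr, then constant (assumed shape).
    if step < warmup_steps:
        lr = min_lr + (max_lr - min_lr) * step / warmup_steps
    else:
        lr = max_lr
    return lr / max_lr  # LambdaLR expects a multiplier of the base lr

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_factor)
```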
## Performance
The resulting model matches SOTA performance with 82.5% accuracy.