---
license: apache-2.0
datasets:
- oscar-corpus/OSCAR-2109
language:
- en
- el
pipeline_tag: text-generation
library_name: transformers
---
# B-GPT_en_el_sequential
This is a bilingual GPT-2 style model. For the first half of training, the model was trained only on English data; for the second half, it was trained only on Greek data. By the end of training, 50% of the training data seen by the model is English and 50% is Greek. The tokenizer was trained on the same overall proportions of data as the language model at the final step.
This model was released alongside the paper [On the Acquisition of Shared Grammatical Representations in Bilingual Language Models](https://arxiv.org/abs/2503.03962), which contains more details about the models. Additionally, the [OSF page](https://osf.io/5cw2e/) provides all code and data related to the project.
## Model details:
All models are trained with a [CLS] token (same as [BOS]) prepended and a [SEP] token (same as [EOS]) separating sequences.
For best results, make sure that [CLS] is prepended to your input sequence (see the sketch below and the sample usage in the Use This Model section).
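As a rough illustration of this convention, here is a minimal sketch, assuming the tokenizer exposes these tokens via `tokenizer.cls_token` and `tokenizer.sep_token`:
```
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_el_sequential")

# Prepend [CLS] and separate sequences with [SEP], matching the training format.
# (Assumes cls_token and sep_token are defined for this tokenizer.)
text = tokenizer.cls_token + "This is the first sequence." + tokenizer.sep_token + "This is the second."
ids = tokenizer(text, add_special_tokens=False)["input_ids"]
print(tokenizer.convert_ids_to_tokens(ids)[:5])
```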
Details for this model specifically:
* Architecture: gpt2
* Parameters: 124770816
* Maximum sequence length: 512 tokens
* Training tokens: 12B
* Vocabulary size: 50000
* Compute cost: ~9 NVIDIA A6000 GPU hours
* CO2 Emission: 1.17 kg
Training dataset: [OSCAR 2021/09](https://huggingface.co/datasets/oscar-corpus/OSCAR-2109)
Checkpoints are taken at training steps: 0, 10000, 20000, 30000, 40000, 50000, 64000, 64010, 64020, 64030, 64040, 64050, 64060, 64070, 64080, 64090, 64100, 64110, 64120, 64130, 64140, 64150, 64160, 64170, 64180, 64190, 64200, 64300, 64400, 64500, 64600, 64700, 64800, 64900, 65000, 66000, 67000, 68000, 69000, 70000, 80000, 90000, 100000, 110000, 120000, 128000.
## Use This Model
Load the model:
Note: if you do not specify a revision, the final checkpoint of the model is loaded. See the list of checkpoint steps above; the revision name is the checkpoint step.
```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_el_sequential")
model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_en_el_sequential", revision="128000")
```
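For example, to inspect the model at an earlier point in training, a minimal sketch (step 64000 is the listed checkpoint at roughly the point where training switches from English-only to Greek-only data):
```
from transformers import AutoModelForCausalLM

# Load the checkpoint from training step 64000 (see the checkpoint list above).
model_mid = AutoModelForCausalLM.from_pretrained(
    "catherinearnett/B-GPT_en_el_sequential", revision="64000"
)
```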
Text Generation:
```
from transformers import pipeline

pipe = pipeline("text-generation", model="catherinearnett/B-GPT_en_el_sequential")
print(pipe("I am a", max_length=20)[0]["generated_text"])
```
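As noted in the model details above, prepending [CLS] to the prompt is recommended. A minimal sketch using `model.generate`, assuming the tokenizer defines `tokenizer.cls_token`:
```
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("catherinearnett/B-GPT_en_el_sequential")
model = AutoModelForCausalLM.from_pretrained("catherinearnett/B-GPT_en_el_sequential")

# Prepend [CLS] (the model's BOS token) to the prompt, as in training.
prompt = tokenizer.cls_token + "I am a"
inputs = tokenizer(prompt, add_special_tokens=False, return_tensors="pt")
outputs = model.generate(**inputs, max_length=20, do_sample=True)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```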
## Citation
If you use this model, please cite:
```
```