---
language:
- hu
- en
- zh
tags:
- text-generation
- puli
license: cc-by-nc-4.0
widget:
- text: Elmesélek egy történetet a nyelvtechnológiáról.
---

# PULI GPTrio (7.67 billion parameters)

For further details, read [our paper](http://real.mtak.hu/173960/1/TSD_2023_GPT.pdf); to test our instruct model, see [our demo site](https://juniper.nytud.hu/demo/gptrio).

- Hungarian-English-Chinese trilingual GPT-NeoX model (7.67 billion parameters)
- Trained with EleutherAI's [GPT-NeoX](https://github.com/EleutherAI/gpt-neox)
- Checkpoint: 410 000 steps

## Dataset

- Hungarian: 41.5 billion words (314 GB)
- English: 61.9 billion words (391 GB)
- GitHub: 6 million documents (33 GB)
- Chinese: 98.7 billion Chinese characters (340 GB)
  - (12 billion non-Chinese tokens)

## Limitations

- max_seq_length = 2048
- float16
- vocab size: 150 016
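
In practice, the 2048-token limit means the prompt and the generated tokens together have to fit in one window. The sketch below shows one way to budget for that by trimming an over-long prompt from the left. `fit_prompt` and `CONTEXT_WINDOW` are hypothetical names introduced here for illustration; they are not part of this model or of `transformers`, and keeping the most recent tokens is an assumed policy, not the authors' recommendation.

```python
# Hypothetical helper: not part of the model card or of the transformers library.
CONTEXT_WINDOW = 2048  # max_seq_length from the list above


def fit_prompt(prompt_ids, max_new_tokens, context_window=CONTEXT_WINDOW):
    """Trim a list of token ids so that prompt + new tokens fit the window.

    Assumed policy: keep the most recent tokens and drop the oldest ones.
    """
    if not 0 < max_new_tokens < context_window:
        raise ValueError("max_new_tokens must be between 1 and context_window - 1")
    budget = context_window - max_new_tokens  # tokens left for the prompt
    return prompt_ids[-budget:]


ids = list(range(3000))  # stand-in for real token ids
kept = fit_prompt(ids, max_new_tokens=100)
print(len(kept))  # 1948
```

With the tokenizer from the Usage section, the input could be the plain list returned by `tokenizer(prompt).input_ids`, and the same `max_new_tokens` value would then be passed to `model.generate`.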

## Citation

If you use this model, please cite the following paper:

```
@inproceedings {yang-puli-gptrio,
    title = {Mono- and multilingual GPT-3 models for Hungarian},
    booktitle = {Text, Speech, and Dialogue},
    year = {2023},
    publisher = {Springer Nature Switzerland},
    series = {Lecture Notes in Computer Science},
    address = {Plzeň, Czech Republic},
    author = {Yang, Zijian Győző and Laki, László János and Váradi, Tamás and Prószéky, Gábor},
    pages = {94--104},
    isbn = {978-3-031-40498-6}
}
```

## Usage

```python
from transformers import GPTNeoXForCausalLM, AutoTokenizer

# Load the model and its tokenizer from the Hugging Face Hub
model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-GPTrio")
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
input_ids = tokenizer(prompt, return_tensors="pt").input_ids

# Sample a continuation; max_length counts prompt tokens plus generated tokens
gen_tokens = model.generate(
    input_ids,
    do_sample=True,
    temperature=0.9,
    max_length=100,
)

gen_text = tokenizer.batch_decode(gen_tokens)[0]
print(gen_text)
```

## Usage with pipeline

```python
from transformers import pipeline, GPTNeoXForCausalLM, AutoTokenizer

# Load the model and tokenizer, then wrap them in a text-generation pipeline
model = GPTNeoXForCausalLM.from_pretrained("NYTK/PULI-GPTrio")
tokenizer = AutoTokenizer.from_pretrained("NYTK/PULI-GPTrio")
prompt = "Elmesélek egy történetet a nyelvtechnológiáról."
generator = pipeline(task="text-generation", model=model, tokenizer=tokenizer)

print(generator(prompt)[0]["generated_text"])
```

# [Open LLM Leaderboard Evaluation Results](https://huggingface.co/spaces/HuggingFaceH4/open_llm_leaderboard)

Detailed results can be found [here](https://huggingface.co/datasets/open-llm-leaderboard/details_NYTK__PULI-GPTrio)

| Metric                | Value                     |
|-----------------------|---------------------------|
| Avg.                  | 30.07                     |
| ARC (25-shot)         | 30.72                     |
| HellaSwag (10-shot)   | 53.49                     |
| MMLU (5-shot)         | 24.73                     |
| TruthfulQA (0-shot)   | 39.03                     |
| Winogrande (5-shot)   | 57.77                     |
| GSM8K (5-shot)        | 0.76                      |
| DROP (3-shot)         | 4.03                      |