---
language: tr
tags:
- turkish
- tr
- gpt2-tr
- gpt2-turkish
license: mit
metrics:
- accuracy
---

# 🇹🇷 Turkish GPT-2 Model

In this repository I release a GPT-2 model that was trained on a variety of Turkish texts.

The model is meant to be an entry point for fine-tuning on other texts.

## Training corpora

I used a Turkish corpus compiled from various written and spoken sources.

With the Tokenizers library, I created a 52K BPE vocab based on the training corpus.
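
The tokenizer training code itself is not part of this card; as a rough sketch, a 52K byte-level BPE vocab (the scheme GPT-2 uses) could be built with the Tokenizers library roughly like this, where the corpus file and output directory names are placeholders:

```python
import os

from tokenizers import ByteLevelBPETokenizer

# Train a byte-level BPE tokenizer on the raw corpus.
# "turkish_corpus.txt" and the output directory are placeholder names.
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["turkish_corpus.txt"],
    vocab_size=52_000,
    min_frequency=2,
    special_tokens=["<|endoftext|>"],
)

# Writes vocab.json and merges.txt, which GPT2TokenizerFast can load later.
os.makedirs("turkish-bpe-vocab", exist_ok=True)
tokenizer.save_model("turkish-bpe-vocab")
```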

After creating the vocab, I trained GPT-2 for Turkish on the complete training corpus for five epochs.
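
The actual training setup is not reproduced here. Purely as an illustrative sketch (not the script used for this model), pre-training GPT-2 from scratch for five epochs with the Transformers `Trainer` could look roughly as follows, with placeholder file names and hyperparameters:

```python
from datasets import load_dataset
from transformers import (
    DataCollatorForLanguageModeling,
    GPT2Config,
    GPT2LMHeadModel,
    GPT2TokenizerFast,
    Trainer,
    TrainingArguments,
)

# Load the 52K BPE vocab created above (placeholder directory name).
tokenizer = GPT2TokenizerFast.from_pretrained("turkish-bpe-vocab")
tokenizer.pad_token = tokenizer.eos_token

# Fresh GPT-2 model sized to the Turkish vocabulary.
model = GPT2LMHeadModel(GPT2Config(vocab_size=len(tokenizer)))

# Tokenize the raw text corpus (placeholder file name).
dataset = load_dataset("text", data_files={"train": "turkish_corpus.txt"})["train"]
dataset = dataset.filter(lambda example: len(example["text"].strip()) > 0)
dataset = dataset.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=512),
    batched=True,
    remove_columns=["text"],
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="gpt2-turkish-cased",
        num_train_epochs=5,              # five epochs, as described above
        per_device_train_batch_size=8,   # placeholder value
    ),
    train_dataset=dataset,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```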

Logs during training:

https://tensorboard.dev/experiment/3AWKv8bBTaqcqZP5frtGkw/#scalars

## Using the model

The model itself can be used like this:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

tokenizer = AutoTokenizer.from_pretrained("ahmet1338/gpt2-turkish-cased")
model = AutoModelForCausalLM.from_pretrained("ahmet1338/gpt2-turkish-cased")
```
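
Once loaded, text can be generated directly with `model.generate`. The following minimal illustration continues from the snippet above; the prompt and sampling parameters are arbitrary choices:

```python
# Continues from the loading snippet above; the prompt and sampling settings
# are only illustrative.
inputs = tokenizer("Akşamüstü yolda ilerlerken, ", return_tensors="pt")
output_ids = model.generate(
    **inputs,
    max_length=100,
    do_sample=True,
    top_k=50,
    pad_token_id=tokenizer.eos_token_id,
)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```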

Here's an example that shows how to use the Transformers pipelines for generating text:

```python
from transformers import pipeline

pipe = pipeline('text-generation',
                model="ahmet1338/gpt2-turkish-cased",
                tokenizer="ahmet1338/gpt2-turkish-cased")

# Generation options such as max_length are passed at call time.
text = pipe("Akşamüstü yolda ilerlerken, ", max_length=800)[0]["generated_text"]
print(text)
```

### How to clone the model repo?

```bash
git lfs install
git clone https://huggingface.co/ahmet1338/gpt2-turkish-cased
```