---
language:
- tr
tags:
- turkish
- gpt2
pipeline_tag: text-generation
---

# Model Card for gpt2-turkish-base

GPT-2 fine-tuned on a cleaned Turkish corpus.

Warning: Because the model was trained on a large, broadly sourced dataset, it may produce unethical text. Please use it with care; no liability is accepted.

### Training Data

- Dataset size: ~1.5 million cleaned documents (Wikipedia, news, etc.)

## Using the model

```python
from transformers import GPT2LMHeadModel, GPT2TokenizerFast
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = GPT2LMHeadModel.from_pretrained("erythropygia/gpt2-turkish-base").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("erythropygia/gpt2-turkish-base")
tokenizer.pad_token = tokenizer.eos_token

def generate_output(text):
    # Tokenize the input prompt
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)

    # Generate a completion with sampling-based decoding
    output_ids = model.generate(input_ids,
                                no_repeat_ngram_size=3,
                                max_length=50,
                                repetition_penalty=1.1,
                                top_k=100,
                                top_p=0.7,
                                temperature=0.8,
                                do_sample=True,
                                num_return_sequences=1)[0]

    # Decode the generated token IDs back to text
    return tokenizer.decode(output_ids, skip_special_tokens=False)

print(generate_output("Türkiye'nin en çok tercih "))
```
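
The same generation can also be run through the `transformers` `pipeline` helper. This is a minimal sketch rather than part of the original card; the parameters mirror the `generate()` call above, and `device=0` assumes a single GPU (use `device=-1` for CPU):

```python
from transformers import pipeline

# Text-generation pipeline; device=0 selects the first GPU (assumption: a GPU is available)
generator = pipeline("text-generation",
                     model="erythropygia/gpt2-turkish-base",
                     device=0)

# Same sampling parameters as the explicit generate() call above
result = generator("Türkiye'nin en çok tercih ",
                   max_length=50,
                   do_sample=True,
                   top_k=100,
                   top_p=0.7,
                   temperature=0.8,
                   repetition_penalty=1.1,
                   no_repeat_ngram_size=3)
print(result[0]["generated_text"])
```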

#### Training Hyperparameters

- **Epochs:** 15
- **Learning rate:** 4e-4
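
The original usage snippet imported `TextDataset`, `DataCollatorForLanguageModeling`, `Trainer`, and `TrainingArguments`, which points to a standard `Trainer`-based causal-LM fine-tune. The sketch below reconstructs such a setup under explicit assumptions: the corpus file `train.txt`, `block_size=128`, the batch size, and starting from base `gpt2` weights with the published tokenizer are all illustrative guesses, not documented details.

```python
from transformers import (GPT2LMHeadModel, GPT2TokenizerFast, TextDataset,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

# Assumed starting point: base "gpt2" weights plus the published Turkish tokenizer
tokenizer = GPT2TokenizerFast.from_pretrained("erythropygia/gpt2-turkish-base")
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.resize_token_embeddings(len(tokenizer))

# TextDataset chunks a plain-text corpus into fixed-length blocks
# ("train.txt" and block_size=128 are illustrative assumptions)
train_dataset = TextDataset(tokenizer=tokenizer,
                            file_path="train.txt",
                            block_size=128)

# Causal-LM objective: mlm=False means plain next-token prediction
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

training_args = TrainingArguments(
    output_dir="gpt2-turkish-base",
    num_train_epochs=15,             # from the hyperparameters above
    learning_rate=4e-4,              # from the hyperparameters above
    per_device_train_batch_size=8,   # assumed; not documented
)

trainer = Trainer(model=model,
                  args=training_args,
                  data_collator=data_collator,
                  train_dataset=train_dataset)
trainer.train()
```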

#### Training Results

- **Training loss:** 3.4589332405925295