---
language:
- tr
tags:
- '#Turkish '
- '#turkish'
- '#gpt2'
pipeline_tag: text-generation
---

# Model Card for gpt2-turkish-base

GPT-2 fine-tuned on a cleaned Turkish corpus.

Warning: Because the model was trained on a large dataset, it may generate unethical text. Please use it with care. No liability is accepted for its output.

### Training Data

- Dataset size: ~1.5 million cleaned text samples (Wikipedia, news, etc.)

## Using the model

```Python
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

# Use the GPU when one is available, otherwise fall back to the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(device)

model = GPT2LMHeadModel.from_pretrained("erythropygia/gpt2-turkish-base").to(device)
tokenizer = GPT2TokenizerFast.from_pretrained("erythropygia/gpt2-turkish-base")
# GPT-2 has no padding token, so reuse the end-of-sequence token
tokenizer.pad_token = tokenizer.eos_token

def generate_output(text):
    # Tokenize the input text
    input_ids = tokenizer.encode(text, return_tensors="pt").to(device)

    # Generate a completion with the specified sampling parameters
    output_ids = model.generate(
        input_ids,
        no_repeat_ngram_size=3,
        max_length=50,
        repetition_penalty=1.1,
        top_k=100,
        top_p=0.7,
        temperature=0.8,
        do_sample=True,
        num_return_sequences=1,
        pad_token_id=tokenizer.eos_token_id,
    )[0]

    # Decode the generated token IDs back to text
    return tokenizer.decode(output_ids, skip_special_tokens=False)

print(generate_output("Türkiye'nin en çok tercih "))
```

#### Training Hyperparameters

- **Epochs:** 15
- **Learning rate:** 4e-4

#### Training Results

**training_loss:** 3.4589332405925295
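
For a rough sense of scale, the reported training loss can be converted to a perplexity. The sketch below assumes that `training_loss` is the mean per-token cross-entropy in nats, which is what the `transformers` `Trainer` usually reports; the card does not state this explicitly. The variable names are illustrative only.

```python
import math

# Value copied from the Training Results section of this card.
training_loss = 3.4589332405925295

# Assumption: the loss is the mean per-token cross-entropy in nats,
# so the corresponding perplexity is exp(loss).
training_perplexity = math.exp(training_loss)

print(f"training perplexity ~ {training_perplexity:.1f}")
```

Under that assumption the result is about 31.8. Because it is measured on the training data, it says nothing about performance on held-out text.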