---
license: mit
datasets:
- ELiRF/dacsa
- projecte-aina/CATalog
language:
- ca
- en
base_model:
- openai-community/gpt2
- openai-community/gpt2-medium
pipeline_tag: text-generation
---

# GPT-2 Medium Catalan-English Model

The model is still being trained and will receive further updates, so please do not expect great results just yet. 😀

## Model Overview
This model uses the GPT-2 Medium architecture and was trained **from scratch**, meaning it does not inherit any weights from existing models. It was trained on **Catalan** datasets, specifically **ELiRF/dacsa** and **projecte-aina/CATalog**.

## License and Usage
This model is **free to use** under the MIT license; however, proper credit must be given when it is used in research, applications, or any derived work.

## Tokenizer
The model uses a **52,000-token vocabulary** built on the GPT-2 tokenizer configuration and trained specifically for Catalan. The tokenizer is also available separately as "Marxx01/gpt2-catalan-tokenizer".
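
As a quick sketch, the standalone tokenizer can be loaded and inspected on its own with `transformers` (the prompt text here is just an illustrative Catalan sentence):

```python
from transformers import AutoTokenizer

# Load the Catalan tokenizer published alongside the model
tokenizer = AutoTokenizer.from_pretrained("Marxx01/gpt2-catalan-tokenizer")

# The card states a 52,000-token vocabulary
print(tokenizer.vocab_size)

# Round-trip a short Catalan sentence through the tokenizer
ids = tokenizer("Bon dia, com estàs?")["input_ids"]
print(tokenizer.decode(ids))
```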

## How to Use
To use this model for text generation, you can load it with the `transformers` library as follows:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Marxx01/gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,            # sample instead of greedy decoding
    max_length=150,
    temperature=0.7,
    top_p=0.8,
    top_k=1000,
    no_repeat_ngram_size=2,    # avoid repeating any 2-gram
    num_return_sequences=1
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```