---
license: mit
datasets:
- ELiRF/dacsa
- projecte-aina/CATalog
language:
- ca
- en
base_model:
- openai-community/gpt2
- openai-community/gpt2-medium
pipeline_tag: text-generation
---
# GPT-2 Medium Catalan-English Model
The model is still being trained, and I will be making updates. Please do not expect great results just yet. 😀
## Model Overview
This model uses the GPT-2 Medium architecture trained **from scratch**, meaning it does not inherit any weights from existing models. It has been trained on **Catalan** data, specifically the **ELiRF/dacsa** and **projecte-aina/CATalog** datasets.
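For reference, a from-scratch GPT-2 Medium model can be instantiated with `transformers` as sketched below. This is an assumption for illustration, not the published training setup: the dimensions are the standard GPT-2 Medium values, and the vocabulary size matches the 52,000-token tokenizer described in the Tokenizer section.

```python
from transformers import GPT2Config, GPT2LMHeadModel

# Minimal sketch of a from-scratch GPT-2 Medium setup (assumed, not the
# author's published training config). n_embd/n_layer/n_head are the
# standard GPT-2 Medium values; vocab_size matches the 52,000-token tokenizer.
config = GPT2Config(
    vocab_size=52_000,
    n_positions=1024,
    n_embd=1024,
    n_layer=24,
    n_head=16,
)
model = GPT2LMHeadModel(config)  # randomly initialized, no pretrained weights
print(f"{model.num_parameters():,} parameters")
```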
## License and Usage
This model is **free to use** under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work.
## Tokenizer
The model uses a **52,000-token vocabulary** built with the GPT-2 tokenizer configuration and trained specifically to handle Catalan. The tokenizer is also available on its own as `Marxx01/gpt2-catalan-tokenizer`.
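The standalone tokenizer can be loaded directly from the Hub. A minimal sketch (the example sentence is illustrative):

```python
from transformers import AutoTokenizer

# Load the standalone Catalan tokenizer
tokenizer = AutoTokenizer.from_pretrained("Marxx01/gpt2-catalan-tokenizer")

# Example sentence: "Good morning, how are you?"
tokens = tokenizer.tokenize("Bon dia, com estàs?")
print(tokens)
print(tokenizer.vocab_size)  # expected: 52000
```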
## How to Use
To use this model for text generation, you can load it with the `transformers` library as follows:
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Marxx01/gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Catalan prompt: "The president of the Generalitat said "
text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,            # sample instead of greedy decoding
    max_length=150,            # total length in tokens (prompt + generation)
    temperature=0.7,           # soften the next-token distribution
    top_p=0.8,                 # nucleus sampling threshold
    top_k=1000,                # sample only from the 1000 most likely tokens
    no_repeat_ngram_size=2,    # never repeat the same 2-gram
    num_return_sequences=1,    # return a single completion
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
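Alternatively, since the model is tagged for text generation, it can be used through the high-level `pipeline` API. A minimal sketch (the generation parameters here are illustrative, not recommended settings):

```python
from transformers import pipeline

# Text-generation pipeline; loads model and tokenizer in one step
generator = pipeline("text-generation", model="Marxx01/gpt2_catalan")

# Catalan prompt: "The president of the Generalitat said "
result = generator(
    "El president de la generalitat va dir ",
    max_length=150,
    do_sample=True,
)
print(result[0]["generated_text"])
```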