|
--- |
|
license: mit |
|
datasets: |
|
- ELiRF/dacsa |
|
- projecte-aina/CATalog |
|
language: |
|
- ca |
|
- en |
|
base_model: |
|
- openai-community/gpt2 |
|
- openai-community/gpt2-medium |
|
pipeline_tag: text-generation |
|
--- |
|
|
|
# GPT-2 Medium Catalan-English Model |
|
|
|
The model is still being trained, and I will be making updates. Please do not expect great results just yet.
|
|
|
## Model Overview |
|
This model uses the GPT-2 Medium architecture and was trained **from scratch**, meaning it does not inherit weights from any existing model. It was trained on **Catalan** datasets, specifically **ELiRF/dacsa** and **projecte-aina/CATalog**.
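
For reference, both training corpora are public and can be inspected with the `datasets` library. This is only a sketch; the `"catalan"` config name for DACSA is an assumption, so check the dataset card if `load_dataset` asks for a different configuration:

```python
from datasets import load_dataset

# Stream a few examples from one of the training corpora.
# NOTE: the "catalan" config name is an assumption; the dataset card
# on the Hub is authoritative if this raises a config error.
ds = load_dataset("ELiRF/dacsa", "catalan", split="train", streaming=True)
for example in ds.take(3):
    print(example)
```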
|
|
|
## License and Usage |
|
This model is **free to use** under the MIT license. However, proper credit must be given when using it in research, applications, or any derived work. |
|
|
|
## Tokenizer |
|
The model uses a **52,000-token vocabulary** built on the GPT-2 tokenizer configuration and trained specifically to handle Catalan. The tokenizer is also available on its own at `Marxx01/gpt2-catalan-tokenizer`.
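
As a quick sanity check, the standalone tokenizer can be loaded directly from the Hub (a minimal sketch; the sample sentence is arbitrary):

```python
from transformers import AutoTokenizer

# Load the standalone Catalan tokenizer from the Hub.
tokenizer = AutoTokenizer.from_pretrained("Marxx01/gpt2-catalan-tokenizer")

print(tokenizer.vocab_size)  # expected: 52000, per the model card
print(tokenizer.tokenize("Bon dia, com estàs?"))
```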
|
|
|
## How to Use |
|
To use this model for text generation, you can load it with the `transformers` library as follows: |
|
|
|
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Marxx01/gpt2_catalan"
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

text = "El president de la generalitat va dir "
inputs = tokenizer(text, return_tensors="pt")

outputs = model.generate(
    **inputs,
    do_sample=True,             # sample instead of greedy decoding
    max_length=150,             # total length: prompt + generated tokens
    temperature=0.7,            # lower values give more conservative text
    top_p=0.8,                  # nucleus sampling threshold
    top_k=1000,                 # sample only from the 1000 most likely tokens
    no_repeat_ngram_size=2,     # block repeated bigrams
    num_return_sequences=1,
    pad_token_id=tokenizer.eos_token_id,  # GPT-2 has no pad token; this silences a warning
)

print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
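
Alternatively, the high-level `pipeline` API wraps the same load-tokenize-generate steps. A minimal sketch reusing the sampling settings from above:

```python
from transformers import pipeline

# Build a text-generation pipeline around the same checkpoint.
generator = pipeline("text-generation", model="Marxx01/gpt2_catalan")

result = generator(
    "El president de la generalitat va dir ",
    do_sample=True,
    max_length=150,
    temperature=0.7,
    top_p=0.8,
)
print(result[0]["generated_text"])
```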