metadata
language: ca
license: cc-by-sa-4.0
datasets:
- cc100
- oscar
- wikipedia
widget:
- text: M'agrada el clima i el menjar
- text: Ell està una mica
GPT2 Catalan small model Version 2 (Uncased)
Prerequisites
transformers==4.19.2
Model architecture
This model uses the GPT-2 base configuration, but with half the number of layers.
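As a rough illustration, the configuration below sketches what such a halved GPT-2 base could look like with the Transformers GPT2Config; the hidden size, head count, and context length shown are assumptions taken from GPT-2 base defaults, not values confirmed by this card.

from transformers import GPT2Config, GPT2LMHeadModel

# Sketch of a GPT-2 base configuration with half the layers.
# Only the layer count follows directly from the card; the rest are
# assumed GPT-2 base defaults.
config = GPT2Config(
    vocab_size=50000,   # matches the BPE vocabulary size stated below
    n_layer=6,          # half of GPT-2 base's 12 layers
    n_embd=768,         # assumed: GPT-2 base hidden size
    n_head=12,          # assumed: GPT-2 base attention heads
    n_positions=1024,   # assumed: GPT-2 base context length
)
model = GPT2LMHeadModel(config)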
Tokenizer
Uses a BPE tokenizer with a vocabulary size of 50,000.
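The card does not describe how the tokenizer was built; the following is a minimal sketch of training a 50,000-token byte-level BPE vocabulary with the Hugging Face tokenizers library, where the corpus file name and special tokens are hypothetical.

from tokenizers import ByteLevelBPETokenizer

# Minimal sketch: train a byte-level BPE tokenizer with a 50,000-token
# vocabulary (actual training corpus and special tokens are assumptions).
tokenizer = ByteLevelBPETokenizer()
tokenizer.train(
    files=["catalan_corpus.txt"],   # hypothetical plain-text corpus file
    vocab_size=50000,
    special_tokens=["<s>", "</s>", "<unk>", "<pad>", "<mask>"],
)
tokenizer.save_model("tokenizer")   # writes vocab.json and merges.txt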
Training Data
- wiki40b/ca (Catalan Wikipedia)
- Subset of OSCAR
- Subset of CC-100/ca: monolingual dataset from web crawl data (a loading sketch follows this list)
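The card does not say how these corpora were obtained. The sketch below assumes they were pulled with the Hugging Face Datasets library; the configuration names (wiki40b "ca", OSCAR "unshuffled_deduplicated_ca", cc100 lang="ca") are assumptions based on the public dataset scripts, and the subsets actually used for training are not specified here.

from datasets import load_dataset

# Assumed dataset configurations; some loaders may need extra
# dependencies (e.g. Apache Beam for wiki40b).
wiki = load_dataset("wiki40b", "ca", split="train")
oscar = load_dataset("oscar", "unshuffled_deduplicated_ca", split="train")
cc100 = load_dataset("cc100", lang="ca", split="train")

print(wiki[0]["text"][:200])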
Usage
from transformers import pipeline
generator = pipeline('text-generation', model='ClassCat/gpt2-small-catalan-v2')
generator("Ell està una mica")