Model: BERT-TWEET
Lang: IT

Model description

This is an uncased BERT [1] model for the Italian language. It was obtained by taking TwHIN-BERT [2] (twhin-bert-base) as a starting point and focusing it on Italian by modifying the embedding layer (as in [3], computing document-level frequencies over the Wikipedia dataset).
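
The embedding-layer reduction described above can be illustrated with a minimal sketch. This is not the author's actual procedure: the function names, the whitespace tokenizer, the toy vocabulary, and the `min_ratio` threshold are all hypothetical, chosen only to show the general idea of keeping the tokens that appear in enough documents and dropping the embedding rows of the others.

```python
from collections import Counter


def document_frequencies(docs, tokenize):
    """Count, for each token, the number of documents it appears in."""
    df = Counter()
    for doc in docs:
        # set(): a token counts once per document, however often it repeats
        df.update(set(tokenize(doc)))
    return df


def reduce_embeddings(vocab, embeddings, df, num_docs, min_ratio, always_keep):
    """Keep tokens that are frequent enough and slice the embedding rows to match.

    `min_ratio` and `always_keep` are illustrative assumptions, not values
    taken from the model card.
    """
    kept = [
        tok for tok in vocab
        if tok in always_keep or df[tok] / num_docs >= min_ratio
    ]
    old_index = {tok: i for i, tok in enumerate(vocab)}
    new_vocab = {tok: i for i, tok in enumerate(kept)}  # remapped token ids
    new_embeddings = [embeddings[old_index[tok]] for tok in kept]
    return new_vocab, new_embeddings


# Toy multilingual vocabulary with one 2-dimensional embedding row per token
vocab = ["[PAD]", "[UNK]", "ciao", "mondo", "hello", "world"]
embeddings = [[float(i), float(i) + 0.5] for i in range(len(vocab))]

# Toy Italian "corpus" standing in for Wikipedia
docs = ["ciao mondo", "ciao a tutti", "il mondo è grande"]

df = document_frequencies(docs, str.split)
new_vocab, new_embeddings = reduce_embeddings(
    vocab, embeddings, df, len(docs),
    min_ratio=0.5, always_keep={"[PAD]", "[UNK]"},
)
print(new_vocab)
```

In this toy run the English tokens never occur in the corpus, so their rows are dropped and the remaining tokens receive new consecutive ids. The transformer layers are untouched; only the vocabulary and the embedding matrix shrink.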

The resulting model has 110M parameters, a vocabulary of 30,520 tokens, and a size of ~440 MB.
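
The parameter count and the file size are consistent with each other, as a quick check shows. The sketch assumes the weights are stored as 32-bit floats (4 bytes per parameter) and uses the rounded 110M figure; `model_size_mb` is a hypothetical helper, not part of any library.

```python
def model_size_mb(num_params: int, bytes_per_param: int = 4) -> float:
    """Approximate size in megabytes (10**6 bytes) of the raw weights."""
    return num_params * bytes_per_param / 1_000_000


# 110M parameters, assumed to be stored as float32
size = model_size_mb(110_000_000)
print(f"~{size:.0f} MB")
```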

Quick usage

from transformers import BertTokenizerFast, BertModel

tokenizer = BertTokenizerFast.from_pretrained("osiria/bert-tweet-base-italian-uncased")
model = BertModel.from_pretrained("osiria/bert-tweet-base-italian-uncased")

A version of the model already fine-tuned for sentiment analysis is available here: https://huggingface.co/osiria/bert-tweet-italian-uncased-sentiment

References

[1] https://arxiv.org/abs/1810.04805

[2] https://arxiv.org/abs/2209.07562

[3] https://arxiv.org/abs/2010.05609

Limitations

This model was trained on tweets, so it's mainly suitable for general-purpose social media text processing, involving short texts written in a social network style. It might show limitations when it comes to longer and more structured text, or domain-specific text.

License

The model is released under the Apache-2.0 license.
