|
--- |
|
language: es |
|
license: cc-by-4.0 |
|
tags: |
|
- spanish |
|
- roberta |
|
pipeline_tag: fill-mask |
|
widget: |
|
- text: "Fui a la librería a comprar un <mask>." |
|
--- |
|
|
|
# BERTIN |
|
|
|
BERTIN is a series of BERT-based models for Spanish. This one is a RoBERTa-large model trained from scratch on the Spanish portion of mC4 using [Flax](https://github.com/google/flax). The training scripts are included. |
|
|
|
This project is part of the [Flax/JAX Community Week](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104), organised by [Hugging Face](https://huggingface.co/), with TPU usage sponsored by Google. |
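The card metadata tags this model for the `fill-mask` task, so a minimal sketch of querying it with the `transformers` `pipeline` API may be useful. The model identifier is taken from the Model Repository link on this page. The helper names `check_single_mask` and `top_predictions` are hypothetical, and the sketch assumes the repository ships weights that `transformers` can load in your installed backend.

```python
MODEL_ID = "flax-community/bertin-roberta-large-spanish"  # from the Model Repository link
MASK_TOKEN = "<mask>"  # mask token used in the widget example of this card


def check_single_mask(text: str, mask_token: str = MASK_TOKEN) -> str:
    """Return text unchanged if it contains exactly one mask token, else raise."""
    if text.count(mask_token) != 1:
        raise ValueError(f"expected exactly one {mask_token!r} in: {text!r}")
    return text


def top_predictions(text: str, k: int = 5):
    """Return (token, score) pairs for the masked position.

    Hypothetical helper: calling it downloads the model weights.
    """
    # Imported lazily so check_single_mask works without transformers installed.
    from transformers import pipeline

    fill_mask = pipeline("fill-mask", model=MODEL_ID)
    results = fill_mask(check_single_mask(text), top_k=k)
    return [(r["token_str"].strip(), r["score"]) for r in results]
```

Calling `top_predictions("Fui a la librería a comprar un <mask>.")` should return the `k` highest-scoring candidate tokens. The actual candidates depend on the checkpoint, so none are shown here.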
|
|
|
## Spanish mC4 |
|
|
|
The Spanish portion of mC4 contains about 416 million records and 235 billion words. |
|
|
|
```bash |
|
$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | wc -l |
|
416057992 |
|
``` |
|
|
|
```bash |
|
$ zcat c4/multilingual/c4-es*.tfrecord-*.json.gz | jq -r '.text | split(" ") | length' | paste -s -d+ - | bc |
|
235303687795 |
|
``` |
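For readers without `jq`, the two counts above can be approximated in Python. This is a minimal sketch, assuming each shard is a gzipped JSON-lines file whose records carry a `text` field, as the `jq` filter implies. The function names `count_records_and_words` and `count_shards` are hypothetical. Edge cases such as blank lines or empty texts may be counted slightly differently than by `wc -l` and `jq`.

```python
import glob
import gzip
import json
from typing import Iterable, Tuple


def count_records_and_words(lines: Iterable[str]) -> Tuple[int, int]:
    """Count JSON-lines records and space-separated tokens in their "text" field."""
    records = 0
    words = 0
    for line in lines:
        if not line.strip():
            continue  # skip blank lines (note: `wc -l` would count them)
        records += 1
        text = json.loads(line).get("text", "")
        # Split on a single space, as the jq filter does; runs of spaces
        # therefore yield empty strings that are counted as well.
        words += len(text.split(" ")) if text else 0
    return records, words


def count_shards(pattern: str) -> Tuple[int, int]:
    """Sum the counts over every gzipped shard matching a glob pattern."""
    total_records = 0
    total_words = 0
    for path in sorted(glob.glob(pattern)):
        with gzip.open(path, "rt", encoding="utf-8") as handle:
            records, words = count_records_and_words(handle)
        total_records += records
        total_words += words
    return total_records, total_words
```

A call such as `count_shards("c4/multilingual/c4-es*.tfrecord*.json.gz")` would walk the same files as the shell commands. At this corpus size a single-process Python loop is likely to be much slower than the `zcat` pipelines, so treat it as a way to check a few shards, not as a replacement.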
|
|
|
## Team members |
|
|
|
- Javier de la Rosa (versae) |
|
- Manu Romero (mrm8488) |
|
- María Grandury (mariagrandury) |
|
- Ari Polakov (aripo99) |
|
- Pablogps |
|
- daveni |
|
- Sri Lakshmi |
|
|
|
## Useful links |
|
|
|
- [Community Week timeline](https://discuss.huggingface.co/t/open-to-the-community-community-week-using-jax-flax-for-nlp-cv/7104#summary-timeline-calendar-6) |
|
- [Community Week README](https://github.com/huggingface/transformers/blob/master/examples/research_projects/jax-projects/README.md) |
|
- [Community Week thread](https://discuss.huggingface.co/t/bertin-pretrain-roberta-large-from-scratch-in-spanish/7125) |
|
- [Community Week channel](https://discord.com/channels/858019234139602994/859113060068229190) |
|
- [Masked Language Modelling example scripts](https://github.com/huggingface/transformers/tree/master/examples/flax/language-modeling) |
|
- [Model Repository](https://huggingface.co/flax-community/bertin-roberta-large-spanish/) |
|
|