---
datasets:
- xnli
language:
- sw
library_name: transformers
examples: null
widget:
- text: Uhuru Kenyatta ni rais wa [MASK].
  example_title: Sentence_1
- text: Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi
  example_title: Sentence_2
---
|
|
|
# SW |
|
|
|
## Model description |
|
|
|
This is a transformers model pre-trained on a large corpus of Swahili data in a self-supervised fashion. This means it was pre-trained on the raw texts only, with no humans labeling them in any way (which is why it can use lots of publicly available data), using an automatic process to generate inputs and labels from those texts. More precisely, it was pre-trained with one objective:
|
|
|
- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs the entire masked sentence through the model and has to predict the masked words. This is different from traditional recurrent neural networks (RNNs), which usually see the words one after the other, and from autoregressive models like GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the sentence.
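To make the objective concrete, the following is a minimal sketch of what the 15% masking step looks like, using the generic `DataCollatorForLanguageModeling` utility from `transformers`. This is only an illustration of the procedure, not the exact pre-training code used for this model:

```python
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")

# Randomly select ~15% of the tokens; following the BERT recipe, most of them
# are replaced with [MASK] and their original ids are kept as labels
# (positions that are not selected get the ignore label -100).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True, mlm_probability=0.15)

text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."
encoding = tokenizer(text, return_tensors="pt")
batch = collator([{"input_ids": encoding["input_ids"][0]}])

print(tokenizer.decode(batch["input_ids"][0]))  # the sentence with some tokens masked
print(batch["labels"][0])                       # original ids at masked positions, -100 elsewhere
```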
|
|
|
This way, the model learns an inner representation of the Swahili language that can then be used to extract features useful for downstream tasks such as:

* Named Entity Recognition (Token Classification)
* Text Classification
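
As a rough illustration of this kind of feature extraction (a sketch only; mean pooling over the last hidden state of `AutoModel` is one common choice, not something prescribed by this card), sentence-level features could be obtained like this:

```python
import torch
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModel.from_pretrained("eolang/SW-v1")  # encoder only, without the MLM head

text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    hidden_states = model(**inputs).last_hidden_state  # shape: (1, seq_len, hidden_size)

# One common (but not prescribed) choice: mean-pool the token vectors
# into a single sentence embedding for a downstream classifier.
sentence_embedding = hidden_states.mean(dim=1)
print(sentence_embedding.shape)
```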
|
|
|
The model is based on the original BERT (uncased) architecture, which is described in the [google-research/bert README](https://github.com/google-research/bert/blob/master/README.md).
|
|
|
|
|
## Intended uses & limitations |
|
|
|
You can use the raw model for masked language modeling, but it's primarily intended to be fine-tuned on a downstream task. |
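
For example, fine-tuning for a downstream text-classification task might start from a sketch like the one below. The label count and the toy texts/labels are placeholders for illustration, not part of this model or its training data:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
# num_labels is a placeholder; a fresh classification head is initialized on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained("eolang/SW-v1", num_labels=2)

# Hypothetical toy batch with made-up binary labels.
texts = [
    "Hii ni tovuti ya idhaa ya Kiswahili ya BBC.",
    "Habari njema kutoka Nairobi.",
]
labels = torch.tensor([0, 1])

inputs = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**inputs, labels=labels)

# A single gradient step; in practice you would use a Trainer or a full training loop.
outputs.loss.backward()
print(float(outputs.loss))
```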
|
|
|
### How to use |
|
You can use this model directly with a pipeline for masked language modeling: |
|
|
|
|
|
#### Tokenizer and Model
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

# Load the tokenizer and the model with its masked-language-modeling head
tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

# Tokenize a sample sentence and run it through the model
text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
```
|
|
|
#### Fill Mask Model |
|
|
|
```python
from transformers import AutoTokenizer, AutoModelForMaskedLM, pipeline

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

# Build a fill-mask pipeline and predict the most likely tokens for [MASK]
fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
sample_text = "Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi"

for prediction in fill_mask(sample_text):
    print(f"{prediction['sequence']}, confidence: {prediction['score']}")
```
|
|
|
### Limitations and Bias |
|
|
|
Even though the training data used for this model could be considered reasonably neutral, the model can still make biased predictions. This is something I'm still working on improving. Feel free to share suggestions/comments via the Discussions tab or [Email Me](mailto:[email protected]?subject=HF%20Model%20Suggestions).
|
|