---
datasets:
- xnli
language:
- sw
library_name: transformers
examples: null
widget:
  - text: Joe Biden ni rais wa [MASK].
    example_title: Sentence 1
  - text: Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi
    example_title: Sentence 2
  - text: Mtoto anaweza kupoteza [MASK] kabisa
    example_title: Sentence 3
---

# SW

## Model description

This is a transformers model pre-trained on a large corpus of Swahili data in a self-supervised fashion. This means it
was pre-trained on raw text only, with no human labeling of any kind (which is why it can use lots of publicly
available data); inputs and labels were generated automatically from those texts. More precisely, it was pre-trained
with one objective:

- Masked language modeling (MLM): taking a sentence, the model randomly masks 15% of the words in the input, then runs
  the entire masked sentence through the model and has to predict the masked words. This is different from traditional
  recurrent neural networks (RNNs), which usually see the words one after the other, or from autoregressive models like
  GPT, which internally mask the future tokens. It allows the model to learn a bidirectional representation of the
  sentence. A toy illustration of this masking is sketched below.
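
The snippet below is only a toy illustration of the MLM objective, not the actual pre-training code: it randomly replaces roughly 15% of the tokens with the tokenizer's `[MASK]` token.

```python
import random

from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")

text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC"
tokens = tokenizer.tokenize(text)

# Replace ~15% of the tokens with [MASK]; during pre-training the model
# has to recover the original tokens at these positions.
masked = [tokenizer.mask_token if random.random() < 0.15 else tok for tok in tokens]
print(tokenizer.convert_tokens_to_string(masked))
```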

This way, the model learns an inner representation of the Swahili language that can then be used to extract features
useful for downstream tasks (see the sketch after this list), e.g.:

- Named Entity Recognition (Token Classification)
- Text Classification
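
As a minimal sketch (assuming the standard `AutoModel` API; this is not an official recipe for this checkpoint), contextual features can be extracted like this:

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModel.from_pretrained("eolang/SW-v1")  # encoder only, no MLM head

text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC"
inputs = tokenizer(text, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# One contextual vector per token; the [CLS] vector (position 0) is a common
# choice as a sentence-level feature for classification.
token_features = outputs.last_hidden_state   # shape: (1, seq_len, hidden_size)
sentence_feature = token_features[:, 0]      # [CLS] embedding
print(token_features.shape, sentence_feature.shape)
```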

The model is based on the original uncased BERT architecture, which is described in the [google-research/bert README](https://github.com/google-research/bert/blob/master/README.md).


## Intended uses & limitations

You can use the raw model for masked language modeling, but it's primarily intended to be fine-tuned on a downstream task. 
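
As an illustration only (this is a hypothetical recipe, not the one used to produce any released fine-tuned checkpoint), fine-tuning for text classification on the Swahili split of XNLI could look roughly like this:

```python
from datasets import load_dataset
from transformers import (AutoModelForSequenceClassification, AutoTokenizer,
                          Trainer, TrainingArguments)

# Hypothetical fine-tuning sketch: natural language inference on Swahili XNLI
dataset = load_dataset("xnli", "sw")
tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForSequenceClassification.from_pretrained("eolang/SW-v1", num_labels=3)

def tokenize(batch):
    # XNLI provides premise/hypothesis pairs with 3 labels
    # (entailment / neutral / contradiction)
    return tokenizer(batch["premise"], batch["hypothesis"],
                     truncation=True, max_length=128)

encoded = dataset.map(tokenize, batched=True)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="sw-xnli",
                           per_device_train_batch_size=16,
                           num_train_epochs=3),
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    tokenizer=tokenizer,
)
trainer.train()
```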

### How to use
You can use this model directly with a pipeline for masked language modeling:


#### Tokenizer and Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."

# Tokenize the text and run it through the model to get the MLM logits
encoded_input = tokenizer(text, return_tensors='pt')
output = model(**encoded_input)
print(output)
```

#### Fill Mask Model

```python
from transformers import AutoTokenizer, AutoModelForMaskedLM
from transformers import pipeline

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
sample_text = "Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi"

for prediction in fill_mask(sample_text):
    print(f"{prediction['sequence']}, confidence: {prediction['score']}")
```

### Limitations and Bias

Even though the training data used for this model could be characterized as fairly neutral, the model can still produce biased predictions.
This is something I'm still working on improving. Feel free to share suggestions and comments via [Discussions](https://huggingface.co/eolang/SW-v1/discussions).