Update README.md
README.md CHANGED
@@ -14,8 +14,6 @@ widget:
 
 # SW
 
-* Pre-trained model on Swahili language using a masked language modeling (MLM) objective.
-
 ## Model description
 
 This is a transformers model pre-trained on a large corpus of Swahili data in a self-supervised fashion. This means it
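A note on the objective the description refers to: in masked language modeling, a random subset of tokens is replaced with a `[MASK]` token and the model is trained to predict the original tokens from context, so the training signal comes from the raw text itself. The following is a minimal sketch of that masking step using the `DataCollatorForLanguageModeling` helper from `transformers`; the 15% masking rate and the sample sentence follow the common BERT recipe and are assumptions, not details taken from this model card.

```python
# Minimal sketch of the MLM masking step described in the model card.
# The 15% masking rate and the sample sentence are assumptions (standard
# BERT recipe), not details taken from the card.
from transformers import AutoTokenizer, DataCollatorForLanguageModeling

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")

collator = DataCollatorForLanguageModeling(
    tokenizer=tokenizer, mlm=True, mlm_probability=0.15
)

encoding = tokenizer("Habari ya leo duniani.", return_special_tokens_mask=True)
batch = collator([encoding])

# Roughly 15% of the tokens are replaced by [MASK]; the labels are -100
# everywhere except at the masked positions.
print(batch["input_ids"][0])
print(batch["labels"][0])
```

Positions labelled `-100` are ignored by the loss, so only the masked tokens contribute to training.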
@@ -42,10 +40,11 @@ The model is based on the Orginal BERT UNCASED which can be found on [google-res
 You can use the raw model for masked language modeling, but it's primarily intended to be fine-tuned on a downstream task.
 
 ### How to use
-
 You can use this model directly with a pipeline for masked language modeling:
 
 
+#### Tokenizer
+
 ```python
 from transformers import AutoTokenizer, AutoModelForMaskedLM
 
@@ -55,8 +54,26 @@ model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")
 text = "Hii ni tovuti ya idhaa ya Kiswahili ya BBC ambayo hukuletea habari na makala kutoka Afrika na kote duniani kwa lugha ya Kiswahili."
 encoded_input = tokenizer(text, return_tensors='pt')
 output = model(**encoded_input)
+print(output)
+```
+
+#### Fill Mask Model
+
+```python
+from transformers import AutoTokenizer, AutoModelForMaskedLM
+from transformers import pipeline
+
+tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
+model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")
+
+fill_mask = pipeline("fill-mask", model=model, tokenizer=tokenizer)
+sample_text = "Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi"
+
+for prediction in fill_mask(sample_text):
+    print(f"{prediction['sequence']}, confidence: {prediction['score']}")
 ```
+
 ### Limitations and Bias
 
-Even if the training data used for this model could be reasonably neutral, this model can have biased
-
+Even if the training data used for this model could be reasonably neutral, this model can have biased predictions.
+This is something I'm still working on improving. Feel free to share suggestions/comments via Discussion or [Email Me 😀](mailto:[email protected]?subject=HF%20Model%20Suggestions)
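On the examples added above: the fill-mask pipeline hides a small amount of post-processing. Assuming the checkpoint uses a BERT-style `[MASK]` token, as the sample text suggests, roughly the same top predictions can be recovered from the raw `AutoModelForMaskedLM` output like this (a sketch, with an arbitrary top-5 cut-off):

```python
# Rough sketch of what the fill-mask pipeline does with the raw logits.
# Assumes a BERT-style [MASK] token, as in the example above; top-5 is an
# arbitrary choice.
import torch
from transformers import AutoTokenizer, AutoModelForMaskedLM

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
model = AutoModelForMaskedLM.from_pretrained("eolang/SW-v1")

sample_text = "Tumefanya mabadiliko muhimu [MASK] sera zetu za faragha na vidakuzi"
inputs = tokenizer(sample_text, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits  # shape: (1, sequence_length, vocab_size)

# Find the position of the [MASK] token and rank the vocabulary there.
mask_positions = (inputs["input_ids"] == tokenizer.mask_token_id).nonzero(as_tuple=True)[1]
probs = logits[0, mask_positions].softmax(dim=-1)
top = probs.topk(5, dim=-1)

for token_id, score in zip(top.indices[0], top.values[0]):
    print(tokenizer.decode(int(token_id)), float(score))
```

This is also what the `output` object in the tokenizer example contains: token-level logits over the vocabulary, before any decoding.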
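Finally, since the card states that the raw model is primarily intended to be fine-tuned on a downstream task, here is a minimal sketch of a single fine-tuning step for a hypothetical binary text-classification task. The `AutoModelForSequenceClassification` head, example sentences, labels, and learning rate are illustrative assumptions and not part of the model card:

```python
# Hypothetical fine-tuning sketch: binary text classification on top of the
# pre-trained encoder. All data, labels and hyper-parameters below are
# placeholders, not details from the model card.
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("eolang/SW-v1")
# The classification head is freshly initialised; only the encoder weights
# come from the pre-trained checkpoint.
model = AutoModelForSequenceClassification.from_pretrained("eolang/SW-v1", num_labels=2)

texts = ["Habari njema sana!", "Hii ni habari mbaya."]  # placeholder examples
labels = torch.tensor([1, 0])                           # placeholder labels

batch = tokenizer(texts, padding=True, truncation=True, return_tensors="pt")
outputs = model(**batch, labels=labels)

# One optimisation step; a real run would iterate over a labelled dataset.
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)
outputs.loss.backward()
optimizer.step()
optimizer.zero_grad()
```

In practice this step would sit inside a loop over a labelled Swahili dataset, with evaluation and checkpointing added on top.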