Update README.md
README.md
CHANGED
@@ -5,14 +5,17 @@ language:
 library_name: transformers
 pipeline_tag: fill-mask
 datasets:
-- tahrirchi/
+- tahrirchi/uz-crawl
+- tahrirchi/uz-books
 tags:
 - bert
 widget:
+- text: >-
+    Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning <mask>,
+    mutafakkiri va davlat arbobi bo‘lgan.
 ---
 
-# TahrirchiBERT base
+# TahrirchiBERT base model
 
 TahrirchiBERT-base is an encoder-only Transformer text model with 110 million parameters.
 It is pretrained on the Uzbek language (Latin script) using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between uzbek and Uzbek.
@@ -90,7 +93,7 @@ You can use this model directly with a pipeline for masked language modeling:
 
 ## Training data
 
-TahrirchiBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. TahrirchiBERT is trained on the Uzbek [Uzbek
+TahrirchiBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. TahrirchiBERT is trained on [Uzbek Crawl](https://huggingface.co/datasets/tahrirchi/uz-crawl) and the entire Latin-script portion of [Uzbek Books](https://huggingface.co/datasets/tahrirchi/uz-books), which together contain roughly 4,000 preprocessed books and 1.2 million curated text documents scraped from the internet and Telegram blogs (equivalent to 5 billion tokens).
 
 ## Training procedure
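The Training data paragraph above describes the MLM objective: some tokens in the input are hidden and the model must predict them. A minimal, self-contained sketch of that masking step is below. It is illustrative only: the real pretraining masks subword tokens produced by the model's tokenizer (and BERT-style recipes typically use an 80/10/10 mask/random/keep scheme), whereas this sketch masks whitespace-split words with a simple Bernoulli draw.

```python
import random

MASK_TOKEN = "<mask>"

def mask_tokens(tokens, mask_prob=0.15, seed=1):
    """Hide roughly mask_prob of the tokens.

    Returns the corrupted sequence and a {position: original token}
    map — the targets the model is trained to predict.
    """
    rng = random.Random(seed)
    corrupted, targets = [], {}
    for i, tok in enumerate(tokens):
        if rng.random() < mask_prob:
            corrupted.append(MASK_TOKEN)
            targets[i] = tok
        else:
            corrupted.append(tok)
    return corrupted, targets

# Example sentence (hypothetical, not from the training set):
sentence = "Alisher Navoiy ulug‘ o‘zbek shoiri edi".split()
corrupted, targets = mask_tokens(sentence)
print(corrupted)  # original tokens with some replaced by <mask>
print(targets)    # which positions the model must reconstruct
```

The loss is then computed only at the masked positions, which is what makes the objective "fill-mask" — the same task exposed by the model's `pipeline_tag` in the metadata above.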