murodbek committed
Commit 6012d68
Parent: 4e4d68a

Update README.md

Files changed (1):
  README.md +7 -4
README.md CHANGED
@@ -5,14 +5,17 @@ language:
  library_name: transformers
  pipeline_tag: fill-mask
  datasets:
- - tahrirchi/uzbek-corpus
+ - tahrirchi/uz-crawl
+ - tahrirchi/uz-books
  tags:
  - bert
  widget:
- - text: "Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning <mask>, mutafakkiri va davlat arbobi bo‘lgan."
+ - text: >-
+     Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning <mask>,
+     mutafakkiri va davlat arbobi bo‘lgan.
  ---
 
- # TahrirchiBERT base mode
+ # TahrirchiBERT base model
 
  TahrirchiBERT-base is an encoder-only Transformer text model with 110 million parameters.
  It is pretrained on the Uzbek language (Latin script) using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between uzbek and Uzbek.
@@ -90,7 +93,7 @@ You can use this model directly with a pipeline for masked language modeling:
 
  ## Training data
 
- TahrirchiBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. TahrirchiBERT is trained on the [Uzbek Corpus dataset](https://huggingface.co/tahrirchi/uzbek-corpus), which contains roughly 35,000 preprocessed books, 4 million curated text documents scraped from the internet, and 100 Telegram blogs (equivalent to 5 billion tokens).
+ TahrirchiBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. TahrirchiBERT is trained on [Uzbek Crawl](https://huggingface.co/datasets/tahrirchi/uz-crawl) and the entire Latin portion of [Uzbek Books](https://huggingface.co/datasets/tahrirchi/uz-books), which together contain roughly 4,000 preprocessed books and 1.2 million curated text documents scraped from the internet and Telegram blogs (equivalent to 5 billion tokens).
 
  ## Training procedure
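The second hunk's context line ("You can use this model directly with a pipeline for masked language modeling:") introduces the usage example that the updated widget sentence feeds; in English, the sentence reads "Alisher Navoiy was a great <mask>, thinker, and statesman of the Uzbek and other Turkic peoples." A minimal sketch of that usage, assuming the model is published under the hub id `tahrirchi/tahrirchibert-base` (the exact repository id is not shown in this diff):

```python
from transformers import pipeline

# Hub id is an assumption; the diff shows only the title "TahrirchiBERT base model".
unmasker = pipeline("fill-mask", model="tahrirchi/tahrirchibert-base")

# The widget sentence from the YAML front matter; the model should fill
# <mask> with a word such as "shoiri" ("poet").
predictions = unmasker(
    "Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning <mask>, "
    "mutafakkiri va davlat arbobi bo‘lgan."
)
for p in predictions:
    print(p["token_str"], round(p["score"], 3))
```

The `<mask>` placeholder in the widget text is a RoBERTa-style mask token; if the tokenizer uses a different one, `unmasker.tokenizer.mask_token` gives the right string.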
 
 
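For reference, the two datasets that replace `tahrirchi/uzbek-corpus` in the front matter can be inspected with the `datasets` library. A minimal sketch; the split names below (including "lat" for the Latin portion of Uzbek Books that the paragraph mentions) are assumptions, not taken from the diff:

```python
from datasets import load_dataset

# Stream instead of downloading: the combined corpora hold ~5 billion tokens.
# Split names are assumptions and may differ on the actual dataset repos.
crawl = load_dataset("tahrirchi/uz-crawl", split="train", streaming=True)
books = load_dataset("tahrirchi/uz-books", split="lat", streaming=True)

# Peek at the first record of each corpus without assuming field names.
print(next(iter(crawl)))
print(next(iter(books)))
```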