---
license: apache-2.0
language:
- uz
library_name: transformers
pipeline_tag: fill-mask
datasets:
- tahrirchi/uz-crawl
- tahrirchi/uz-books
tags:
- bert
widget:
- text: >-
    Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning <mask>,
    mutafakkiri va davlat arbobi bo‘lgan.
---

# TahrirchiBERT base model

TahrirchiBERT-base is an encoder-only Transformer text model with 110 million parameters. 
It is pretrained on the Uzbek language (Latin script) using a masked language modeling (MLM) objective. This model is case-sensitive: it makes a difference between uzbek and Uzbek.

For full details of this model, please read our paper (coming soon!) and the [release blog post](https://tahrirchi.uz/grammatika-tekshiruvi).

## Model variations

This model is part of the family of **TahrirchiBERT models** trained with different numbers of parameters, which will be continuously expanded in the future.

| Model | Number of parameters | Language | Script |
|------------------------|--------------------------------|-------|-------|
| [`tahrirchi-bert-small`](https://huggingface.co/tahrirchi/tahrirchi-bert-small) | 67M | Uzbek | Latin |
| [`tahrirchi-bert-base`](https://huggingface.co/tahrirchi/tahrirchi-bert-base) | 110M | Uzbek | Latin |

## Intended uses & limitations

This model is primarily aimed at being fine-tuned on tasks that use the whole sentence (potentially masked)
to make decisions, such as sequence classification, token classification or question answering. 
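For instance, here is a minimal sketch of loading the checkpoint for fine-tuning on a sequence classification task. The task and `num_labels` are hypothetical, and the standard `transformers` Auto classes are assumed to work for this checkpoint:

```python
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("tahrirchi/tahrirchi-bert-base")

# num_labels=2 is a placeholder for a hypothetical binary classification task;
# a randomly initialized classification head is added on top of the encoder.
model = AutoModelForSequenceClassification.from_pretrained(
    "tahrirchi/tahrirchi-bert-base", num_labels=2
)
```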

### How to use

You can use this model directly with a pipeline for masked language modeling:

```python
>>> from transformers import pipeline
>>> unmasker = pipeline('fill-mask', model='tahrirchi/tahrirchi-bert-base')
>>> unmasker("Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning <mask>, mutafakkiri va davlat arbobi bo‘lgan.")

[{'score': 0.4616584777832031,
  'token': 10879,
  'token_str': ' shoiri',
  'sequence': 'Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning shoiri, mutafakkiri va davlat arbobi bo‘lgan.'},
 {'score': 0.19899587333202362,
  'token': 10013,
  'token_str': ' olimi',
  'sequence': 'Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning olimi, mutafakkiri va davlat arbobi bo‘lgan.'},
 {'score': 0.055418431758880615,
  'token': 12224,
  'token_str': ' asoschisi',
  'sequence': 'Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning asoschisi, mutafakkiri va davlat arbobi bo‘lgan.'},
 {'score': 0.037673842161893845,
  'token': 24597,
  'token_str': ' faylasufi',
  'sequence': 'Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning faylasufi, mutafakkiri va davlat arbobi bo‘lgan.'},
 {'score': 0.029616089537739754,
  'token': 9543,
  'token_str': ' farzandi',
  'sequence': 'Alisher Navoiy – ulug‘ o‘zbek va boshqa turkiy xalqlarning farzandi, mutafakkiri va davlat arbobi bo‘lgan.'}]


>>> unmasker("Egiluvchan boʻgʻinlari va <mask>, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.")

[{'score': 0.1740381121635437,
  'token': 12571,
  'token_str': ' oyoqlari',
  'sequence': 'Egiluvchan bo‘g‘inlari va oyoqlari, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.05455964431166649,
  'token': 2073,
  'token_str': ' uzun',
  'sequence': 'Egiluvchan bo‘g‘inlari va uzun, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.050441522151231766,
  'token': 19725,
  'token_str': ' barmoqlari',
  'sequence': 'Egiluvchan bo‘g‘inlari va barmoqlari, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.04490342736244202,
  'token': 10424,
  'token_str': ' tanasi',
  'sequence': 'Egiluvchan bo‘g‘inlari va tanasi, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'},
 {'score': 0.03777358680963516,
  'token': 27116,
  'token_str': ' bukilgan',
  'sequence': 'Egiluvchan bo‘g‘inlari va bukilgan, yarim bukilgan tirnoqlari tik qiyaliklar hamda daraxtlarga oson chiqish imkonini beradi.'}]
```
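Below is a short sketch of using the checkpoint to get the features of a given text in PyTorch, assuming the standard `AutoTokenizer`/`AutoModel` loading path works for this checkpoint (the example sentence is ours):

```python
from transformers import AutoTokenizer, AutoModel

tokenizer = AutoTokenizer.from_pretrained("tahrirchi/tahrirchi-bert-base")
model = AutoModel.from_pretrained("tahrirchi/tahrirchi-bert-base")

text = "Alisher Navoiy – ulug‘ o‘zbek shoiri."
encoded_input = tokenizer(text, return_tensors="pt")
output = model(**encoded_input)

# Contextual token embeddings: (batch_size, sequence_length, hidden_size)
print(output.last_hidden_state.shape)
```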

## Training data

TahrirchiBERT is pretrained using a standard Masked Language Modeling (MLM) objective: the model is given a sequence of text with some tokens hidden, and it has to predict these masked tokens. TahrirchiBERT is trained on [Uzbek Crawl](https://huggingface.co/datasets/tahrirchi/uz-crawl) and the entire Latin portion of [Uzbek Books](https://huggingface.co/datasets/tahrirchi/uz-books), which together contain roughly 4,000 preprocessed books and 1.2 million curated text documents scraped from the internet and Telegram blogs (equivalent to 5 billion tokens).

## Training procedure

### Preprocessing

The texts are tokenized using a byte-level version of Byte-Pair Encoding (BPE) with a vocabulary size of 30,528 to make full use of rare words. The inputs of the model are pieces of 512 contiguous tokens that may span multiple documents. We also applied a number of regular expressions to avoid misrepresenting symbols that are often used incorrectly in practice.
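As an illustration, here is a minimal sketch of inspecting the tokenizer, assuming it loads with the standard `AutoTokenizer` API (the example sentence is ours):

```python
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("tahrirchi/tahrirchi-bert-base")

# The vocabulary size reported here should match the 30,528 mentioned above.
print(tokenizer.vocab_size)

# Byte-level BPE splits rare or unseen words into smaller subword pieces.
print(tokenizer.tokenize("Alisher Navoiy – ulug‘ o‘zbek shoiri."))
```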

### Pretraining

The model was trained for one million steps with a batch size of 512. The sequence length was limited to 512 tokens throughout pretraining. The optimizer used is Adam with a learning rate of 5e-4, \\(\beta_{1} = 0.9\\) and \\(\beta_{2} = 0.98\\), and a weight decay of 1e-5; the learning rate warms up to the full LR over the first 6% of the training duration and then decays linearly to 0.02x the full LR by the end of training.
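For illustration only, here is a small Python sketch of the learning rate schedule described above (the function and the exact linear interpolation are our reconstruction, not the original training code):

```python
def lr_at_step(step, total_steps=1_000_000, peak_lr=5e-4,
               warmup_frac=0.06, final_frac=0.02):
    """Warm up linearly to peak_lr over the first 6% of steps, then
    decay linearly to 0.02x peak_lr by the end of training."""
    warmup_steps = int(total_steps * warmup_frac)
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return peak_lr * (1.0 - (1.0 - final_frac) * progress)

print(lr_at_step(60_000))     # end of warmup: 5e-4
print(lr_at_step(1_000_000))  # end of training: 1e-5 (0.02 x 5e-4)
```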

## Citation

Please cite this model using the following format:

```
@online{Mamasaidov2023TahrirchiBERT,
    author    = {Mukhammadsaid Mamasaidov and Abror Shopulatov},
    title     = {TahrirchiBERT base},
    year      = {2023},
    url       = {https://huggingface.co/tahrirchi/tahrirchi-bert-base},
    note      = {Accessed: 2023-10-27}, % change this date
    urldate   = {2023-10-27} % change this date
}
```

## Gratitude

We are thankful to these awesome organizations and people for helping to make this happen:

 - [MosaicML team](https://mosaicml.com/): for their script for efficiently training BERT models
 - [Ilya Gusev](https://github.com/IlyaGusev/): for advice throughout the process
 - [David Dale](https://daviddale.ru): for advice throughout the process