---
license: mit
language:
- sk
datasets:
- oscar-corpus/OSCAR-2109
pipeline_tag: fill-mask
library_name: transformers
tags:
- slovak-language-model
---
# Slovak Morphological Baby Language Model (SK_Morph_BLM)

**SK_Morph_BLM** is a pretrained small language model for the Slovak language, based on the RoBERTa architecture. The model utilizes a custom morphological tokenizer (**SKMT**, more info [here](https://github.com/daviddrzik/Slovak_subword_tokenizers)) specifically designed for the Slovak language, which focuses on **preserving the integrity of root morphemes**. This tokenizer is not compatible with the standard `RobertaTokenizer` from the Hugging Face library due to its unique approach to tokenization. The model is case-insensitive, meaning it operates in lowercase. While the pretrained model can be used for masked language modeling, it is primarily intended for fine-tuning on downstream NLP tasks.

## How to Use the Model

To use the SK_Morph_BLM model, follow these steps:

```python
import torch
import sys
from transformers import AutoModelForMaskedLM
from huggingface_hub import snapshot_download

# Download the repository from Hugging Face and append the path to sys.path
repo_path = snapshot_download(repo_id="daviddrzik/SK_Morph_BLM")
sys.path.append(repo_path)

# Import the custom tokenizer from the downloaded repository
from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer

# Initialize the tokenizer and model
tokenizer = SKMorfoTokenizer()
model = AutoModelForMaskedLM.from_pretrained("daviddrzik/SK_Morph_BLM")

# Function to fill in the masked token in a given text
def fill_mask(tokenized_text, tokenizer, model, top_k=5):
    inputs = tokenizer.tokenize(tokenized_text.lower(), max_length=256, return_tensors='pt', return_subword=False)
    mask_token_index = torch.where(inputs["input_ids"][0] == 4)[0]  # ID 4 corresponds to the <mask> token in this vocabulary
    with torch.no_grad():
        predictions = model(**inputs)

    topk_tokens = torch.topk(predictions.logits[0, mask_token_index], k=top_k, dim=-1).indices

    fill_results = []
    for idx, i in enumerate(mask_token_index):
        for j, token_idx in enumerate(topk_tokens[idx]):
            token_text = tokenizer.convert_ids_to_tokens(token_idx.item())
            token_text = token_text.replace("Ġ", " ")  # Replace special characters with a space
            probability = torch.softmax(predictions.logits[0, i], dim=-1)[token_idx].item()
            fill_results.append({
                'score': probability,
                'token': token_idx.item(),
                'token_str': token_text,
                'sequence': tokenized_text.replace("<mask>", token_text.strip())
            })

    fill_results.sort(key=lambda x: x['score'], reverse=True)
    return fill_results

# Example usage of the function
text = "Včera večer sme <mask> nový film v kine, ktorý mal premiéru iba pred týždňom."
result = fill_mask(text.lower(), tokenizer, model, top_k=5)
print(result)

# Example output (top-5 predictions for the <mask> position):
[{'score': 0.4014046788215637,
  'token': 6626,
  'token_str': ' videli',
  'sequence': 'včera večer sme videli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.15018892288208008,
  'token': 874,
  'token_str': ' mali',
  'sequence': 'včera večer sme mali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.057530131191015244,
  'token': 21193,
  'token_str': ' pozreli',
  'sequence': 'včera večer sme pozreli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.049020398408174515,
  'token': 26468,
  'token_str': ' sledovali',
  'sequence': 'včera večer sme sledovali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
 {'score': 0.04107135161757469,
  'token': 9171,
  'token_str': ' objavili',
  'sequence': 'včera večer sme objavili nový film v kine, ktorý mal premiéru iba pred týždňom.'}]
```
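The same checkpoint and tokenizer can be reused for downstream fine-tuning with the standard Hugging Face model classes. The snippet below is only a minimal, hypothetical sketch: `num_labels=3` and the example sentence are placeholders (not part of this repository), and it reuses the `tokenizer` object from the example above.

```python
import torch
from transformers import AutoModelForSequenceClassification

# Hypothetical downstream head; num_labels=3 is a placeholder value.
clf_model = AutoModelForSequenceClassification.from_pretrained(
    "daviddrzik/SK_Morph_BLM", num_labels=3
)

# SKMT is used exactly as in the example above; lowercase the input because the model is case-insensitive.
encoded = tokenizer.tokenize("tento film bol naozaj výborný".lower(),
                             max_length=256, return_tensors='pt', return_subword=False)

with torch.no_grad():
    logits = clf_model(**encoded).logits
print(logits.shape)  # torch.Size([1, 3]); before fine-tuning the logits are essentially random
```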

## Training Data

The `SK_Morph_BLM` model was pretrained on a Slovak-language subset of the OSCAR-2109 corpus (see the dataset tag above). The corpus underwent comprehensive preprocessing to ensure the quality and relevance of the data; an illustrative sketch of these steps follows the list:

- **Language Filtering:** Non-Slovak text was removed to focus solely on the Slovak language.
- **Character Normalization:** Various types of spaces, quotes, dashes, and separators were standardized (e.g., replacing different types of spaces with a single space, or dashes with hyphens). Emoticons were replaced with spaces.
- **Symbol and Unwanted Text Removal:** Sentences containing mathematical symbols, pictograms, or characters from Asian and African languages were deleted. Duplicates of punctuation, special characters, and spaces were also removed.
- **URL and Text Normalization:** All web addresses were removed, and the text was converted to lowercase to simplify tokenization.
- **Content Cleanup:** Text that included irrelevant content from web crawling, such as keywords and HTML tags, was identified and removed.
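A minimal sketch of what such a cleanup step could look like is shown below. The exact rules and regular expressions used for the corpus are not published in this card, so every pattern here is an assumption that merely illustrates the listed operations:

```python
import re

def clean_paragraph(text: str) -> str:
    """Illustrative cleanup only; the patterns are assumptions, not the actual preprocessing code."""
    text = re.sub(r"https?://\S+|www\.\S+", " ", text)          # remove web addresses
    text = re.sub(r"[\u00A0\u2000-\u200B]", " ", text)          # normalize various space characters
    text = re.sub(r"[\u2012-\u2015]", "-", text)                # normalize dashes to hyphens
    text = re.sub(r"[“”„«»]", '"', text)                        # normalize quotes
    text = re.sub(r"([.,!?;:])\1+", r"\1", text)                # drop duplicated punctuation
    text = re.sub(r"\s{2,}", " ", text).strip()                 # drop duplicated spaces
    return text.lower()                                         # the model operates in lowercase
```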

Additionally, the preprocessing included further refinement steps to create the final dataset (a sketch of this stage follows the list):

- **Parentheses Content Removal:** All content within parentheses was removed to reduce noise.
- **Selection of Text Segments:** Medium-length text paragraphs were selected to maintain consistency.
- **Similarity Filtering:** Paragraphs with at least 50% similarity to previous ones were removed to minimize redundancy.
- **Random Sampling:** Finally, 20% of the remaining paragraphs were randomly selected.
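A rough sketch of this refinement stage, under stated assumptions: the length bounds are placeholders (the card only says "medium-length"), and comparing each paragraph only against the previously kept one is a simplification of the 50% similarity rule:

```python
import random
import re
from difflib import SequenceMatcher

def refine_corpus(paragraphs, min_len=200, max_len=2000, sample_rate=0.2, seed=42):
    """Illustrative refinement only; length bounds and the similarity comparison are assumptions."""
    kept = []
    for p in paragraphs:
        p = re.sub(r"\([^)]*\)", "", p).strip()                 # remove content within parentheses
        if not (min_len <= len(p) <= max_len):                  # keep medium-length paragraphs only
            continue
        if kept and SequenceMatcher(None, kept[-1], p).ratio() >= 0.5:
            continue                                            # skip paragraphs >= 50% similar to the previous one
        kept.append(p)
    random.seed(seed)
    return random.sample(kept, k=int(len(kept) * sample_rate))  # keep a random 20% of what remains
```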

After preprocessing, the training corpus consisted of:
- **455 MB of text**
- **895,125 paragraphs**
- **64.6 million words**
- **1.13 million unique words**
- **119 unique characters**

## Pretraining

The `SK_Morph_BLM` model was trained with the following key parameters (a configuration sketch follows the list):

- **Architecture:** Based on RoBERTa, with 6 hidden layers and 12 attention heads.
- **Hidden size:** 576
- **Vocabulary size:** 50,264 tokens
- **Sequence length:** 256 tokens
- **Dropout:** 0.1
- **Number of parameters:** 58 million
- **Optimizer:** AdamW, learning rate 1×10^(-4), weight decay 0.01
- **Training:** 30 epochs, divided into 3 phases:
  - **Phase 1:** 10 epochs on CPU (4x AMD EPYC 7542), batch size 64, 50 hours per epoch, 139,870 steps total.
  - **Phase 2:** 5 epochs on GPU (1x Nvidia A100 40GB), batch size 64, 100 minutes per epoch, 69,935 steps total.
  - **Phase 3:** 15 epochs on GPU (2x Nvidia A100 40GB), batch size 128, 60 minutes per epoch, 104,910 steps total.
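The hyperparameters above map roughly onto a `RobertaConfig` as sketched below. The feed-forward width (`intermediate_size`) and the +2 position-embedding offset are not stated in this card, so those two values are assumptions:

```python
from transformers import RobertaConfig, RobertaForMaskedLM

config = RobertaConfig(
    vocab_size=50_264,
    hidden_size=576,
    num_hidden_layers=6,
    num_attention_heads=12,
    intermediate_size=4 * 576,        # assumption: the usual 4x hidden size
    max_position_embeddings=256 + 2,  # assumption: 256-token sequences plus RoBERTa's position offset
    hidden_dropout_prob=0.1,
    attention_probs_dropout_prob=0.1,
)

model = RobertaForMaskedLM(config)
print(f"{model.num_parameters() / 1e6:.1f}M parameters")  # should land close to the reported 58M
```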

The model was trained with the Hugging Face Transformers library, but with a native PyTorch training loop rather than the `Trainer` class. A minimal sketch of such a loop is shown below.
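The sketch continues from the configuration above and only illustrates the overall shape of the loop; `train_dataset` is a hypothetical placeholder assumed to yield batches with `input_ids`, `attention_mask`, and masked-token `labels` already prepared with the SKMT tokenizer (batch size and epoch count follow Phase 3):

```python
import torch
from torch.utils.data import DataLoader

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)  # RobertaForMaskedLM from the configuration sketch above

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4, weight_decay=0.01)
loader = DataLoader(train_dataset, batch_size=128, shuffle=True)  # train_dataset is hypothetical

model.train()
for epoch in range(15):
    for batch in loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        loss = model(**batch).loss       # the MLM loss is computed from the `labels` field
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
```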

## Fine-Tuned Versions of the SK_Morph_BLM Model

The following fine-tuned versions of the `SK_Morph_BLM` model are available:

- [`SK_Morph_BLM-ner`](https://huggingface.co/daviddrzik/SK_Morph_BLM-ner): Fine-tuned for Named Entity Recognition (NER) tasks.
- [`SK_Morph_BLM-pos`](https://huggingface.co/daviddrzik/SK_Morph_BLM-pos): Fine-tuned for Part-of-Speech (POS) tagging.
- [`SK_Morph_BLM-qa`](https://huggingface.co/daviddrzik/SK_Morph_BLM-qa): Fine-tuned for Question Answering tasks.
- [`SK_Morph_BLM-sentiment-csfd`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-csfd): Fine-tuned for sentiment analysis on the CSFD (movie review) dataset.
- [`SK_Morph_BLM-sentiment-multidomain`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-multidomain): Fine-tuned for sentiment analysis across multiple domains.
- [`SK_Morph_BLM-sentiment-reviews`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-reviews): Fine-tuned for sentiment analysis on general review datasets.
- [`SK_Morph_BLM-topic-news`](https://huggingface.co/daviddrzik/SK_Morph_BLM-topic-news): Fine-tuned for topic classification in news articles.

## Citation

If you find our model or paper useful, please consider citing our work:

### Article:
Držík, D., & Forgac, F. (2024). Slovak morphological tokenizer using the Byte-Pair Encoding algorithm. PeerJ Computer Science, 10, e2465. https://doi.org/10.7717/peerj-cs.2465

### BibTeX Entry:
```bib
@article{drzik2024slovak,
  title={Slovak morphological tokenizer using the Byte-Pair Encoding algorithm},
  author={Držík, Dávid and Forgac, František},
  journal={PeerJ Computer Science},
  volume={10},
  pages={e2465},
  year={2024},
  month={11},
  issn={2376-5992},
  doi={10.7717/peerj-cs.2465}
}
```