	---
	license: mit
	language:
	- sk
	datasets:
	- oscar-corpus/OSCAR-2109
	pipeline_tag: fill-mask
	library_name: transformers
	tags:
	- slovak-language-model
	---
	# Slovak Morphological Baby Language Model (SK_Morph_BLM)

	SK_Morph_BLM is a small pretrained language model for Slovak, based on the RoBERTa architecture. It uses a custom morphological tokenizer (SKMT, described in the [Slovak_subword_tokenizers repository](https://github.com/daviddrzik/Slovak_subword_tokenizers)) designed specifically for Slovak to preserve the integrity of root morphemes. Because of this, the tokenizer is not compatible with the standard `RobertaTokenizer` from the Hugging Face library and must be loaded from this repository. The model is uncased: input text is lowercased before tokenization. Although the pretrained model can be used directly for masked language modeling, it is primarily intended for fine-tuning on downstream NLP tasks.

	## How to Use the Model

	To use the SK_Morph_BLM model, follow these steps:

	```python
	import sys

	import torch
	from huggingface_hub import snapshot_download
	from transformers import AutoModelForMaskedLM

	# Download the repository from Hugging Face and append the path to sys.path
	repo_path = snapshot_download(repo_id="daviddrzik/SK_Morph_BLM")
	sys.path.append(repo_path)

	# Import the custom tokenizer from the downloaded repository
	from SKMT_lib_v2.SKMT_BPE import SKMorfoTokenizer

	# Initialize the tokenizer and model
	tokenizer = SKMorfoTokenizer()
	model = AutoModelForMaskedLM.from_pretrained("daviddrzik/SK_Morph_BLM")

	# Fill in the masked token in a given text and return the top_k candidates
	def fill_mask(text, tokenizer, model, top_k=5):
	    inputs = tokenizer.tokenize(text.lower(), max_length=256, return_tensors='pt', return_subword=False)
	    # Locate the <mask> position(s); 4 is the mask token ID used by this tokenizer
	    mask_token_index = torch.where(inputs["input_ids"][0] == 4)[0]
	    with torch.no_grad():
	        predictions = model(**inputs)

	    # Convert the logits at the masked position(s) to probabilities and keep the top_k
	    probabilities = torch.softmax(predictions.logits[0, mask_token_index], dim=-1)
	    topk_tokens = torch.topk(probabilities, k=top_k, dim=-1).indices

	    fill_results = []
	    for idx in range(len(mask_token_index)):
	        for token_idx in topk_tokens[idx]:
	            token_text = tokenizer.convert_ids_to_tokens(token_idx.item())
	            token_text = token_text.replace("Ġ", " ")  # Replace the "Ġ" marker with a space
	            fill_results.append({
	                'score': probabilities[idx, token_idx].item(),
	                'token': token_idx.item(),
	                'token_str': token_text,
	                'sequence': text.replace("<mask>", token_text.strip())
	            })

	    fill_results.sort(key=lambda x: x['score'], reverse=True)
	    return fill_results

	# Example usage of the function
	text = "Včera večer sme <mask> nový film v kine, ktorý mal premiéru iba pred týždňom."
	result = fill_mask(text.lower(), tokenizer, model, top_k=5)
	print(result)

	# Output:
	[{'score': 0.4014046788215637,
	  'token': 6626,
	  'token_str': ' videli',
	  'sequence': 'včera večer sme videli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
	 {'score': 0.15018892288208008,
	  'token': 874,
	  'token_str': ' mali',
	  'sequence': 'včera večer sme mali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
	 {'score': 0.057530131191015244,
	  'token': 21193,
	  'token_str': ' pozreli',
	  'sequence': 'včera večer sme pozreli nový film v kine, ktorý mal premiéru iba pred týždňom.'},
	 {'score': 0.049020398408174515,
	  'token': 26468,
	  'token_str': ' sledovali',
	  'sequence': 'včera večer sme sledovali nový film v kine, ktorý mal premiéru iba pred týždňom.'},
	 {'score': 0.04107135161757469,
	  'token': 9171,
	  'token_str': ' objavili',
	  'sequence': 'včera večer sme objavili nový film v kine, ktorý mal premiéru iba pred týždňom.'}]
	```
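
The list returned by `fill_mask` is plain Python data, so it can be post-processed without the model. The sketch below is only an illustration: `summarize_predictions` and its `min_score` threshold are hypothetical names that are not part of this repository, and the sample entries are shortened copies of the output shown above.

```python
# Hypothetical helper, not part of the SK_Morph_BLM repository.
def summarize_predictions(results, min_score=0.05):
    """Return (token, percentage) pairs for predictions scoring at least min_score."""
    kept = [r for r in results if r["score"] >= min_score]
    return [(r["token_str"].strip(), round(100 * r["score"], 1)) for r in kept]

# Sample data copied from the example output above (scores shortened).
sample = [
    {"score": 0.4014, "token": 6626, "token_str": " videli"},
    {"score": 0.1502, "token": 874, "token_str": " mali"},
    {"score": 0.0411, "token": 9171, "token_str": " objavili"},
]

summary = summarize_predictions(sample)
print(summary)
```

With the default threshold, only the two strongest candidates from the sample are kept.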

	## Training Data

	The `SK_Morph_BLM` model was pretrained using a subset of the OSCAR 2019 corpus, specifically focusing on the Slovak language. The corpus underwent comprehensive preprocessing to ensure the quality and relevance of the data:

	- Language Filtering: Non-Slovak text was removed to focus solely on the Slovak language.
	- Character Normalization: Various types of spaces, quotes, dashes, and separators were standardized (e.g., replacing different types of spaces with a single space, or dashes with hyphens). Emoticons were replaced with spaces.
	- Symbol and Unwanted Text Removal: Sentences containing mathematical symbols, pictograms, or characters from Asian and African languages were deleted. Duplicates of punctuation, special characters, and spaces were also removed.
	- URL and Text Normalization: All web addresses were removed, and the text was converted to lowercase to simplify tokenization.
	- Content Cleanup: Text that included irrelevant content from web crawling, such as keywords and HTML tags, was identified and removed.
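
A few of the normalization steps listed above can be sketched with regular expressions. This is a minimal illustration, not the pipeline actually used for the corpus: the character classes, the order of the steps, and the `normalize` function name are all assumptions.

```python
import re

# Illustrative patterns only; the exact rules used for the corpus are assumptions.
HTML_TAG = re.compile(r"<[^>]+>")
URL_PATTERN = re.compile(r"(?:https?://|www\.)\S+")
SPACE_VARIANTS = re.compile(r"[\u00a0\u2000-\u200a\u202f\u3000\t]")
DASH_VARIANTS = re.compile(r"[\u2010-\u2015\u2212]")
QUOTE_VARIANTS = re.compile(r"[\u201c\u201d\u201e\u00ab\u00bb]")
REPEATED_PUNCT = re.compile(r"([!?.,;:])\1+")

def normalize(text: str) -> str:
    text = HTML_TAG.sub(" ", text)           # drop HTML tags
    text = URL_PATTERN.sub(" ", text)        # drop web addresses
    text = SPACE_VARIANTS.sub(" ", text)     # unify space characters
    text = DASH_VARIANTS.sub("-", text)      # unify dashes to hyphens
    text = QUOTE_VARIANTS.sub('"', text)     # unify quotation marks
    text = REPEATED_PUNCT.sub(r"\1", text)   # collapse repeated punctuation
    text = re.sub(r" {2,}", " ", text)       # collapse repeated spaces
    return text.lower().strip()

raw = "<p>Navštívte  https://example.com \u2013 \u201eDobrý\u00a0deň\u201c!!!</p>"
print(normalize(raw))
```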

	Additionally, the preprocessing included further refinement steps to create the final dataset:

	- Parentheses Content Removal: All content within parentheses was removed to reduce noise.
	- Selection of Text Segments: Medium-length text paragraphs were selected to maintain consistency.
	- Similarity Filtering: Paragraphs with at least 50% similarity to previous ones were removed to minimize redundancy.
	- Random Sampling: Finally, 20% of the remaining paragraphs were randomly selected.
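
The refinement steps above could look roughly like the following sketch. Several details are assumptions, because the description does not specify them: the similarity measure (here `difflib.SequenceMatcher`), the word bounds for "medium-length", comparison against every previously kept paragraph, and all function names. The pairwise comparison is quadratic, so this form is only practical for small samples.

```python
import random
import re
from difflib import SequenceMatcher

def strip_parentheses(text):
    # Remove non-nested parenthesized content, then tidy the spacing.
    text = re.sub(r"\s*\([^()]*\)", "", text)
    return re.sub(r" {2,}", " ", text).strip()

def is_medium_length(text, min_words=3, max_words=200):
    # The word bounds are placeholders, not the values used for the corpus.
    return min_words <= len(text.split()) <= max_words

def drop_similar(paragraphs, threshold=0.5):
    # Keep a paragraph only if it is less than `threshold` similar to all kept ones.
    kept = []
    for p in paragraphs:
        if all(SequenceMatcher(None, p, q).ratio() < threshold for q in kept):
            kept.append(p)
    return kept

def sample_fraction(paragraphs, fraction=0.2, seed=0):
    # Randomly select the given fraction of paragraphs (seeded for repeatability).
    k = round(len(paragraphs) * fraction)
    return random.Random(seed).sample(paragraphs, k)

paragraphs = [
    "dnes je pekný deň (podľa predpovede) a svieti slnko.",
    "dnes je pekný deň a svieti slnko!",
    "zajtra bude pršať a ochladí sa.",
    "ahoj",
]
cleaned = [strip_parentheses(p) for p in paragraphs]
medium = [p for p in cleaned if is_medium_length(p)]
unique = drop_similar(medium)
print(unique)

sampled = sample_fraction([f"odsek {i}" for i in range(10)])
print(len(sampled))
```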

	After preprocessing, the training corpus consisted of:
	- 455 MB of text
	- 895,125 paragraphs
	- 64.6 million words
	- 1.13 million unique words
	- 119 unique characters
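
Statistics of this kind can be computed in a single pass over the paragraphs. The sketch below assumes that words are whitespace-separated tokens and that the space character counts as a character; the original counting rules are not stated, and `corpus_stats` is a hypothetical helper.

```python
from collections import Counter

def corpus_stats(paragraphs):
    """Count paragraphs, words, unique words, and unique characters."""
    words = Counter()
    chars = set()
    for p in paragraphs:
        words.update(p.split())  # whitespace tokenization (assumption)
        chars.update(p)          # every character, including spaces
    return {
        "paragraphs": len(paragraphs),
        "words": sum(words.values()),
        "unique_words": len(words),
        "unique_characters": len(chars),
    }

stats = corpus_stats(["dnes je pekný deň", "dnes prší"])
print(stats)
```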

	## Pretraining

	The `SK_Morph_BLM` model was trained with the following key parameters:

	- Architecture: Based on RoBERTa, with 6 hidden layers and 12 attention heads.
	- Hidden size: 576
	- Vocabulary size: 50,264 tokens
	- Sequence length: 256 tokens
	- Dropout: 0.1
	- Number of parameters: 58 million
	- Optimizer: AdamW, learning rate 1×10^(-4), weight decay 0.01
	- Training: 30 epochs, divided into 3 phases:
	  - Phase 1: 10 epochs on CPU (4x AMD EPYC 7542), batch size 64, 50 hours per epoch, 139,870 steps total.
	  - Phase 2: 5 epochs on GPU (1x Nvidia A100 40GB), batch size 64, 100 minutes per epoch, 69,935 steps total.
	  - Phase 3: 15 epochs on GPU (2x Nvidia A100 40GB), batch size 128, 60 minutes per epoch, 104,910 steps total.
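
The step counts for the three phases can be reproduced from the corpus size. The calculation below assumes that each of the 895,125 paragraphs is one training example and that the final, partially filled batch of each epoch counts as a step; under those assumptions the results match the figures listed above. The total time is simply the sum of the reported per-epoch durations.

```python
import math

PARAGRAPHS = 895_125  # training examples, taken from the corpus statistics

# (name, epochs, batch size, minutes per epoch) as reported for each phase
PHASES = [
    ("Phase 1", 10, 64, 50 * 60),
    ("Phase 2", 5, 64, 100),
    ("Phase 3", 15, 128, 60),
]

def total_steps(epochs, batch_size, examples=PARAGRAPHS):
    # One optimizer step per batch; the last partial batch is kept.
    return epochs * math.ceil(examples / batch_size)

steps = {name: total_steps(epochs, batch) for name, epochs, batch, _ in PHASES}
hours = sum(epochs * minutes for _, epochs, _, minutes in PHASES) / 60

print(steps)            # steps per phase
print(round(hours, 1))  # total training time in hours
```

Almost all of the roughly 523 hours fall into the CPU phase (500 hours); the two GPU phases together take about 23 hours.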

	The model was trained with the Hugging Face library, using a native PyTorch training loop rather than the `Trainer` class.

	## Fine-Tuned Versions of the SK_Morph_BLM Model

	The following fine-tuned versions of the `SK_Morph_BLM` model are available:

	- [`SK_Morph_BLM-ner`](https://huggingface.co/daviddrzik/SK_Morph_BLM-ner): Fine-tuned for Named Entity Recognition (NER) tasks.
	- [`SK_Morph_BLM-pos`](https://huggingface.co/daviddrzik/SK_Morph_BLM-pos): Fine-tuned for Part-of-Speech (POS) tagging.
	- [`SK_Morph_BLM-qa`](https://huggingface.co/daviddrzik/SK_Morph_BLM-qa): Fine-tuned for Question Answering tasks.
	- [`SK_Morph_BLM-sentiment-csfd`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-csfd): Fine-tuned for sentiment analysis on the CSFD (movie review) dataset.
	- [`SK_Morph_BLM-sentiment-multidomain`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-multidomain): Fine-tuned for sentiment analysis across multiple domains.
	- [`SK_Morph_BLM-sentiment-reviews`](https://huggingface.co/daviddrzik/SK_Morph_BLM-sentiment-reviews): Fine-tuned for sentiment analysis on general review datasets.
	- [`SK_Morph_BLM-topic-news`](https://huggingface.co/daviddrzik/SK_Morph_BLM-topic-news): Fine-tuned for topic classification in news articles.

	## Citation

	If you find our model or paper useful, please consider citing our work:

	### Article:
	Držík, D., & Forgac, F. (2024). Slovak morphological tokenizer using the Byte-Pair Encoding algorithm. PeerJ Computer Science, 10, e2465. https://doi.org/10.7717/peerj-cs.2465

	### BibTeX Entry:
	```bib
	@article{drzik2024slovak,
	  title={Slovak morphological tokenizer using the Byte-Pair Encoding algorithm},
	  author={Držík, Dávid and Forgac, František},
	  journal={PeerJ Computer Science},
	  volume={10},
	  pages={e2465},
	  year={2024},
	  month={11},
	  issn={2376-5992},
	  doi={10.7717/peerj-cs.2465}
	}
	```