Shiraz Language Model

Welcome to Shiraz, the repository for Lifeweb's language model. The first versions of our models are all trained on Divan, our own dataset of more than 164 million documents and more than 10B tokens. Divan has been carefully normalized and deduplicated to ensure its quality and coverage. A better dataset leads to a better model!
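The Divan preprocessing pipeline itself is not described here, so the following is only a minimal sketch of what "normalized and deduplicated" can mean in practice. It assumes a simple normalization (Unicode NFKC, unifying Arabic and Persian letter variants, collapsing whitespace) followed by exact deduplication on a hash of the normalized text. The function names and the character map are hypothetical and are not taken from the Lifeweb codebase.

```python
import hashlib
import re
import unicodedata

# Hypothetical character map (an assumption, not the actual Divan rules):
# Arabic yeh (U+064A) -> Persian yeh (U+06CC), Arabic kaf (U+0643) -> Persian kaf (U+06A9).
_CHAR_MAP = str.maketrans({"\u064a": "\u06cc", "\u0643": "\u06a9"})


def normalize(text: str) -> str:
    """Apply NFKC, unify Arabic/Persian letter variants, and collapse whitespace."""
    text = unicodedata.normalize("NFKC", text)
    text = text.translate(_CHAR_MAP)
    # ZWNJ (U+200C) is not whitespace, so half-spaces inside words are preserved.
    return re.sub(r"\s+", " ", text).strip()


def deduplicate(docs):
    """Return normalized documents, keeping the first copy of each exact duplicate."""
    seen = set()
    unique = []
    for doc in docs:
        norm = normalize(doc)
        digest = hashlib.sha256(norm.encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            unique.append(norm)
    return unique
```

Exact hashing only catches documents that are identical after normalization. Near-duplicate detection (for example, MinHash) would be a separate step and is not shown.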

Use Model

You can easily access the models using the sample code provided below.

from transformers import AutoTokenizer, AutoModelForMaskedLM, FillMaskPipeline
# Load the v1.0 tokenizer and masked-LM weights
model_name = "lifeweb-ai/shiraz"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForMaskedLM.from_pretrained(model_name)

text = "در همین لحظه که شما مشغول خواندن این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."
print(tokenizer.tokenize(text))

# ['در', 'همین', 'لحظه', 'که', 'شما', 'مشغول', 'خواندن', 'این', 'متن', 'هستید،', 'میلیون', '[zwnj]', 'ها', 'دیتا', 'در', 'فضای', 'انلاین', 'در', 'حال', 'تولید', 'است', '.', 'ما', 'در', 'لایف', 'وب', 'به', 'جمع', '[zwnj]', 'اوری', '##،', 'پردازش', 'و', 'تحلیل', 'این', 'کلان', 'داده', '(', 'big', 'data', ')', 'می', '[zwnj]', 'پردازیم', '.', '.']

# Fill-mask task: predict the token hidden behind [MASK]
text = "در همین لحظه که شما مشغول [MASK] این متن هستید، میلیون‌ها دیتا در فضای آنلاین در حال تولید است. ما در لایف وب به جمع‌آوری، پردازش و تحلیل این کلان داده (Big Data) می‌پردازیم."

fill_mask = FillMaskPipeline(model=model, tokenizer=tokenizer)
result = fill_mask(text)
print(result[0])  # highest-scoring prediction
#{'score': 0.3584367036819458, 'token': 5764, 'token_str': 'خواندن', 'sequence': 'در همین لحظه که شما مشغول خواندن این متن هستید، میلیون ها دیتا در فضای انلاین در حال تولید است. ما در لایف وب به جمع اوری، پردازش و تحلیل این کلان داده ( big data ) می پردازیم.'}
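The pipeline call above returns a list of candidate dictionaries, of which only the first is printed. As a small sketch, the helper below ranks such a list and keeps the top candidates above a score threshold. The name `top_predictions` and its parameters are hypothetical, and the helper relies only on the `score` and `token_str` keys visible in the output above.

```python
from typing import Dict, List, Tuple


def top_predictions(
    results: List[Dict], k: int = 3, min_score: float = 0.0
) -> List[Tuple[str, float]]:
    """Return up to k (token_str, score) pairs, highest score first."""
    # Sort defensively in case the caller passes an unsorted list.
    ranked = sorted(results, key=lambda r: r["score"], reverse=True)
    kept = [(r["token_str"], r["score"]) for r in ranked if r["score"] >= min_score]
    return kept[:k]
```

With the snippet above, `top_predictions(result, k=3)` would list the three most likely fillers for the `[MASK]` position.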

Results

Shiraz is evaluated on three downstream NLP tasks: NER, sentiment analysis, and emotion detection. It is considerably faster than BERT-base, and it gives up little accuracy in exchange for that speed. According to the MobileBERT paper, this model is 4.3× smaller and 5.5× faster than BERT-base.

The table below reports the macro F1 score for each task, along with a link to Colab code for that task that you can use as a tutorial.

| Model | NER (Arman) | NER (Peyma) | Sentiment (Sentipers, multi) | Sentiment (Snappfood) | Emotion (Arman) |
|---|---|---|---|---|---|
| lifeweb-ai/tehran | 71.87% | 90.79% | 63.75% | 88.74% | 77.73% |
| lifeweb-ai/shiraz | 67.62% (Colab Code) | 86.24% (Colab Code) | 59.17% (Colab Code) | 88.01% (Colab Code) | 66.97% (Colab Code) |
| sbunlp/fabert | 71.23% (Colab Code) | 88.53% (Colab Code) | 58.51% (Colab Code) | 88.60% (Colab Code) | 72.65% |
| ViraIntelligentDataMining/AriaBERT | 69.12% (Colab Code) | 87.15% (Colab Code) | 59.26% (Colab Code) | 87.96% (Colab Code) | 69.11% |
| HooshvareLab/bert-fa-zwnj-base | 67.49% (Colab Code) | 85.73% (Colab Code) | 59.61% (Colab Code) | 87.58% (Colab Code) | 59.27% (Colab Code) |
| HooshvareLab/roberta-fa-zwnj-base | 69.73% (Colab Code) | 86.21% (Colab Code) | 56.23% (Colab Code) | 87.19% (Colab Code) | 57.96% (Colab Code) |
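For readers unfamiliar with the metric, macro F1 is the unweighted mean of the per-class F1 scores, so rare classes count as much as frequent ones. The sketch below computes it for plain label sequences. It is an illustration only: the function name `macro_f1` is hypothetical, and the exact evaluation code behind the table (for example, entity-level scoring for NER) is not specified here and may differ.

```python
from collections import Counter


def macro_f1(y_true, y_pred) -> float:
    """Unweighted mean of per-class F1 over all labels seen in either sequence."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and equally long")

    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative

    labels = set(y_true) | set(y_pred)
    scores = []
    for label in labels:
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the denominator is 0.
        denom = 2 * tp[label] + fp[label] + fn[label]
        scores.append(2 * tp[label] / denom if denom else 0.0)
    return sum(scores) / len(scores)
```

For example, with true labels `["pos", "pos", "neg", "neu"]` and predictions `["pos", "neg", "neg", "neu"]`, the per-class F1 scores are 2/3, 2/3, and 1, giving a macro F1 of 7/9.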

If you have tested our models on a public dataset and want to add your results to the table above, open a pull request or contact us. Please also make your code available online so that we can add a reference to it.

Cite

You are welcome to use our language models in your work or research. If you do, we kindly ask you to cite them using the following entry:

@misc{Shiraz,
    author = {Mehrdad Azizi and Reza Salehi Chegeni and Parisa Mousavi and Iman Hashemi},
    title = {Optimizing Pre-trained BERT-based Models for Persian Language Processing},
    year = {2024},
    publisher = {LifeWeb}
}

Releases

v1.0 (2024-03-09)

First version of the Shiraz model, trained on DIVAN.

Model size: 46.6M params (Safetensors, tensor type F32)