Hafez_Bert_based language model

Hafez is a BERT-based language model for Persian, named after the famous Persian poet from Shiraz, Iran. The main points are summarized below:

Model Type: Hafez is based on the BERT architecture, which is a popular model for natural language processing (NLP).

Cultural Reference: The model is named after Hafez, a renowned Persian poet known for his deeply emotional and philosophical verses. The name suggests a connection to Persian literature and an intention to handle language in a way that resonates with the poet's cultural significance.

Training Data: The model was trained on a dataset of more than 12 billion tokens. The training text has two parts: 90% is educational material (research papers, dissertations, and theses), and the remaining 10% is general text. This selection is intended to give the model a strong foundation in academic language and discourse.
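As a rough illustration of the split described above, the sketch below turns the stated percentages into approximate token counts. The card only says "over 12 billion", so 12 billion is used here as an assumed lower bound, and the names `TOTAL_TOKENS`, `SHARES`, and `split_tokens` are made up for this example.

```python
# Assumed lower bound: the card states "over 12 billion" tokens, not an exact count.
TOTAL_TOKENS = 12_000_000_000

# Percentage shares as described in the card.
SHARES = {"educational": 90, "general": 10}


def split_tokens(total, shares):
    """Split a token budget according to integer percentage shares."""
    if sum(shares.values()) != 100:
        raise ValueError("shares must add up to 100 percent")
    return {name: total * pct // 100 for name, pct in shares.items()}


breakdown = split_tokens(TOTAL_TOKENS, SHARES)
for name, count in breakdown.items():
    print(f"{name}: at least {count / 1e9:.1f}B tokens")
```

Under this assumption, the educational portion is at least about 10.8 billion tokens and the general portion at least about 1.2 billion.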

Text Cleaning and Preprocessing: The training data was cleaned and preprocessed before training, a step that is essential for making sure the data is of high quality and suitable for training a machine learning model. Cleaning and preparation were done with the Viravirast text tools.
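The card does not document what the Viravirast text tools actually do, so the following is only a generic sketch of the kind of normalization that is common for Persian text: mapping Arabic code points to their Persian equivalents and collapsing whitespace. The function `normalize_persian` and the character map are assumptions for illustration, not the Viravirast implementation.

```python
import re

# Map Arabic code points that often appear in Persian text to their Persian forms.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Persian yeh
    "\u0643": "\u06A9",  # Arabic kaf -> Persian kaf
})


def normalize_persian(text):
    """Apply a minimal, generic normalization to Persian text."""
    text = text.translate(CHAR_MAP)
    # Collapse runs of whitespace; the zero-width non-joiner (U+200C) is not
    # matched by \s, so it is left untouched.
    text = re.sub(r"\s+", " ", text)
    return text.strip()
```

A real cleaning pipeline would likely do much more (deduplication, filtering, digit handling); this only shows the general idea.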

How to use

from transformers import pipeline
unmasker = pipeline('fill-mask', model='ViravirastSHZ/Hafez_Bert')
print(unmasker("شیراز یکی از زیباترین [MASK] ایران است."))
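The `fill-mask` pipeline returns a list of dictionaries for a single masked input, each with `score`, `token`, `token_str`, and `sequence` keys. The helper below is a small sketch for printing only the most likely candidates; `summarize_predictions` is a hypothetical name introduced for this example and is not part of `transformers`.

```python
def summarize_predictions(predictions, top_k=3):
    """Return (token_str, score) pairs from fill-mask output, best first."""
    ranked = sorted(predictions, key=lambda p: p["score"], reverse=True)
    return [(p["token_str"].strip(), round(p["score"], 4)) for p in ranked[:top_k]]
```

With the `unmasker` defined above, it could be called as `summarize_predictions(unmasker("شیراز یکی از زیباترین [MASK] ایران است."))`.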

Results

We evaluated the Hafez language model on a text classification task, using the F1 score as the primary metric. We welcome others to test its performance on other downstream tasks and applications.

Text classification on Msobhi/virgool_62k

Model                       Test F1     Code
ViravirastSHZ/Hafez_Bert    0.437764    Colab
lifeweb-ai/shiraz           0.349834    Colab
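For readers who want to reproduce this kind of metric on their own predictions, here is a minimal pure-Python sketch of per-class and macro-averaged F1. The card does not say which averaging was used for the scores above, so macro averaging is an assumption, and `f1_scores` is a name chosen for this example.

```python
from collections import Counter


def f1_scores(y_true, y_pred):
    """Return per-class F1 and the macro average (unweighted mean over classes)."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("y_true and y_pred must be non-empty and of equal length")
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(y_true, y_pred):
        if truth == pred:
            tp[truth] += 1
        else:
            fp[pred] += 1   # predicted class gains a false positive
            fn[truth] += 1  # true class gains a false negative
    per_class = {}
    for label in sorted(set(y_true) | set(y_pred)):
        denom = 2 * tp[label] + fp[label] + fn[label]
        per_class[label] = 2 * tp[label] / denom if denom else 0.0
    macro = sum(per_class.values()) / len(per_class)
    return per_class, macro
```

In practice a library routine such as scikit-learn's `f1_score` would normally be used; the version above just makes the definition explicit.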

Cite

@misc{Hafez_language_model,
  author = {Amin Rahmani},
  title = {[Pre-trained BERT-based language Model for Persian Language]},
  year = {2024},
  publisher = {Viravirast}
}

Contributor

- Amin Rahmani (Viravirast)
- Amin Rahmani (LinkedIn)

Model size: 110M parameters (Safetensors, tensor type F32)