---
language:
- he
pipeline_tag: token-classification
tags:
- Transformers
- PyTorch
---
## MenakBERT

MenakBERT is a Hebrew diacritizer built on a character-level, BERT-style masked language model that was pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020). It predicts diacritical marks (niqqud) for undotted Hebrew text in a sequence-to-sequence fashion.
### Model Description
This model takes tau/tavbert-he and adds a three-headed classification head (sketched below) that outputs three sequences, corresponding to the three types of Hebrew niqqud (diacritics). It was fine-tuned on the dataset generously provided by Elazar Gershuni of Nakdimon.

- **Developed by:** Jacob Gidron, Ido Cohen and Idan Pinto
- **Model type:** BERT
- **Language:** Hebrew
- **Finetuned from model:** tau/tavbert-he
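
As a rough illustration of this architecture (not the exact MenakBERT code; the class name, head names and label-set sizes below are assumptions), the three-headed setup can be sketched as:

```python
import torch.nn as nn
from transformers import AutoModel

class ThreeHeadDiacritizer(nn.Module):
    """Char-level backbone with one classification head per diacritic type.
    Label-set sizes here are illustrative placeholders."""

    def __init__(self, backbone_name="tau/tavbert-he",
                 n_shin_sin=3, n_dagesh=3, n_niqqud=16):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        self.shin_sin_head = nn.Linear(hidden, n_shin_sin)  # shin vs. sin dot
        self.dagesh_head = nn.Linear(hidden, n_dagesh)      # central dot
        self.niqqud_head = nn.Linear(hidden, n_niqqud)      # remaining marks

    def forward(self, input_ids, attention_mask=None):
        # One hidden vector per input character (plus special tokens).
        states = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        return {
            "shin_sin": self.shin_sin_head(states),
            "dagesh": self.dagesh_head(states),
            "niqqud": self.niqqud_head(states),
        }
```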
### Model Sources
- **Repository:** https://github.com/jacobgidron/MenakBert
## Use

The model expects undotted Hebrew text, which may contain numbers and punctuation.

The output consists of three sequences of diacritical marks, corresponding to:

1. The dot distinguishing the letters Shin and Sin.
2. The dot in the center of a letter, which in some cases changes the pronunciation of certain letters and in other cases has an effect similar to emphasis on the letter, or gemination.
3. All the remaining marks, used mostly for vocalization.

Each sequence has the same length as the input; each mark corresponds to the character at the same position in the input.

The provided script weaves the sequences together.
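
As a rough sketch of that weaving step (assuming the marks are represented as Unicode combining characters, with an empty string meaning "no mark"; this is an illustration, not the repository's script):

```python
def weave(text, shin_sin, dagesh, niqqud):
    """Interleave undotted text with three aligned mark sequences."""
    out = []
    for ch, s, d, n in zip(text, shin_sin, dagesh, niqqud):
        # Unicode convention: base letter first, then its combining marks.
        out.append(ch + s + d + n)
    return "".join(out)

# Toy example: "שלום" with a shin dot and qamats on the first letter
# and a holam on the vav, producing the dotted form of the word.
print(weave("שלום",
            ["\u05C1", "", "", ""],         # shin/sin dots
            ["", "", "", ""],               # central dots
            ["\u05B8", "", "\u05B9", ""]))  # other vocalization marks
```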
## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
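
In the meantime, here is an illustrative sketch of the intended workflow. The checkpoint-loading step is only outlined, since the fine-tuned three-headed model is defined in the GitHub repository rather than as a standard `transformers` class:

```python
from transformers import AutoTokenizer

# The backbone tokenizer is character-level, so each Hebrew letter, digit
# and punctuation mark in the input maps to a single token.
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-he")

text = "שלום עולם"  # undotted Hebrew input
enc = tokenizer(text, return_tensors="pt")
print(enc["input_ids"].shape)  # (1, number of characters + special tokens)

# With the fine-tuned model from https://github.com/jacobgidron/MenakBert:
#   logits = model(**enc)            # three per-character logit tensors
#   marks = {k: v.argmax(-1) for k, v in logits.items()}
#   dotted = weave(text, ...)        # see the weaving sketch above
```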
### Training Data

The backbone tau/tavbert-he was trained on the Hebrew section of OSCAR (Ortiz, 2019): 10 GB of text, 20 million sentences.

Fine-tuning was done on the Nakdimon dataset, which can be found at https://github.com/elazarg/hebrew_diacritized and contains 274,436 dotted Hebrew tokens across 413 documents.

For more information see https://arxiv.org/abs/2105.05209.
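
For illustration, a dotted training example can be split into an undotted input string plus three aligned label streams along these lines (a sketch based on the Unicode niqqud code points; the repository's actual preprocessing and label inventory may differ):

```python
# Unicode Hebrew points: vocalization marks, the central dot, and shin/sin dots.
NIQQUD = {chr(c) for c in range(0x05B0, 0x05BC)} | {"\u05C7"}
DAGESH = "\u05BC"
SHIN_SIN = {"\u05C1", "\u05C2"}

def split_dotted(word):
    """Split dotted text into plain characters and three aligned mark lists.
    Assumes every mark follows the base letter it belongs to."""
    chars, niqqud, dagesh, shin_sin = [], [], [], []
    for ch in word:
        if ch in NIQQUD:
            niqqud[-1] = ch
        elif ch == DAGESH:
            dagesh[-1] = ch
        elif ch in SHIN_SIN:
            shin_sin[-1] = ch
        else:  # a base character: letter, digit, space, punctuation
            chars.append(ch)
            niqqud.append("")
            dagesh.append("")
            shin_sin.append("")
    return "".join(chars), shin_sin, dagesh, niqqud

print(split_dotted("שָׁלוֹם"))  # ('שלום', ['\u05c1', '', '', ''], ..., ...)
```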
## Model Card Contact

Ido Cohen - [email protected]