---
language:
- he
pipeline_tag: token-classification
tags:
- Transformers
- PyTorch
---
## MenakBERT

MenakBERT is a Hebrew diacritizer built on a character-level, BERT-style masked language model that was pre-trained by masking spans of characters, similarly to SpanBERT (Joshi et al., 2020). It predicts diacritical marks (niqqud) for undotted Hebrew text in a sequence-to-sequence fashion.
### Model Description
This model takes tau/tavbert-he and adds a three-headed classification head (sketched below) that outputs three sequences, corresponding to the three types of Hebrew niqqud (diacritics). It was fine-tuned on the dataset generously provided by Elazar Gershuni of Nakdimon.

- **Developed by:** Jacob Gidron, Ido Cohen and Idan Pinto
- **Model type:** BERT
- **Language:** Hebrew
- **Finetuned from model:** tau/tavbert-he
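
As a rough illustration of this architecture (not the exact MenakBERT code; the class name, head names and label-set sizes below are assumptions), the three-headed setup can be sketched as:

```python
import torch.nn as nn
from transformers import AutoModel

class ThreeHeadDiacritizer(nn.Module):
    """Char-level backbone with one classification head per diacritic type.
    Label-set sizes here are illustrative placeholders."""

    def __init__(self, backbone_name="tau/tavbert-he",
                 n_shin_sin=3, n_dagesh=3, n_niqqud=16):
        super().__init__()
        self.backbone = AutoModel.from_pretrained(backbone_name)
        hidden = self.backbone.config.hidden_size
        self.shin_sin_head = nn.Linear(hidden, n_shin_sin)  # shin vs. sin dot
        self.dagesh_head = nn.Linear(hidden, n_dagesh)      # central dot
        self.niqqud_head = nn.Linear(hidden, n_niqqud)      # remaining marks

    def forward(self, input_ids, attention_mask=None):
        # One hidden vector per input character (plus special tokens).
        states = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        return {
            "shin_sin": self.shin_sin_head(states),
            "dagesh": self.dagesh_head(states),
            "niqqud": self.niqqud_head(states),
        }
```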
### Model Sources
- **Repository:** https://github.com/jacobgidron/MenakBert
## Use

The model expects undotted Hebrew text, which may contain numbers and punctuation.

The output consists of three sequences of diacritical marks, corresponding to:

1. The dot distinguishing the letters Shin and Sin.
2. The dot in the center of a letter, which in some cases changes the pronunciation of certain letters and in other cases has an effect similar to emphasis on the letter, or gemination.
3. All the remaining marks, used mostly for vocalization.

Each sequence has the same length as the input; each mark corresponds to the character at the same position in the input.

The provided script weaves the sequences together.
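
As a rough sketch of that weaving step (assuming the marks are represented as Unicode combining characters, with an empty string meaning "no mark"; this is an illustration, not the repository's script):

```python
def weave(text, shin_sin, dagesh, niqqud):
    """Interleave undotted text with three aligned mark sequences."""
    out = []
    for ch, s, d, n in zip(text, shin_sin, dagesh, niqqud):
        # Unicode convention: base letter first, then its combining marks.
        out.append(ch + s + d + n)
    return "".join(out)

# Toy example: "שלום" with a shin dot and qamats on the first letter
# and a holam on the vav, producing the dotted form of the word.
print(weave("שלום",
            ["\u05C1", "", "", ""],         # shin/sin dots
            ["", "", "", ""],               # central dots
            ["\u05B8", "", "\u05B9", ""]))  # other vocalization marks
```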
## How to Get Started with the Model

Use the code below to get started with the model.

[More Information Needed]
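
In the meantime, here is an illustrative sketch of the intended workflow. The checkpoint-loading step is only outlined, since the fine-tuned three-headed model is defined in the GitHub repository rather than as a standard `transformers` class:

```python
from transformers import AutoTokenizer

# The backbone tokenizer is character-level, so each Hebrew letter, digit
# and punctuation mark in the input maps to a single token.
tokenizer = AutoTokenizer.from_pretrained("tau/tavbert-he")

text = "שלום עולם"  # undotted Hebrew input
enc = tokenizer(text, return_tensors="pt")
print(enc["input_ids"].shape)  # (1, number of characters + special tokens)

# With the fine-tuned model from https://github.com/jacobgidron/MenakBert:
#   logits = model(**enc)            # three per-character logit tensors
#   marks = {k: v.argmax(-1) for k, v in logits.items()}
#   dotted = weave(text, ...)        # see the weaving sketch above
```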
### Training Data

The backbone tau/tavbert-he was trained on the Hebrew section of OSCAR (Ortiz, 2019): 10 GB of text, 20 million sentences.

Fine-tuning was done on the Nakdimon dataset, which can be found at https://github.com/elazarg/hebrew_diacritized and contains 274,436 dotted Hebrew tokens across 413 documents.

For more information see https://arxiv.org/abs/2105.05209.
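
For illustration, a dotted training example can be split into an undotted input string plus three aligned label streams along these lines (a sketch based on the Unicode niqqud code points; the repository's actual preprocessing and label inventory may differ):

```python
# Unicode Hebrew points: vocalization marks, the central dot, and shin/sin dots.
NIQQUD = {chr(c) for c in range(0x05B0, 0x05BC)} | {"\u05C7"}
DAGESH = "\u05BC"
SHIN_SIN = {"\u05C1", "\u05C2"}

def split_dotted(word):
    """Split dotted text into plain characters and three aligned mark lists.
    Assumes every mark follows the base letter it belongs to."""
    chars, niqqud, dagesh, shin_sin = [], [], [], []
    for ch in word:
        if ch in NIQQUD:
            niqqud[-1] = ch
        elif ch == DAGESH:
            dagesh[-1] = ch
        elif ch in SHIN_SIN:
            shin_sin[-1] = ch
        else:  # a base character: letter, digit, space, punctuation
            chars.append(ch)
            niqqud.append("")
            dagesh.append("")
            shin_sin.append("")
    return "".join(chars), shin_sin, dagesh, niqqud

print(split_dotted("שָׁלוֹם"))  # ('שלום', ['\u05c1', '', '', ''], ..., ...)
```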
## Model Card Contact

Ido Cohen - [email protected]