language:
- he
tags:
- roberta
- Language model
pipeline_tag: text-generation
D-Nikud
Welcome to the D-Nikud Diacritization Model main code repository! This repository is dedicated to the implementation of our innovative D-Nikud model, which use the TavBERT architecture and Bi-LSTM to predict and apply diacritics (nikud) to Hebrew text. Diacritics play a crucial role in accurately conveying pronunciation and interpretation, making our model an essential tool for enhancing the quality of Hebrew text analysis.
The code provided here encompasses various functionalities, including prediction, evaluation, and training of the D-Nikud diacritization model.
Repository for the paper D-Nikud: Enhancing Hebrew Diacritization with LSTM and Pretrained Models by Nadav Shaked and Adi Rosenthal.
Prerequisites
Before running the script, make sure you have the following installed:
- Tested with Python 3.10
torch
library (PyTorch)transformers
library- Required Python packages (Install using
pip install -r requirements.txt
)
Table of Contents
Introduction
Our D-Nikud model utilizes the TevBERT architecture and Bi-LSTM for diacritization (nikud) of Hebrew text. Diacritics (nikud) are essential for accurate pronunciation and interpretation of the text. This repository provides the core code for implementing and utilizing the D-Nikud model.
Pre Trained model
Our pre-trained D-Nikud model can be found at Link. To use it, unzip the downloaded file and copy the contents to the 'models' folder.
Usage
Clone the repository:
git clone https://github.com/NadavShaked/D_Nikud.git
cd D-Nikud
Clone D-Nikud data:
git submodule update --init --recursive
Predict
The "Predict" command enables the prediction of diacritics for input text files or folders containing diacritized or un-diacritized text. It generates diacritization predictions using the specified diacritization model and saves the results to the specified output file. Optionally, you can choose to predict text for comparison with Nakdimon using the -c/--compare
flag.
To predict diacritics for input text files or folders, use the following command:
python main.py predict <input_path> <output_path> [-c/--compare <compare_nakdimon>] [-ptmp/--pretrain_model_path <pretrain_model_path>]
<input_path>
: Path to the input file or folder containing text data.<output_path>
: Path to the output file where the predicted diacritized text will be saved.-c/--compare
: Optional. Set toTrue
to predict text for comparison with Nakdimon.-ptmp/--pretrain_model_path
: Optional. Path to the pre-trained model weights to be used for prediction. If not provided, the command will default to using our pre-trained D-Nikud model.
For example, to predict diacritics for a specific input text file and save the results to an output file, you can execute:
python main.py predict input.txt output.txt
If you wish to predict text for comparison with Nakdimon and specify a custom pre-trained model path, you can use:
python main.py predict input_folder output_folder -c True -ptmp path/to/pretrained/model.pth
Here, the command will predict diacritics for the texts in the input_folder
, generate output files in the output_folder
, and use the specified pre-trained model for prediction.
You can adapt the paths and options to suit your project's requirements. If the -ptmp parameter is omitted, the command will automatically employ our default pre-trained D-Nikud model for prediction.
Evaluate
The "Evaluate" command assesses the performance of the diacritization model by computing accuracy metrics for specific diacritics elements: nikud, dagesh, sin, as well as overall letter and word accuracy. This evaluation process involves comparing the model's diacritization results with the original diacritics text, providing insights into the model's effectiveness in accurately predicting and applying diacritics.
To evaluate the diacritization model, you can use the following command:
python main.py evaluate <input_path> [-ptmp/--pretrain_model_path <pretrain_model_path>] [-df/--plots_folder <plots_folder>] [-es/--eval_sub_folders]
<input_path>
: Path to the input file or folder containing text data for evaluation.-ptmp/--pretrain_model_path
: Optional. Path to the pre-trained model weights to be employed for evaluation. If this parameter is not specified, the command will default to using our pre-trained D-Nikud model.-df/--plots_folder
: Optional. Path to the folder where evaluation plots will be saved. If not provided, the default plots folder will be used.-es/--eval_sub_folders
: Optional. Include this flag to enable accuracy calculation for sub-folders within theinput_path
folder, providing independent assessments for each subfolder.
For example, to evaluate the diacritization model's performance on a specific dataset, you might run:
python main.py evaluate dataset_folder -ptmp path/to/pretrained/model.pth -df evaluation_plots
This command will evaluate the model's accuracy on the dataset found in the dataset_folder
, using the specified pre-trained model weights and saving evaluation plots in the evaluation_plots
folder.
Train
The "Train" command enables the training of the diacritization model using your own dataset. This command supports fine-tuning a pre-trained model, adjusting hyperparameters such as learning rate and batch size, and specifying various training settings.
⚠️ Important Note: Any file or folder in the specified data folder that contains the string "not_use" or "NakdanResults" in its name will be excluded from the training and testing processes. This feature allows you to selectively exclude specific data from the training process if needed.
To train the diacritization model, use the following command:
python main.py train [--learning_rate <learning_rate>] [--batch_size <batch_size>]
[--n_epochs <n_epochs>] [--data_folder <data_folder>] [--checkpoints_frequency <checkpoints_frequency>]
[-df/--plots_folder <plots_folder>] [-ptmp/--pretrain_model_path <pretrain_model_path>]
--learning_rate
: Optional. Learning rate for training (default is 0.001).--batch_size
: Optional. Batch size for training (default is 32).--n_epochs
: Optional. Number of training epochs (default is 10).--data_folder
: Optional. Path to the folder containing training data (default is "data").--checkpoints_frequency
: Optional. Frequency of saving model checkpoints during training (default is 1).-df/--plots_folder
: Optional. Path to the folder where training plots will be saved.-ptmp/--pretrain_model_path
: Optional. Path to the pre-trained model weights to be used for training continuation. Use this only if you want to fine-tune a specific pre-trained model.
⚠️ Folder Structure: The --data_folder
must have the following structure:
- data_folder
- train
- Contains training data
- dev
- Contains development/validation data
- test
- Contains testing data
- train
For instance, to initiate training with a specified learning rate, batch size, and number of epochs, you can execute:
python main.py train --learning_rate 0.001 --batch_size 16 --n_epochs 20
If you want to continue training from a pre-trained model and save model checkpoints every 3 epochs, you can use:
python main.py train --checkpoints_frequency 3 -ptmp path/to/pretrained/model.pth
In this example, the command will resume training from the specified pre-trained model that located at 'path/to/pretrained/model.pth' and save checkpoints every 3 epochs. Training plots will be saved in the specified plots folder.
Remember to adjust the command options according to your training requirements and preferences. If you don't provide the -ptmp
parameter, the command will start training from scratch using the default D-Nikud model architecture.
Acknowledgments
This script utilizes the D-Nikud model developed by Adi Rosenthal and Nadav Shaked.
License
This code is provided under the MIT License. You are free to use, modify, and distribute the code according to the terms of the license.