---
license: cc-by-4.0
language:
- ru
library_name: nemo
pipeline_tag: token-classification
tags:
- G2P
- Grapheme-to-Phoneme
---
# Russian G2P token classification model
This is a non-autoregressive model for Russian grapheme-to-phoneme (G2P) conversion based on the BERT architecture. It predicts phonemes in IPA format.
The initial data was built from the Wiktionary JSON dump at https://kaikki.org/dictionary/Russian/index.html
## Intended uses & limitations
The input is expected to consist of Cyrillic letters separated by spaces. Real spaces should be replaced with underscores (`_`).
Note that the model was trained on single words and some short phrases.
Although it can accept longer phrases, its accuracy may degrade on them.
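
The input format described above can be produced with a few lines of Python. This is a minimal sketch, not part of NeMo: the helper name `to_g2p_input` is hypothetical, and the lowercasing and Unicode NFC normalization are assumptions (the examples in this card are all lowercase, and NFC keeps letters such as `й` and `ё` as single characters).

```python
import unicodedata


def to_g2p_input(text: str) -> str:
    """Format a word or short phrase as space-separated characters.

    Hypothetical helper: real spaces become "_" and every character
    (including hyphens) becomes its own space-separated token.
    """
    # Assumption: compose characters (NFC) so that letters like "й"
    # are not split into a base letter plus a combining mark.
    text = unicodedata.normalize("NFC", text)
    # Assumption: lowercase, because the card's examples are lowercase.
    # Runs of whitespace collapse into a single underscore.
    normalized = "_".join(text.lower().split())
    return " ".join(normalized)


if __name__ == "__main__":
    for phrase in ["исход", "ганс-юрген", "царского села"]:
        print(to_g2p_input(phrase))
```

Writing one formatted line per word or phrase to a text file gives an input file of the kind expected by the inference script.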
### How to use
Install NeMo.
Download `ru_g2p.nemo` (this model):
```bash
git lfs install
git clone https://huggingface.co/bene-ges/ru_g2p_ipa_bert_large
```
Run the inference script:
```bash
python ${NEMO_ROOT}/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
pretrained_model=ru_g2p_ipa_bert_large/ru_g2p.nemo \
inference.from_file=input.txt \
inference.out_file=output.txt \
model.max_sequence_len=512 \
inference.batch_size=128 \
lang=ru
```
Example of input file:
```
и с х о д
т р а н с н е п т у н о в ы х
т е л я т н и к о в с к о е
ц а р с к о г о
к р о с х о ф
г а н с - ю р г е н
д а р д а н е л л
```
Example of output file:
```
ɪ s x 'o t и с х о д ɪ s x 'o t ɪ s x 'o t PLAIN PLAIN PLAIN PLAIN PLAIN
t r a nʲ sʲ nʲ ɪ p t 'u n ə v ɨ x т р а н с н е п т у н о в ы х t r a nʲ sʲ nʲ ɪ p t 'u n ə v ɨ x t r a nʲ sʲ nʲ ɪ p t 'u n ə v ɨ x PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
tʲ ɪ lʲ 'æ tʲ nʲ ɪ k ə f s k ə jə т е л я т н и к о в с к о е tʲ ɪ lʲ 'æ tʲ nʲ ɪ k ə f s k ə jə tʲ ɪ lʲ 'æ tʲ nʲ ɪ k ə f s k ə jə PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
t~s 'a r s k ə v ə ц а р с к о г о t~s 'a r s k ə v ə t~s 'a r s k ə v ə PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
k r ɐ s x 'o f к р о с х о ф k r ɐ s x 'o f k r ɐ s x 'o f PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
ɡ a n s 'ju r ɡʲ ɪ n г а н с - ю р г е н ɡ a n s _ 'ju r ɡʲ ɪ n ɡ a n s _ 'ju r ɡʲ ɪ n PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
d ə r d ɐ n 'ɛ ɫ д а р д а н е л л d ə r d ɐ n 'ɛ <DELETE> ɫ d ə r d ɐ n 'ɛ <DELETE> ɫ PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
```
Note that the correct output tags are in the **third** column; the input is in the second column.
Tags correspond to input letters in a one-to-one fashion. If you remove the `<DELETE>` tags, `+`, and spaces, you should get an IPA-like transcription.
The model does not predict secondary stress. The primary stress mark is placed directly before the stressed vowel. In some cases the stress mark may be missing.
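
The post-processing described above (dropping `<DELETE>`, `+`, and spaces from the tag column) could look roughly like the following. This is a sketch under stated assumptions: the helper names `tags_to_ipa` and `third_column` are hypothetical, and the output columns are assumed to be tab-separated, which should be checked against the actual output file. Other symbols, such as `_` and `~`, are left untouched because the card does not say how to treat them.

```python
def tags_to_ipa(tags: str) -> str:
    """Join space-separated tags into an IPA-like string.

    Removes the "<DELETE>" tag, "+", and all whitespace, as described
    in the card. Anything else (e.g. "_" or "~") is kept as-is.
    """
    for marker in ("<DELETE>", "+"):
        tags = tags.replace(marker, "")
    return "".join(tags.split())


def third_column(line: str, sep: str = "\t") -> str:
    """Return the third column (the tags) of one output line.

    Assumption: columns are separated by tabs; pass another `sep`
    if the output file uses a different delimiter.
    """
    columns = line.rstrip("\n").split(sep)
    if len(columns) < 3:
        raise ValueError("expected at least three columns")
    return columns[2]


if __name__ == "__main__":
    print(tags_to_ipa("d ə r d ɐ n 'ɛ <DELETE> ɫ"))  # dərdɐn'ɛɫ
    print(tags_to_ipa("ɪ s x 'o t"))                 # ɪsx'ot
```

Applying `tags_to_ipa(third_column(line))` to each line of the output file would then yield one transcription per input line, provided the tab-separation assumption holds.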