English G2P token classification model

This is a non-autoregressive model for English grapheme-to-phoneme (G2P) conversion based on the BERT architecture. It predicts phonemes in CMU format. The initial data was built using CMUdict v0.07.

Intended uses & limitations

The input is expected to contain English words consisting of Latin letters and apostrophes, one word per line, with all characters separated by spaces.
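The input format described above can be produced with a few lines of Python. This is a minimal sketch, not part of NeMo: the helper name `to_g2p_input` is hypothetical, and lowercasing the word is an assumption based on the lowercase examples in this card.

```python
import re

# Hypothetical helper (not a NeMo API): formats one word for the model input.
# Assumption: words are lowercased, as in the example input file of this card.
_ALLOWED = re.compile(r"[a-z']+")


def to_g2p_input(word: str) -> str:
    """Return the word with its characters separated by single spaces."""
    w = word.strip().lower()
    if not _ALLOWED.fullmatch(w):
        # Only Latin letters and apostrophes are expected by the model.
        raise ValueError(f"unsupported characters in {word!r}")
    return " ".join(w)


print(to_g2p_input("Jocelyn"))    # j o c e l y n
print(to_g2p_input("marceca's"))  # m a r c e c a ' s
```

Writing one such line per word to a text file gives a file in the same shape as the example input shown in this card.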

How to use

Install NeMo.

Download en_g2p.nemo (this model):

git lfs install
git clone https://huggingface.co/bene-ges/en_g2p_cmu_bert_large

Run inference:

python ${NEMO_ROOT}/examples/nlp/text_normalization_as_tagging/normalization_as_tagging_infer.py \
  pretrained_model=en_g2p_cmu_bert_large/en_g2p.nemo \
  inference.from_file=input.txt \
  inference.out_file=output.txt \
  model.max_sequence_len=64 \
  inference.batch_size=128 \
  lang=en
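If the command needs to be launched from a Python program, the same call can be assembled as an argument list. The sketch below only repeats the script path and overrides shown above; the function names `build_infer_command` and `run_inference` are hypothetical, and `nemo_root` is assumed to point to a local NeMo checkout.

```python
import subprocess


def build_infer_command(nemo_root, model_path, in_file, out_file,
                        max_sequence_len=64, batch_size=128, lang="en"):
    """Build the argument list for the inference script shown above."""
    script = (f"{nemo_root}/examples/nlp/text_normalization_as_tagging/"
              "normalization_as_tagging_infer.py")
    return [
        "python", script,
        f"pretrained_model={model_path}",
        f"inference.from_file={in_file}",
        f"inference.out_file={out_file}",
        f"model.max_sequence_len={max_sequence_len}",
        f"inference.batch_size={batch_size}",
        f"lang={lang}",
    ]


def run_inference(cmd):
    """Run the command and raise CalledProcessError if it fails."""
    subprocess.run(cmd, check=True)


cmd = build_infer_command("/path/to/NeMo",  # assumed checkout location
                          "en_g2p_cmu_bert_large/en_g2p.nemo",
                          "input.txt", "output.txt")
print(" ".join(cmd))
```

Calling `run_inference(cmd)` would then start the script; it is left out of the sketch so that the block can be read and tried without NeMo installed.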

Example of input file:

g e f f e r t
p r o s c r i b e d
p r o m i n e n t l y
j o c e l y n
m a r c e c a ' s
s t a n k o w s k i
m u f f l e

Example of output file:

G EH1  F  ER0 T	               g e f f e r t           G EH1 <DELETE> F <DELETE> ER0 T   G EH1 <DELETE> F <DELETE> ER0 T   PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
P R OW0 S K R AY1 B  D         p r o s c r i b e d	   P R OW0 S K R AY1 B <DELETE> D    P R OW0 S K R AY1 B <DELETE> D    PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
P R AA1 M AH0 N AH0 N T L IY0  p r o m i n e n t l y   P R AA1 M AH0 N AH0 N T L IY0     P R AA1 M AH0 N AH0 N T L IY0     PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
JH AO1 S  L IH0 N              j o c e l y n           JH AO1 S <DELETE> L IH0 N         JH AO1 S <DELETE> L IH0 N         PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
M AA0 R S EH1 K AH0  Z         m a r c e c a ' s       M AA0 R S EH1 K AH0 <DELETE> Z	 M AA0 R S EH1 K AH0 <DELETE> Z    PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
S T AH0 NG K AO1 F S K IY0     s t a n k o w s k i     S T AH0 NG K AO1 F S K IY0        S T AH0 NG K AO1 F S K IY0        PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN
M AH1  F AH0L                  m u f f l e	           M AH1 <DELETE> F AH0_L <DELETE>   M AH1 <DELETE> F AH0_L <DELETE>   PLAIN PLAIN PLAIN PLAIN PLAIN PLAIN

Note that the output tags to use are in the third column and the input is in the second column. Tags correspond to input letters one-to-one. If you remove the <DELETE> tags and replace _ with a space, you should get a CMU-like transcription.
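The post-processing described above can be sketched in Python as follows. The function names `tags_to_cmu` and `align` are hypothetical, and both operate on strings that have already been extracted from the second and third columns; how the columns are delimited in the real output file is not assumed here.

```python
def tags_to_cmu(tags: str) -> str:
    """Drop <DELETE> tags and replace '_' with a space."""
    kept = [t for t in tags.split() if t != "<DELETE>"]
    return " ".join(kept).replace("_", " ")


def align(letters: str, tags: str):
    """Pair each input letter with its tag (one-to-one correspondence)."""
    ls, ts = letters.split(), tags.split()
    if len(ls) != len(ts):
        raise ValueError("letters and tags must have the same length")
    return list(zip(ls, ts))


# Values copied from the "m u f f l e" row of the example output.
letters = "m u f f l e"
tags = "M AH1 <DELETE> F AH0_L <DELETE>"

print(tags_to_cmu(tags))     # M AH1 F AH0 L
print(align(letters, tags))
```

The `align` helper is only a convenience for inspecting which letter produced which phoneme; a single tag such as `AH0_L` carries two phonemes for one letter.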

How to use for TTS

See this script to run TTS directly from CMU phonemes.