import os
import subprocess

# Build and install espeak-ng 1.52.0 from source; it serves as the fallback G2P backend.
subprocess.run(['apt-get', 'update'])
subprocess.run(['apt-get', 'install', '-y', 'build-essential', 'gawk', 'libasound2-dev', 'libpulse-dev', 'autoconf', 'automake', 'libtool'])
subprocess.run(['wget', 'https://github.com/espeak-ng/espeak-ng/archive/refs/tags/1.52.0.tar.gz'])
subprocess.run(['tar', 'xf', '1.52.0.tar.gz'])
cwd = 'espeak-ng-1.52.0'
subprocess.run(['./autogen.sh'], cwd=cwd)
subprocess.run(['./configure'], cwd=cwd)
subprocess.run(['make'], cwd=cwd)
subprocess.run(['make', 'install'], cwd=cwd)
del cwd

# Check the installed binary with the freshly built shared library preloaded.
env = os.environ.copy()
env['LD_PRELOAD'] = '/usr/local/lib/libespeak-ng.so.1'
subprocess.run(['espeak-ng', '--version'], env=env)

# Point phonemizer at the same shared library.
from phonemizer.backend.espeak.wrapper import EspeakWrapper
EspeakWrapper.set_library('/usr/local/lib/libespeak-ng.so.1')

import spaces

# NOTE: `zero` is not defined anywhere in this file, so calling greet() would raise a NameError.
# The function is never called below; it appears to be a leftover @spaces.GPU placeholder.
@spaces.GPU
def greet(n):
    return f"Hello {zero + n} Tensor"

from misaki import en, espeak
import gradio as gr
import pprint
import time

# One espeak fallback per dialect, indexed by the `british` flag (False=0, True=1).
fbs = [espeak.EspeakFallback(british=british) for british in (False, True)]
# G2P engines indexed as g2p[use_spacy_transformer][british].
g2p = [[en.G2P(trf=trf, british=british, fallback=fbs[british]) for british in (False, True)] for trf in (False, True)]

def predict(text, use_spacy_transformer, british):
    start = time.time()
    ps, tokens = g2p[use_spacy_transformer][british](text)
    # Collect per-token debug info; a list entry in `tokens` yields a nested list.
    debug = []
    for word in tokens:
        if isinstance(word, list):
            debug.append([])
            for t in word:
                debug[-1].append(t.debug_all())
        else:
            debug.append(word.debug_all())
    trace = pprint.pformat(debug)
    elapsed_cpu_time = time.time() - start
    return ps, len(ps), trace, elapsed_cpu_time

with gr.Blocks() as app:
    gr.Markdown('''
Misaki is an experimental G2P engine designed to power future versions of Kokoro models. This English-only preview is primarily intended for researchers and linguists. It may be deeply uninteresting to most people.
''', container=True)
    gr.Interface(fn=predict, inputs=[gr.Text(), gr.Checkbox(), gr.Checkbox()], outputs=[gr.Text(label='phonemes'), gr.Number(label='token_count <= 510 fits in Kokoro context length'), gr.Text(label='trace'), gr.Number(label='elapsed_cpu_time')])
    gr.Markdown('''
### Examples
```md
American: [Misaki](/misˈɑki/) is an experimental G2P engine designed to power future versions of [Kokoro](/kˈOkəɹO/) models.
British: [Misaki](/misˈɑːki/) is an experimental G2P engine designed to power future versions of [Kokoro](/kˈQkəɹQ/) models.
But I am the Chosen One. But I [am](+1) the Chosen [One](-1). But I [am](+2) the Chosen [One](-2).
1002. [1002](#a#). [1002](#an#). [1002](#a&#).
2025. 2,025. $45.67 billion trillion.
```
''', container=True)
    gr.Markdown('''
### Token-Level Trace
```py
# 1. Text. Can be useful for aligning text to phonemes, e.g. highlighting text during audio playback.

# 2. Tag. See a full list of tags from spaCy:
# https://github.com/explosion/spaCy/blob/master/spacy/glossary.py

# 3. Whitespace. Whether or not a token has trailing whitespace (string => bool for this demo).
whitespace = True if whitespace else False

# 4. Phonemes. For this demo, the question mark means UNK, the ninja emoji means empty string.
phonemes = '❓' if phonemes is None else ('🥷' if phonemes == '' else phonemes)

# 5. Rating. Star rating for the estimated quality of this token's phonemes.
ratings = dict(
    user_override = '💎(5/5)',
    gold = '🏆(4/5)',
    silver = '🥈(3/5)',
    bronze = '🥉(2/5)',
    unk = '❓(UNK)',
)
```
''', container=True)
    gr.Markdown('''
### Notes
- For English, Misaki uses a gold dictionary with 80k words and a similarly sized silver dictionary.
- There are separate dictionaries for American & British English.
- Users can override the dictionary and/or individual tokens with custom pronunciations.
- `espeak-ng` is used as the fallback for OOD words, and the token is rated "bronze" in this case.
- Raw token objects are returned, with phonemes aligned at the per-token level.
- UNKs are easy to detect when `token.phonemes is None`.
- The entire implementation of Misaki (English) is <1000 lines of Python, excluding dictionary files.
- POS disambiguation should be live, e.g. to wound someone vs wound up.
- `use_spacy_transformer` should deliver more reliable POS tags.
- Non-POS-based disambiguation, like graph axes vs throwing axes, is still a TODO.
''', container=True)

with gr.Blocks() as info:
    gr.Markdown('''
# Misaki English Phonemes

For English, Misaki currently uses 49 total phonemes. Of these, 41 are shared by both Americans and Brits, 4 are American-only, and 4 are British-only.

Disclaimer: Author is an ML researcher, not a linguist, and may have butchered or reappropriated the traditional meaning of some symbols. These symbols are intended as input tokens for neural networks to yield optimal performance.

### 🤝 Shared (41)

**Stress Marks (2)**
- `ˈ`: Primary stress, visually looks similar to an apostrophe.
- `ˌ`: Secondary stress.

**IPA Consonants (22)**
- `bdfhjklmnpstvwz`: 15 alpha consonants taken from IPA. They mostly sound as you'd expect, but `j` actually represents the "y" sound, like `yes => jˈɛs`.
- `ɡ`: Hard "g" sound, like `get => ɡɛt`. Visually looks like the lowercase letter g, but it's actually `U+0261`.
- `ŋ`: The "ng" sound, like `sung => sˈʌŋ`.
- `ɹ`: Upside-down r is just an "r" sound, like `red => ɹˈɛd`.
- `ʃ`: The "sh" sound, like `shin => ʃˈɪn`.
- `ʒ`: The "zh" sound, like `Asia => ˈAʒə`.
- `ð`: Soft "th" sound, like `than => ðən`.
- `θ`: Hard "th" sound, like `thin => θˈɪn`.

**Consonant Clusters (2)**
- `ʤ`: A "j" or "dg" sound, merges `dʒ`, like `jump => ʤˈʌmp` or `lunge => lˈʌnʤ`.
- `ʧ`: The "ch" sound, merges `tʃ`, like `chump => ʧˈʌmp` or `lunch => lˈʌnʧ`.

**IPA Vowels (10)**
- `ə`: The schwa is a common, unstressed vowel sound, like `a 🍌 => ə 🍌`.
- `i`: As in `easy => ˈizi`.
- `u`: As in `flu => flˈu`.
- `ɑ`: As in `spa => spˈɑ`.
- `ɔ`: As in `all => ˈɔl`.
- `ɛ`: As in `hair => hˈɛɹ` or `bed => bˈɛd`. Possibly dubious, because those vowel sounds do not sound similar to my ear.
- `ɜ`: As in `her => hɜɹ`. Easy to confuse with `ɛ` above.
- `ɪ`: As in `brick => bɹˈɪk`.
- `ʊ`: As in `wood => wˈʊd`.
- `ʌ`: As in `sun => sˈʌn`.

**Diphthong Vowels (4)**
- `A`: The "eh" vowel sound, like `hey => hˈA`. Expands to `eɪ` in IPA.
- `I`: The "eye" vowel sound, like `high => hˈI`. Expands to `aɪ` in IPA.
- `W`: The "ow" vowel sound, like `how => hˌW`. Expands to `aʊ` in IPA.
- `Y`: The "oy" vowel sound, like `soy => sˈY`. Expands to `ɔɪ` in IPA.

**Custom Vowel (1)**
- `ᵊ`: Small schwa, muted version of `ə`, like `pixel => pˈɪksᵊl`. I made this one up, so I'm not entirely sure if it's correct.

### 🇺🇸 American-only (4)

**Vowels (3)**
- `æ`: The vowel sound at the start of `ash => ˈæʃ`.
- `O`: Capital letter representing the American "oh" vowel sound. Expands to `oʊ` in IPA.
- `ᵻ`: A sound somewhere in between `ə` and `ɪ`, often used in certain -s suffixes like `boxes => bˈɑksᵻz`.

**Consonant (1)**
- `ɾ`: A sound somewhere in between `t` and `d`, like `butter => bˈʌɾəɹ`.

### 🇬🇧 British-only (4)

**Vowels (3)**
- `a`: The vowel sound at the start of `ash => ˈaʃ`.
- `Q`: Capital letter representing the British "oh" vowel sound. Expands to `əʊ` in IPA.
- `ɒ`: The sound at the start of `on => ˌɒn`. Easy to confuse with `ɑ`, which is a shared phoneme.

**Other (1)**
- `ː`: Vowel extender, visually looks similar to a colon. Possibly dubious, because Americans extend vowels too, but the gold US dictionary somehow lacks these. Often used by the Brits instead of `ɹ`: Americans say `or => ɔɹ`, but Brits say `or => ɔː`.

### ♻️ Misaki to espeak
```py
def to_espeak(ps):
    # Optionally, you can add a tie character in between the 2 replacement characters.
    ps = ps.replace('ʤ', 'dʒ').replace('ʧ', 'tʃ')
    ps = ps.replace('A', 'eɪ').replace('I', 'aɪ').replace('Y', 'ɔɪ')
    ps = ps.replace('O', 'oʊ').replace('Q', 'əʊ').replace('W', 'aʊ')
    return ps.replace('ᵊ', 'ə')
```
''')

demo = gr.TabbedInterface(
    [app, info],
    ['🔥 Misaki English', 'ℹ️ Phonemes'],
)
demo.launch()