---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
arxiv: arxiv.org/abs/2306.04306
metrics:
- per
- aer
library_name: allophant
language:
- bn
- ca
- cs
- cv
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- hi
- hu
- id
- it
- ka
- ky
- lt
- mt
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- sw
- ta
- tr
- uk
---

Model Information
=================

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories. The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | -------------------------: | -------------------------: | -----------------: | -----------------: |
| **Multitask** | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
| [Baseline](https://huggingface.co/kgnlp/allophant-baseline) | 57.01% | - | 46.95% | - |

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.

Usage
=====

Install the [`allophant`](https://github.com/kgnlp/allophant) package:

```bash
pip install allophant
```

A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios:
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database:
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```

Audio files can then be loaded, resampled, and transcribed using the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single-channel audio, lengths, and language IDs
# The language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device),
)
```

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the blank token used during CTC decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
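The PER column in the results table reports the phoneme error rate, the standard length-normalized edit distance between recognized and reference phoneme sequences (AER is computed analogously over articulatory attribute values). As a minimal, self-contained illustration, the helper below is a sketch written for this card, not part of the `allophant` API:

```python
def phoneme_error_rate(reference: list[str], hypothesis: list[str]) -> float:
    """Levenshtein distance between phoneme sequences, normalized by reference length."""
    # Standard dynamic programming edit distance over phoneme tokens
    previous = list(range(len(hypothesis) + 1))
    for i, reference_phoneme in enumerate(reference, start=1):
        current = [i]
        for j, hypothesis_phoneme in enumerate(hypothesis, start=1):
            current.append(min(
                previous[j] + 1,  # deletion
                current[j - 1] + 1,  # insertion
                previous[j - 1] + (reference_phoneme != hypothesis_phoneme),  # substitution
            ))
        previous = current
    return previous[-1] / len(reference)

# One substitution out of three reference phonemes yields a PER of 1/3
print(phoneme_error_rate(["o", "l", "a"], ["o", "ʎ", "a"]))
```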
Citation
========

```bibtex
@inproceedings{glocker2023allophant,
  title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
  author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
  year={2023},
  booktitle={{Proc. Interspeech 2023}},
  month={8}
}
```

[Paper on arXiv](https://arxiv.org/abs/2306.04306)