File size: 4,608 Bytes
f39272b e1d2098 f39272b e1d2098 9f9a940 e1d2098 |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 |
---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
---
Model Information
=================
Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.
The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).
| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
| **Multitask Shared** | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
| [Baseline](https://huggingface.co/kgnlp/allophant-baseline) | 57.01% | - | 46.95% | - |
Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
Usage
=====
A pre-trained model can be loaded with the [`allophant`](https://github.com/kgnlp/allophant) package from a huggingface checkpoint or local file:
```python
from allophant.estimator import Estimator
device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-shared", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:
```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
````
Audio files can then be loaded, resampled and transcribed using the given
inventory by first computing the log probabilities for each classifier:
```python
import torch
import torchaudio
from allophant.dataset_processing import Batch
# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)
# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
batch.to(device),
attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:
```python
from allophant import predictions
# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)
for feature_name, decoder in ctc_decoders.items():
decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
# Print the feature name and values for each utterance in the batch
for [hypothesis] in decoded:
# NOTE: token indices are offset by one due to the <BLANK> token used during decoding
recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
print(feature_name, recognized)
```
Citation
========
```bibtex
@inproceedings{glocker2023allophant,
title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
year={2023},
booktitle={{Proc. Interspeech 2023}},
month={8}}
```
|