File size: 4,981 Bytes
f39272b e1d2098 054a77a f39272b e1d2098 9f9a940 054a77a 9f9a940 e1d2098 054a77a |
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 91 92 93 94 95 96 97 98 99 100 101 102 103 104 105 106 107 108 109 110 111 112 113 114 115 116 117 118 119 120 121 122 123 124 125 126 127 128 129 130 131 132 133 134 135 136 137 138 139 140 141 142 143 144 145 146 147 148 149 150 151 152 |
---
license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
arxiv: arxiv.org/abs/2306.04306
metrics:
- per
- aer
library_name: allophant
language:
- bn
- ca
- cs
- cv
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- hi
- hu
- id
- it
- ka
- ky
- lt
- mt
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- sw
- ta
- tr
- uk
---
Model Information
=================
Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.
The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).
| Model Name | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| [Multitask](https://huggingface.co/kgnlp/allophant) | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical) | 46.09% | **19.18%** | 34.35% | 8.56% |
| **Multitask Shared** | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared) | 48.25% | - | 45.35% | - |
| [Baseline](https://huggingface.co/kgnlp/allophant-baseline) | 57.01% | - | 46.95% | - |
Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.
Usage
=====
Install the [`allophant`](https://github.com/kgnlp/allophant) package:
```bash
pip install allophant
```
A pre-trained model can be loaded from a huggingface checkpoint or local file:
```python
from allophant.estimator import Estimator
device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-shared", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:
```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
````
Audio files can then be loaded, resampled and transcribed using the given
inventory by first computing the log probabilities for each classifier:
```python
import torch
import torchaudio
from allophant.dataset_processing import Batch
# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)
# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
batch.to(device),
attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:
```python
from allophant import predictions
# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)
for feature_name, decoder in ctc_decoders.items():
decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
# Print the feature name and values for each utterance in the batch
for [hypothesis] in decoded:
# NOTE: token indices are offset by one due to the <BLANK> token used during decoding
recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
print(feature_name, recognized)
```
Citation
========
```bibtex
@inproceedings{glocker2023allophant,
title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
year={2023},
booktitle={{Proc. Interspeech 2023}},
month={8}}
```
[](arxiv.org/abs/2306.04306)
|