---

license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
arxiv: arxiv.org/abs/2306.04306
metrics:
- per
- aer
library_name: allophant
language:
- bn
- ca
- cs
- cv
- da
- de
- el
- en
- es
- et
- eu
- fi
- fr
- ga
- hi
- hu
- id
- it
- ka
- ky
- lt
- mt
- nl
- pl
- pt
- ro
- ru
- sk
- sl
- sv
- sw
- ta
- tr
- uk
---


Model Information
=================

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

| Model Name       | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| **Multitask**        | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical)     | 46.09% | **19.18%** | 34.35% | 8.56% |
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared)  | 48.25% |   -    | 45.35% |  -    |
| [Baseline](https://huggingface.co/kgnlp/allophant-baseline)         | 57.01% |   -    | 46.95% |  -    |

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.

Usage
=====

Install the [`allophant`](https://github.com/kgnlp/allophant) package:

```bash
pip install allophant
```

A pre-trained model can be loaded from a Hugging Face checkpoint or a local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
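
If a GPU is available, the model can be loaded onto it in the same way; the sketch below only changes the device string passed to `Estimator.restore` and assumes nothing beyond the call shown above:

```python
import torch

from allophant.estimator import Estimator

# A minimal sketch: pick a CUDA device when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant", device=device)
```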
Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```
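
Whichever route you take, it can be worth inspecting the resulting inventory before decoding. This short check is only a sketch and assumes the inventory behaves like a sequence of IPA symbol strings, as in option 3 above:

```python
# Sanity check: list the phonemes that will be available to the decoder
# (assumes the inventory is a sequence of IPA symbol strings, as in option 3)
print(len(inventory), "phonemes:", ", ".join(inventory))
```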

Audio files can then be loaded, resampled and transcribed using the given
inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
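
Several recordings can also be transcribed in a single pass by padding them to a common length. The snippet below is only a sketch: it assumes `Batch` accepts a `(batch, samples)` tensor together with per-clip lengths, mirroring the single-file example above, and the file names are placeholders:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Hypothetical file names; replace with your own recordings
paths = ["utterance_1.wav", "utterance_2.wav"]

clips = []
for path in paths:
    audio, sample_rate = torchaudio.load(path)
    # Keep the first channel and resample it to the model's sample rate
    clips.append(torchaudio.functional.resample(audio[0], sample_rate, model.sample_rate))

# Zero-pad to the longest clip and keep the true lengths; language IDs can be 0 for inference
lengths = torch.tensor([clip.shape[0] for clip in clips])
padded = torch.nn.utils.rnn.pad_sequence(clips, batch_first=True)
batch = Batch(padded, lengths, torch.zeros(len(clips)))

model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```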

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
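
If you only need phoneme transcriptions, for instance with the baseline models that lack the phonetic feature classifiers, decoding can be restricted to the phoneme task. This is a sketch under the assumption that `"phonemes"` appears in `supported_features` and can be passed to `feature_decoders` on its own:

```python
from allophant import predictions

# A minimal sketch: build a CTC decoder for the phoneme task only
# (assumes "phonemes" is among the model's supported feature names)
phoneme_decoder = predictions.feature_decoders(inventory_indexer, feature_names=["phonemes"])["phonemes"]

decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    # Token indices are offset by one because of the <BLANK> token used during decoding
    print(inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1))
```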

Citation
========

```bibtex
@inproceedings{glocker2023allophant,
    title={Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
    author={Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
    year={2023},
    booktitle={{Proc. Interspeech 2023}},
    month={8}
}
```
[Preprint on arXiv](https://arxiv.org/abs/2306.04306)