---

license: apache-2.0
datasets:
- mozilla-foundation/common_voice_10_0
base_model:
- facebook/wav2vec2-xls-r-300m
tags:
- pytorch
- phoneme-recognition
pipeline_tag: automatic-speech-recognition
---


Model Information
=================

Allophant is a multilingual phoneme recognizer trained on spoken sentences in 34 languages, capable of generalizing zero-shot to unseen phoneme inventories.

The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/facebook/wav2vec2-xls-r-300m) and was pre-trained on a subset of the [Common Voice Corpus 10.0](https://huggingface.co/datasets/mozilla-foundation/common_voice_10_0) transcribed with [eSpeak NG](https://github.com/espeak-ng/espeak-ng).

| Model Name       | UCLA Phonetic Corpus (PER) | UCLA Phonetic Corpus (AER) | Common Voice (PER) | Common Voice (AER) |
| ---------------- | ---------: | ---------: | -------: | -------: |
| **Multitask**        | **45.62%** | 19.44% | **34.34%** | **8.36%** |
| [Hierarchical](https://huggingface.co/kgnlp/allophant-hierarchical)     | 46.09% | **19.18%** | 34.35% | 8.56% |
| [Multitask Shared](https://huggingface.co/kgnlp/allophant-shared) | 46.05% | 19.52% | 41.20% | 8.88% |
| [Baseline Shared](https://huggingface.co/kgnlp/allophant-baseline-shared)  | 48.25% |   -    | 45.35% |  -    |
| [Baseline](https://huggingface.co/kgnlp/allophant-baseline)         | 57.01% |   -    | 46.95% |  -    |

PER and AER denote phoneme and attribute error rates, respectively. Note that our baseline models were trained without phonetic feature classifiers and therefore support only phoneme recognition.

Usage
=====
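
Inference requires the `allophant` package, which can be installed from PyPI, e.g. with `pip install allophant`.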

A pre-trained model can be loaded with the [`allophant`](https://github.com/kgnlp/allophant) package from a Hugging Face checkpoint or a local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")

# 2. For multiple languages, e.g. in code-switching scenarios:
inventory = attribute_indexer.phoneme_inventory(["es", "it"])

# 3. Any custom selection of phones for which features are available
#    in the Allophoible database:
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```
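
Before running inference, it can be useful to sanity-check the resolved inventory. The snippet below is a minimal sketch that assumes the value returned by `phoneme_inventory` behaves like a sequence of IPA phoneme strings, as the custom list in option 3 does:

```python
# Inspect the selected inventory (assumption: a sequence of IPA phoneme strings)
phones = list(inventory)
print(f"{len(phones)} phonemes:", " ".join(phones))
```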

Audio files can then be loaded, resampled and transcribed using the given
inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio

from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of zero-padded single-channel audio, lengths and language IDs;
# the language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
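
When transcribing several files, the loading and prediction steps above can be bundled into a small helper. The function below is a hypothetical convenience wrapper, not part of the `allophant` API; it merely recombines the calls already shown and reuses `model`, `attribute_indexer`, `inventory` and `device` from the previous snippets:

```python
def predict_log_probs(path: str):
    # Load one utterance and resample the first channel to the model's sample rate
    audio, sample_rate = torchaudio.load(path)
    audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)
    # Single-utterance batch; the language ID can remain 0 for inference
    batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
    return model.predict(
        batch.to(device),
        attribute_indexer.composition_feature_matrix(inventory).to(device),
    )

model_outputs = predict_log_probs("utterance.wav")
```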

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
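
If only the phoneme transcription is needed, the same loop can be restricted to the phoneme classifier. A minimal sketch, assuming `"phonemes"` is among the feature names printed from `supported_features` earlier:

```python
# Decode only the "phonemes" classifier output
decoded = ctc_decoders["phonemes"](
    model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths
)
for [hypothesis] in decoded:
    # Token indices are again offset by one because of the <BLANK> token
    print(inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1))
```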

Citation
========

```bibtex
@inproceedings{glocker2023allophant,
    title = {Allophant: Cross-lingual Phoneme Recognition with Articulatory Attributes},
    author = {Glocker, Kevin and Herygers, Aaricia and Georges, Munir},
    booktitle = {{Proc. Interspeech 2023}},
    year = {2023},
    month = {8}
}
```