Added Usage section to README
README.md
CHANGED
@@ -27,6 +27,70 @@ The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/face
Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.

Usage
=====

A pre-trained model can be loaded with the [`allophant`](https://github.com/kgnlp/allophant) package from a huggingface checkpoint or local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant-shared", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
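Loading from a local file is mentioned above but not shown; as a minimal sketch, assuming `Estimator.restore` also accepts a filesystem path in place of a Hugging Face model ID (the path below is hypothetical):

```python
# Hypothetical local checkpoint path; assumes Estimator.restore handles a
# filesystem path the same way as a Hugging Face model ID
model, attribute_indexer = Estimator.restore("path/to/allophant-checkpoint", device=device)
```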

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```
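The chosen inventory is later mapped onto the model's classifiers through a feature matrix; as a quick sketch, the `composition_feature_matrix` call from the transcription example below can also be used on its own to inspect that mapping (assuming the returned object is a torch tensor):

```python
# Sketch: build the feature matrix for the chosen inventory ahead of time
# and check its dimensions (assumes the result is a torch tensor with .shape)
feature_matrix = attribute_indexer.composition_feature_matrix(inventory)
print(feature_matrix.shape)
```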

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
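For more than one recording, a padded batch can be built the same way; the sketch below uses `torch.nn.utils.rnn.pad_sequence` for the zero-padding and hypothetical file names, and assumes `Batch` accepts a multi-utterance tensor of shape (batch, samples):

```python
from torch.nn.utils.rnn import pad_sequence

# Hypothetical file names; each clip is reduced to one channel and resampled as above
paths = ["utterance_1.wav", "utterance_2.wav"]
clips = []
for path in paths:
    audio, sample_rate = torchaudio.load(path)
    clips.append(torchaudio.functional.resample(audio[0], sample_rate, model.sample_rate))

# Zero-pad all clips to the length of the longest one and keep the true lengths
lengths = torch.tensor([clip.shape[0] for clip in clips])
padded_audio = pad_sequence(clips, batch_first=True)

batch = Batch(padded_audio, lengths, torch.zeros(len(clips)))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```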

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
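If only a phoneme transcription is needed, the same loop can be narrowed to the "phonemes" decoder; a small sketch, assuming `feature_values` returns a sequence of phoneme strings that can be joined:

```python
# Sketch: decode only the phoneme classifier and join the result into one string
# (assumes feature_values yields phoneme strings for the "phonemes" category)
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    print(" ".join(phonemes))
```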

Citation
========