kgnlp committed
Commit 7d8f9f3
1 Parent(s): 09c65cf

Added Usage section to README

Files changed (1): README.md (+64 -0)
README.md CHANGED
@@ -27,6 +27,70 @@ The model is based on [facebook/wav2vec2-xls-r-300m](https://huggingface.co/face

Note that our baseline models were trained without phonetic feature classifiers and therefore only support phoneme recognition.

Usage
=====

A pre-trained model can be loaded with the [`allophant`](https://github.com/kgnlp/allophant) package from a Hugging Face checkpoint or local file:

```python
from allophant.estimator import Estimator

device = "cpu"
model, attribute_indexer = Estimator.restore("kgnlp/allophant", device=device)
supported_features = attribute_indexer.feature_names
# The phonetic feature categories supported by the model, including "phonemes"
print(supported_features)
```
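
Since the loader accepts a local file as well as a hub ID, a previously downloaded checkpoint can be restored the same way. A minimal sketch, where `checkpoints/allophant.pt` is only a placeholder path; `"cuda"` can likewise be substituted for `"cpu"` above to run on a GPU:

```python
# Hypothetical local path to a previously downloaded checkpoint
model, attribute_indexer = Estimator.restore("checkpoints/allophant.pt", device=device)
```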

Allophant supports decoding custom phoneme inventories, which can be constructed in multiple ways:

```python
# 1. For a single language:
inventory = attribute_indexer.phoneme_inventory("es")
# 2. For multiple languages, e.g. in code-switching scenarios
inventory = attribute_indexer.phoneme_inventory(["es", "it"])
# 3. Any custom selection of phones for which features are available in the Allophoible database
inventory = ['a', 'ai̯', 'au̯', 'b', 'e', 'eu̯', 'f', 'ɡ', 'l', 'ʎ', 'm', 'ɲ', 'o', 'p', 'ɾ', 's', 't̠ʃ']
```
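
All three options assign to the same `inventory` variable, so the result can be inspected before decoding; this assumes, as the custom list in option 3 suggests, that an inventory is a sequence of IPA phone strings:

```python
# Assumption: the inventory is a sequence of IPA phone strings
print(len(inventory), sorted(inventory))
```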

Audio files can then be loaded, resampled and transcribed using the given inventory by first computing the log probabilities for each classifier:

```python
import torch
import torchaudio
from allophant.dataset_processing import Batch

# Load an audio file and resample the first channel to the sample rate used by the model
audio, sample_rate = torchaudio.load("utterance.wav")
audio = torchaudio.functional.resample(audio[:1], sample_rate, model.sample_rate)

# Construct a batch of 0-padded single channel audio, lengths and language IDs
# Language ID can be 0 for inference
batch = Batch(audio, torch.tensor([audio.shape[1]]), torch.zeros(1))
model_outputs = model.predict(
    batch.to(device),
    attribute_indexer.composition_feature_matrix(inventory).to(device)
)
```
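
The same `Batch` construction extends to several utterances at once. A minimal sketch of zero-padding multiple files to a common length with `torch.nn.utils.rnn.pad_sequence`; the file names here are hypothetical:

```python
from torch.nn.utils.rnn import pad_sequence

# Hypothetical file names; replace with your own recordings
paths = ["utterance_1.wav", "utterance_2.wav"]
clips = []
for path in paths:
    waveform, rate = torchaudio.load(path)
    # First channel only, resampled to the model's rate, as a 1-D tensor
    clips.append(torchaudio.functional.resample(waveform[:1], rate, model.sample_rate).squeeze(0))

lengths = torch.tensor([len(clip) for clip in clips])
# Zero-pad to the longest clip, yielding a (batch, max_length) tensor
batch = Batch(pad_sequence(clips, batch_first=True), lengths, torch.zeros(len(clips)))
```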

Finally, the log probabilities can be decoded into the recognized phonemes or phonetic features:

```python
from allophant import predictions

# Create a feature mapping for your inventory and CTC decoders for the desired feature set
inventory_indexer = attribute_indexer.attributes.subset(inventory)
ctc_decoders = predictions.feature_decoders(inventory_indexer, feature_names=supported_features)

for feature_name, decoder in ctc_decoders.items():
    decoded = decoder(model_outputs.outputs[feature_name].transpose(1, 0), model_outputs.lengths)
    # Print the feature name and values for each utterance in the batch
    for [hypothesis] in decoded:
        # NOTE: token indices are offset by one due to the <BLANK> token used during decoding
        recognized = inventory_indexer.feature_values(feature_name, hypothesis.tokens - 1)
        print(feature_name, recognized)
```
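
If only a phoneme transcription is needed, the loop can be restricted to the `"phonemes"` classifier included in `supported_features`. A small sketch under the assumption that `feature_values` returns a sequence of symbol strings:

```python
# Decode only the "phonemes" classifier from the outputs computed earlier
phoneme_decoder = ctc_decoders["phonemes"]
decoded = phoneme_decoder(model_outputs.outputs["phonemes"].transpose(1, 0), model_outputs.lengths)
for [hypothesis] in decoded:
    # Offset by one to account for the <BLANK> token, as in the full loop above
    phonemes = inventory_indexer.feature_values("phonemes", hypothesis.tokens - 1)
    # Assumption: feature_values yields symbol strings that can be joined
    print(" ".join(phonemes))
```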

Citation
========