fav-kky/wav2vec2-base-cs-80k-ClTRUS

jlehecka committed on Jun 16, 2022

Commit ae6e20e • 1 parent: 6588abf

Update README.md

Files changed (1)

README.md +22 -0

@@ -24,6 +24,28 @@ More than 80 thousand hours of unlabeled Czech speech:
 - telephone data (2k hours),
 - and a smaller amount of data from several other domains including the public CommonVoice dataset.
+## Usage
+Inputs must be 16 kHz mono audio files.
+This model can be used, for example, to extract per-frame contextual embeddings from audio:
+```python
+from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
+import torchaudio
+feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
+model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
+speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")  # tensor of shape (channels, samples)
+inputs = feature_extractor(
+    speech_array,
+    sampling_rate=16_000,
+    return_tensors="pt"
+)["input_values"][0]
+output = model(inputs)
+embeddings = output.last_hidden_state.detach().numpy()[0]  # array of shape (frames, hidden_size)
+```
 ## Speech recognition results
 After fine-tuning, the model scored the following results on public datasets:
 - Czech portion of CommonVoice v7.0: **WER = 3.8%**
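
Note on the added Usage section: "per-frame" embeddings means one vector per output step of the convolutional feature encoder, not one per audio sample. The sketch below estimates how many frames a clip yields. It assumes this checkpoint uses the default Wav2Vec2 encoder layout (kernel sizes 10, 3, 3, 3, 3, 2, 2 and strides 5, 2, 2, 2, 2, 2, 2), which the README does not state, so check the model's `config.json` before relying on it. `num_frames` is a hypothetical helper, not part of `transformers`.

```python
# Assumed encoder layout: the default Wav2Vec2 configuration,
# not confirmed for this particular checkpoint.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples, kernels=CONV_KERNELS, strides=CONV_STRIDES):
    """Return the number of frames an unpadded 1-D conv stack produces."""
    length = num_samples
    for kernel, stride in zip(kernels, strides):
        if length < kernel:
            raise ValueError("audio is too short for the convolutional stack")
        # Standard output-length formula for a convolution without padding.
        length = (length - kernel) // stride + 1
    return length


# One second of 16 kHz audio under the assumed layout.
frames_per_second = num_frames(16_000)
print(frames_per_second)
```

Under these assumptions the strides multiply to 320 samples, so consecutive frames are roughly 20 ms apart at 16 kHz, and the first dimension of `embeddings` in the snippet above should match `num_frames(speech_array.shape[-1])`.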