jlehecka committed on
Commit
ae6e20e
1 Parent(s): 6588abf

Update README.md

  1. README.md +22 -0
README.md CHANGED
@@ -24,6 +24,28 @@ More than 80 thousand hours of unlabeled Czech speech:
  - telephone data (2k hours),
  - and a smaller amount of data from several other domains including the public CommonVoice dataset.

+ ## Usage
+ Inputs must be 16kHz mono audio files.
+
+ This model could be used e.g. to extract per-frame contextual embeddings from audio:
+ ```python
+ from transformers import Wav2Vec2Model, Wav2Vec2FeatureExtractor
+ import torchaudio
+
+ feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
+ model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
+
+ speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
+ inputs = feature_extractor(
+     speech_array,
+     sampling_rate=16_000,
+     return_tensors="pt"
+ )["input_values"][0]
+
+ output = model(inputs)
+ embeddings = output.last_hidden_state.detach().numpy()[0]
+ ```
+
  ## Speech recognition results
  After fine-tuning, the model scored the following results on public datasets:
  - Czech portion of CommonVoice v7.0: **WER = 3.8%**
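
The Usage section added in this commit requires 16 kHz mono input, but the snippet passes `sampling_rate=16_000` without checking the file it loaded. Below is a minimal sketch of a pre-flight check using only Python's standard `wave` module. The helper name `is_16khz_mono` and the two constants are illustrative assumptions, not part of the README or of `transformers`. The `wave` module reads PCM WAV files only, so other formats would need a different reader.

```python
import wave

# Values taken from the README's "Inputs must be 16kHz mono audio files."
EXPECTED_RATE = 16_000   # samples per second
EXPECTED_CHANNELS = 1    # mono


def is_16khz_mono(path):
    """Hypothetical helper: True if the PCM WAV file at `path` is 16 kHz mono."""
    with wave.open(path, "rb") as wav:
        return (
            wav.getframerate() == EXPECTED_RATE
            and wav.getnchannels() == EXPECTED_CHANNELS
        )
```

A caller could run this check before `torchaudio.load` and resample or downmix the file when it returns `False`.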