language: cs
tags:
- Czech
- KKY
- FAV
license: cc-by-nc-sa-4.0
wav2vec2-base-cs-80k-ClTRUS
Czech language TRansformer from Unlabeled Speech (ClTRUS) is a monolingual Czech Wav2Vec 2.0 base model pre-trained on more than 80 thousand hours of Czech speech.
This model does not have a tokenizer because it was pre-trained on audio alone. To use it for speech recognition, create a tokenizer and fine-tune the model on labeled speech data (audio paired with transcripts).
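As a rough illustration of the tokenizer step, the sketch below builds a character-level CTC vocabulary from training transcripts and then loads the checkpoint with a freshly initialized CTC head. It is not the authors' fine-tuning recipe: the helper names `build_ctc_vocab` and `load_for_finetuning`, the special tokens `[PAD]`, `[UNK]` and `|`, and the file name `vocab.json` are all assumptions chosen for the example, and the training loop itself is omitted.

```python
import json


def build_ctc_vocab(transcripts):
    """Map every character seen in the transcripts to an integer id.

    Hypothetical helper: spaces are represented by the word delimiter "|",
    and "[PAD]" doubles as the CTC blank token.
    """
    chars = sorted(set("".join(transcripts)) - {" "})
    vocab = {"[PAD]": 0, "[UNK]": 1, "|": 2}
    for ch in chars:
        vocab[ch] = len(vocab)
    return vocab


def load_for_finetuning(vocab, vocab_path="vocab.json",
                        model_name="fav-kky/wav2vec2-base-cs-80k-ClTRUS"):
    """Write the vocabulary to disk and load tokenizer plus CTC model."""
    # Imported here so build_ctc_vocab works without transformers installed.
    from transformers import Wav2Vec2CTCTokenizer, Wav2Vec2ForCTC

    with open(vocab_path, "w", encoding="utf-8") as f:
        json.dump(vocab, f, ensure_ascii=False)

    tokenizer = Wav2Vec2CTCTokenizer(
        vocab_path,
        unk_token="[UNK]",
        pad_token="[PAD]",
        word_delimiter_token="|",
    )
    # The CTC output layer is new and randomly initialized; it must be trained.
    model = Wav2Vec2ForCTC.from_pretrained(
        model_name,
        vocab_size=len(tokenizer),
        pad_token_id=tokenizer.pad_token_id,
        ctc_loss_reduction="mean",
    )
    return tokenizer, model
```

In practice the transcripts would usually be normalized (lowercasing, punctuation removal) before the vocabulary is built, so that the output layer stays small.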
Note: This is a checkpoint of the model after 4 epochs over the whole dataset. For earlier or later checkpoints, please contact the author (jlehecka(at)kky.zcu.cz).
Pretraining data
More than 80 thousand hours of unlabeled Czech speech:
- recordings from radio (22k hours),
- unlabeled data from VoxPopuli dataset (18.7k hours),
- TV shows (15k hours),
- shadow speakers (12k hours),
- sports (5k hours),
- telephone data (2k hours),
- and a smaller amount of data from several other domains including the public CommonVoice dataset.
Usage
Inputs must be 16 kHz mono audio files.
This model can be used, for example, to extract per-frame contextual embeddings from audio:
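import torch
import torchaudio
from transformers import Wav2Vec2FeatureExtractor, Wav2Vec2Model
feature_extractor = Wav2Vec2FeatureExtractor.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
model = Wav2Vec2Model.from_pretrained("fav-kky/wav2vec2-base-cs-80k-ClTRUS")
model.eval()  # inference mode: disables dropout
# Load the audio; speech_array has shape (channels, samples).
speech_array, sampling_rate = torchaudio.load("/path/to/audio/file.wav")
# The model expects 16 kHz input, so resample if the file uses another rate.
if sampling_rate != 16_000:
    speech_array = torchaudio.functional.resample(speech_array, sampling_rate, 16_000)
inputs = feature_extractor(
    speech_array[0].numpy(),  # first (and for mono files, only) channel as a 1-D array
    sampling_rate=16_000,
    return_tensors="pt",
)
# No gradients are needed for feature extraction.
with torch.no_grad():
    output = model(inputs["input_values"])
# Per-frame embeddings with shape (frames, hidden_size).
embeddings = output.last_hidden_state[0].numpy()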
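To relate the rows of `embeddings` to time, it helps to know how many frames the convolutional feature encoder produces for a given number of samples. The sketch below assumes this checkpoint uses the standard Wav2Vec 2.0 base encoder configuration (seven convolutional layers with the kernel sizes and strides listed in the code); these values are an assumption and can be checked against `model.config.conv_kernel` and `model.config.conv_stride`. The function name `num_frames` is hypothetical.

```python
# Assumed values: the standard Wav2Vec 2.0 base feature-encoder configuration.
CONV_KERNELS = (10, 3, 3, 3, 3, 2, 2)
CONV_STRIDES = (5, 2, 2, 2, 2, 2, 2)


def num_frames(num_samples: int) -> int:
    """Number of output frames for an input of num_samples audio samples."""
    length = num_samples
    for kernel, stride in zip(CONV_KERNELS, CONV_STRIDES):
        # Output length of an unpadded 1-D convolution.
        length = (length - kernel) // stride + 1
        if length <= 0:
            return 0  # input too short to produce a single frame
    return length


print(num_frames(16_000))  # one second of 16 kHz audio -> 49
```

Under these assumptions the total stride is 320 samples, so consecutive frames are about 20 ms apart, and each frame covers a receptive field of 400 samples (25 ms).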
Speech recognition results
After fine-tuning, the model achieved the following word error rates (WER) on public datasets:
- Czech portion of CommonVoice v7.0: WER = 3.8%
- Czech portion of VoxPopuli: WER = 8.8%
See our paper for details.
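For readers unfamiliar with the metric, WER is the word-level edit distance between the reference transcript and the recognizer output (substitutions, deletions and insertions), divided by the number of reference words. The following is a minimal generic sketch, not the scoring script used for the results above; the function name `word_error_rate` is hypothetical, and any text normalization applied before scoring is left out.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# One substitution and one deletion against four reference words.
print(word_error_rate("a b c d", "a x c"))  # 0.5
```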
Paper
The preprint of our paper (accepted to INTERSPEECH 2022) is available at http://arxiv.org/abs/2206.07627
Citation
If you find this model useful, please cite our paper:
@inproceedings{wav2vec2-base-cs-80k-ClTRUS,
  title = {Exploring Capabilities of Monolingual Audio Transformers using Large Datasets in Automatic Speech Recognition of {C}zech},
  author = {
    Jan Lehe\v{c}ka and
    Jan \v{S}vec and
    Ale\v{s} Pra\v{z}\'ak and
    Josef V. Psutka
  },
  booktitle = {{I}nterspeech 2022},
  publisher = {{ISCA}},
  year = {2022},
  note = {(in press)},
  url = {https://arxiv.org/abs/2206.07627},
}