---
language: hr
datasets:
- parlaspeech-hr
tags:
- audio
- automatic-speech-recognition
- parlaspeech
widget:
- example_title: example 1
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/1800.m4a
- example_title: example 2
  src: https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020578b.flac.wav
---

# wav2vec2-xls-r-parlaspeech-hr

This model for Croatian ASR is based on [facebook/wav2vec2-large-slavic-voxpopuli-v2](https://huggingface.co/facebook/wav2vec2-large-slavic-voxpopuli-v2) and was fine-tuned on 300 hours of recordings and transcripts from the Croatian parliamentary ASR dataset [ParlaSpeech-HR v1.0](http://hdl.handle.net/11356/1494).

The efforts resulting in this model were coordinated by Nikola Ljubešić. The rough manual data alignment was performed by Ivo-Pavao Jazbec, the method for fine automatic data alignment from [Plüss et al.](https://arxiv.org/abs/2010.02810) was applied by Vuk Batanović and Lenka Bajčetić, the transcripts were normalised by Danijel Koržinek, and the final modelling was performed by Peter Rupnik.

If you use this model, please cite the following paper:

Nikola Ljubešić, Danijel Koržinek, Peter Rupnik, Ivo-Pavao Jazbec. ParlaSpeech-HR -- a freely available ASR dataset for Croatian bootstrapped from the ParlaMint corpus. Submitted to ParlaCLARIN@LREC.

## Metrics

| split | CER    | WER    |
|-------|--------|--------|
| dev   | 0.0311 | 0.0921 |
| test  | 0.0222 | 0.0679 |

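As a rough sketch, WER and CER scores of this kind can be computed from reference and predicted transcripts with the `jiwer` package (our choice for illustration here, not necessarily the tooling behind the table above):

```python
# Sketch only: compute word and character error rates with jiwer.
# The transcript lists below are illustrative placeholders, not the
# actual dev/test data.
import jiwer

references = ["veliki broj poslovnih subjekata posluje sa minusom"]
hypotheses = ["veliki broj poslovnih subjekata posluje s minusom"]

print("WER:", jiwer.wer(references, hypotheses))
print("CER:", jiwer.cer(references, hypotheses))
```
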
## Usage in `transformers`

```python
from transformers import Wav2Vec2Processor, Wav2Vec2ForCTC
import soundfile as sf
import torch
import os

device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

# load the model and processor
processor = Wav2Vec2Processor.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr")
model = Wav2Vec2ForCTC.from_pretrained("classla/wav2vec2-xls-r-parlaspeech-hr").to(device)

# download the example wav file:
os.system("wget https://huggingface.co/classla/wav2vec2-xls-r-parlaspeech-hr/raw/main/00020570a.flac.wav")

# read the wav file
speech, sample_rate = sf.read("00020570a.flac.wav")
input_values = processor(speech, sampling_rate=sample_rate, return_tensors="pt").input_values.to(device)

# remove the raw wav file
os.system("rm 00020570a.flac.wav")

# retrieve logits
with torch.no_grad():
    logits = model(input_values).logits

# take argmax and decode
predicted_ids = torch.argmax(logits, dim=-1)
transcription = processor.decode(predicted_ids[0]).lower()

# transcription: 'veliki broj poslovnih subjekata posluje sa minusom velik dio'
```

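The same transcription can also be obtained with the high-level `pipeline` API; the sketch below assumes a recent `transformers` release and `ffmpeg` being available for decoding the audio file:

```python
# Sketch only: run the model through the automatic-speech-recognition pipeline.
from transformers import pipeline

asr = pipeline("automatic-speech-recognition", model="classla/wav2vec2-xls-r-parlaspeech-hr")
print(asr("00020570a.flac.wav")["text"])
```
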
## Training hyperparameters

In fine-tuning, the following arguments were used:

| arg                           | value |
|-------------------------------|-------|
| `per_device_train_batch_size` | 16    |
| `gradient_accumulation_steps` | 4     |
| `num_train_epochs`            | 8     |
| `learning_rate`               | 3e-4  |
| `warmup_steps`                | 500   |
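
For orientation, these values map onto `transformers.TrainingArguments` roughly as in the sketch below; arguments not listed in the table (such as `output_dir`) are illustrative placeholders, not the actual training setup.

```python
# Sketch only: the tabulated hyperparameters expressed as TrainingArguments.
# Values not listed in the table (e.g. output_dir) are placeholders.
from transformers import TrainingArguments

training_args = TrainingArguments(
    output_dir="wav2vec2-xls-r-parlaspeech-hr",  # placeholder
    per_device_train_batch_size=16,
    gradient_accumulation_steps=4,
    num_train_epochs=8,
    learning_rate=3e-4,
    warmup_steps=500,
)
```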