---
language:
- hr
license: cc-by-sa-4.0
library_name: transformers
base_model: openai/whisper-large-v3
datasets:
- classla/Mici_Princ
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
widget:
- example_title: example 1
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_13_65.37-74.67.mp3.wav
- example_title: example 2
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_201.53-210.02.mp3.wav
- example_title: example 3
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_60.527-67.71.mp3.wav
- example_title: example 4
  src: https://huggingface.co/classla/whisper-large-v3-mici-princ/raw/main/MP_15_68.5-72.45.mp3.wav
---

# Model Card for classla/whisper-large-v3-mici-princ

This model was fine-tuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ),
the audiobook of the translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.

## Model Details

### Model Description

The base model, already strong on standard Croatian, was fine-tuned for 80 epochs with an effective batch size of 16. Performance was inspected every 4 epochs, and the latest checkpoint
is uploaded here. On the test split, the character error rate dropped from 11.54% to 3.95%, and the word error rate from 35.43% to 16.83%.

- **Developed by:** Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- **Language(s) (NLP):** Croatian (hrv) - Chakavian dialect (ckm)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Finetuned from model:** openai/whisper-large-v3

### Model Sources

- **Repository:** [GitHub](https://github.com/5roop/mici_princ_whisper)
- **Paper:** Coming soon
- **Dataset:** [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)

## Example use

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

# Run on a GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

# Load the fine-tuned model and its processor (tokenizer + feature extractor).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

ds = load_dataset("classla/Mici_Princ", split="test")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# Transcribe the audio column of the test split, decoding as Croatian.
result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

for i in result:
    print(i)

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
```
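
Each result is a dictionary with the full `text` and a list of timestamped `chunks`, as in the sample output above. The following is a minimal sketch of how such a result could be rendered as one line per chunk. The helper name `format_chunks` is hypothetical and not part of `transformers`, and the handling of a missing timestamp is an assumption about results where the pipeline does not produce one.

```python
def format_chunks(result: dict) -> list[str]:
    """Render each timestamped chunk as '[start-end] text'."""

    def label(value):
        # Assumption: a timestamp may be None when none was predicted.
        return f"{value:.2f}" if value is not None else "?"

    lines = []
    for chunk in result.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{label(start)}-{label(end)}] {chunk['text'].strip()}")
    return lines


# The sample result shown in the output comment above.
example = {
    "text": " Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.",
    "chunks": [
        {
            "timestamp": (0.0, 7.18),
            "text": " Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.",
        }
    ],
}

for line in format_chunks(example):
    print(line)
```

In the loop over `result`, the same helper could be applied to each item instead of printing the raw dictionary.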

## Training Details

#### Preprocessing

The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ). This means
that the training text kept capital letters and punctuation, with the exception of bullet points, newlines, and quotation marks. Special characters that occur in
the dialect but not in standard Croatian were substituted.

Only the `train` split was used in training.

#### Training Hyperparameters

```
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=100,
max_steps=277 * 80,
gradient_checkpointing=True,
predict_with_generate=True,
generation_max_length=225,
save_steps=277,
```
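
The arithmetic behind these settings can be spelled out as a short sketch. It assumes training on a single device and that 277 optimizer steps correspond to one epoch, which is how `max_steps=277 * 80` lines up with the 80 epochs mentioned in the model description. The variable names are illustrative only.

```python
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
steps_per_epoch = 277      # assumption: 277 optimizer steps make up one epoch
epochs = 80
inspect_every_epochs = 4   # performance was inspected every 4 epochs

# Samples consumed per optimizer step (single device assumed).
effective_batch_size = per_device_train_batch_size * gradient_accumulation_steps

# Total number of optimizer steps, matching max_steps=277 * 80.
max_steps = steps_per_epoch * epochs

# With save_steps=277, a checkpoint is written at the end of every epoch.
checkpoint_steps = list(range(steps_per_epoch, max_steps + 1, steps_per_epoch))

# Every fourth checkpoint is one at which performance was inspected.
inspected_steps = checkpoint_steps[inspect_every_epochs - 1 :: inspect_every_epochs]

print(effective_batch_size, max_steps, len(checkpoint_steps), len(inspected_steps))
```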

## Evaluation

For evaluation, the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used. The test split consists of two known speakers, Autor and Mići Princ, and two unknown speakers, Geograf and Dilavac. Note that each speaker uses a different micro-dialect, so the test data is challenging in that it includes two new micro-dialects.

#### Metrics

| speaker | WER vanilla | WER fine-tuned | WER reduction | CER vanilla | CER fine-tuned | CER reduction |
|---|---|---|---|---|---|---|
| all | 35.43% | 16.83% | 52.50% | 11.54% | 3.95% | 65.77% |
| Autor | 38.96% | 14.29% | 63.32% | 10.24% | 2.93% | 71.39% |
| Geograf | 20.94% | 11.57% | 44.75% | 4.99% | 2.19% | 56.11% |
| Mići Princ | 45.31% | 16.62% | 63.32% | 12.21% | 5.09% | 58.31% |
| Dilavac | 39.60% | 23.70% | 40.15% | 18.55% | 5.27% | 71.59% |
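
For readers unfamiliar with the metrics, here is a minimal pure-Python sketch of word error rate, character error rate, and the relative reduction reported in the table. It is an illustration under simple assumptions (whitespace tokenization, one utterance at a time, non-empty references) and is not necessarily the evaluation code used for this model. The hypothesis string in the example is made up.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitution, insertion, deletion)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance over the reference word count."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate: character-level edit distance over the reference length."""
    return edit_distance(reference, hypothesis) / len(reference)


def relative_reduction(before: float, after: float) -> float:
    """Relative error reduction in percent, as in the 'reduction' columns."""
    return (before - after) / before * 100


reference = "Šesti planet je biv deset put veći."
hypothesis = "Šesti planet je bio deset put veći."  # hypothetical model output
print(f"WER: {wer(reference, hypothesis):.2%}")
print(f"CER: {cer(reference, hypothesis):.2%}")
print(f"WER reduction (all): {relative_reduction(35.43, 16.83):.2f}%")
```

The reduction columns follow directly from the vanilla and fine-tuned columns; for example, `relative_reduction(11.54, 3.95)` reproduces the overall CER reduction.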

## Citation

Coming soon.

## Model Card Authors

Peter Rupnik

## Model Card Contact

[https://huggingface.co/5roop](https://huggingface.co/5roop)