---
library_name: transformers
datasets:
- classla/Mici_Princ
language:
- hr
license: cc-by-sa-4.0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# Model Card for classla/whisper-large-v3-mici-princ

This model was fine-tuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ),
the audiobook of the translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.

## Model Details

### Model Description

A version of `openai/whisper-large-v3` fine-tuned for automatic speech recognition of the Chakavian dialect of Croatian.

- **Developed by:** Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- **Model type:** Whisper encoder-decoder model for automatic speech recognition
- **Language(s) (NLP):** Croatian (hrv) - Chakavian dialect (ckm)
- **License:** Creative Commons Attribution-ShareAlike 4.0 (CC BY-SA 4.0)
- **Finetuned from model:** openai/whisper-large-v3

### Model Sources

- **Repository:** [GitHub](https://github.com/5roop/mici_princ_whisper)
- **Paper:** Coming soon
- **Dataset:** [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)

## Example use

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

# Use the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

# Load the fine-tuned model and its processor (tokenizer + feature extractor).
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Transcribe the test split of the dataset the model was fine-tuned on.
ds = load_dataset("classla/Mici_Princ", split="test")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

# KeyDataset feeds only the "audio" column to the pipeline.
result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

for i in result:
    print(i)

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...
```
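
Each item yielded by the pipeline is a dictionary with a `text` field and, because `return_timestamps=True`, a `chunks` list, as in the output shown in the example. A minimal sketch of turning one such item into timestamped lines might look as follows. The helper name `format_chunks` is hypothetical and not part of `transformers`, and the handling of a missing timestamp is an assumption; the sample dictionary is copied from the example output.

```python
def format_chunks(item):
    """Render one pipeline result as '[start-end] text' lines."""

    def fmt(t):
        # Assumption: a timestamp may be None (for example, for a cut-off final chunk).
        return "?" if t is None else f"{t:.2f}"

    lines = []
    for chunk in item.get("chunks", []):
        start, end = chunk["timestamp"]
        lines.append(f"[{fmt(start)}-{fmt(end)}] {chunk['text'].strip()}")
    return lines


# Sample item copied from the example output of the model card.
sample = {
    "text": " Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.",
    "chunks": [
        {
            "timestamp": (0.0, 7.18),
            "text": " Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.",
        }
    ],
}

for line in format_chunks(sample):
    print(line)
```

In the example script, the same helper could be applied to every item inside the `for i in result:` loop.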

## Training Details

#### Preprocessing

The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ).
The training text therefore retains capital letters and punctuation, with the exception of bullet points, newlines, and quotation marks.
Special characters that occur in the dialect but not in standard Croatian were substituted.

Only the `train` split was used in training.
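
To make the kind of normalization described above concrete, here is a rough sketch. It is not the procedure used to build the dataset: the function name `normalize_sketch`, the set of removed characters, and the substitution table passed in the usage line are all illustrative assumptions. The actual `normalized_text` values should be taken from the dataset itself.

```python
import re

# Assumption: an illustrative set of quotation marks and bullets to drop.
REMOVED_CHARS = '"„“”«»•'


def normalize_sketch(text, substitutions=None):
    """Drop quotation marks and bullets, flatten newlines, keep case and punctuation."""
    for ch in REMOVED_CHARS:
        text = text.replace(ch, "")
    # Hypothetical mapping from dialect-specific characters to standard ones.
    for src, dst in (substitutions or {}).items():
        text = text.replace(src, dst)
    # Newlines and repeated spaces collapse into a single space.
    return re.sub(r"\s+", " ", text).strip()


# The substitution {"ô": "o"} is purely illustrative.
print(normalize_sketch("„Dobar dan“,\nreče ôn.", {"ô": "o"}))
```

Capital letters and sentence punctuation pass through unchanged, which matches the description of the training text above.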

#### Training Hyperparameters

```
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
learning_rate=1e-5,
warmup_steps=100,
max_steps=309 * 10,
gradient_checkpointing=True,
predict_with_generate=True,
generation_max_length=225,
save_steps=309,
```
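
A short worked calculation shows what these settings imply. The number of devices is not stated in this card, so `num_devices = 1` below is an assumption; reading `309 * 10` as ten passes of 309 optimizer steps each is likewise only a plausible interpretation.

```python
# Values copied from the hyperparameter listing.
per_device_train_batch_size = 4
gradient_accumulation_steps = 4
max_steps = 309 * 10
save_steps = 309

# Assumption: training ran on a single device.
num_devices = 1

# Examples contributing to each optimizer update.
effective_batch_size = (
    per_device_train_batch_size * gradient_accumulation_steps * num_devices
)

# Checkpoints written over the whole run (one every `save_steps` steps).
num_checkpoints = max_steps // save_steps

# Total training examples processed, counting repeats across passes.
examples_processed = effective_batch_size * max_steps

print(effective_batch_size, num_checkpoints, examples_processed)
```

Under the single-device assumption, each update uses 16 examples and the run saves 10 checkpoints.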

## Evaluation

The model was evaluated on the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ).

#### Metrics

* WER: 0.04422
* CER: 0.16248
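
For readers unfamiliar with the two metrics, the sketch below computes word error rate (WER) and character error rate (CER) for a single utterance as edit distance divided by reference length. This card does not state which tool produced the reported figures or how they were aggregated over the test split, so this is the conventional definition rather than the authors' evaluation script. The hypothesis string is invented for illustration.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character-level edit distance divided by the number of reference characters."""
    return edit_distance(reference, hypothesis) / len(reference)


reference = "Šesti planet je biv deset put veći."
hypothesis = "Šesti planet je bil deset put veći."  # hypothetical model output

# One of seven words differs, and one of 35 characters differs.
print(wer(reference, hypothesis), cer(reference, hypothesis))
```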

## Citation

Coming soon.

## Model Card Authors

Peter Rupnik

## Model Card Contact

[https://huggingface.co/5roop](https://huggingface.co/5roop)