---
library_name: transformers
datasets:
- classla/Mici_Princ
language:
- hr
license: cc-by-sa-4.0
metrics:
- wer
- cer
pipeline_tag: automatic-speech-recognition
---

# Model Card for classla/whisper-large-v3-mici-princ

This model was fine-tuned on the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ),
an audiobook of a translation of _Le Petit Prince_ into the Chakavian dialect of Croatian.

## Model Details

### Model Description


This is the model card of a 🤗 transformers model that has been pushed to the Hub.

- **Developed by:** Nikola Ljubešić, Peter Rupnik, Tea Perinčić
- **Model type:** Whisper (sequence-to-sequence model for automatic speech recognition)
- **Language(s) (NLP):** Croatian (Chakavian dialect)
- **License:** CC BY-SA 4.0 (Creative Commons Attribution-ShareAlike 4.0 International)
- **Finetuned from model:** openai/whisper-large-v3

### Model Sources


- **Repository:** [GitHub](https://github.com/5roop/mici_princ_whisper)
- **Paper:** Coming soon
- **Dataset:** [Mići Princ](https://huggingface.co/datasets/classla/Mici_Princ)

## Example use

```python
import torch
from datasets import load_dataset
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from transformers.pipelines.pt_utils import KeyDataset

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model_id = "classla/whisper-large-v3-mici-princ"

# Load the fine-tuned model and its processor (tokenizer + feature extractor)
model = AutoModelForSpeechSeq2Seq.from_pretrained(model_id)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# Load the test split of the Mići Princ dataset
ds = load_dataset("classla/Mici_Princ", split="test")
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    device=device,
)

result = pipe(
    KeyDataset(ds, "audio"),
    generate_kwargs={"language": "croatian"},
)

for i in result:
    print(i)

# Output:
# {'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.', 'chunks': [{'timestamp': (0.0, 7.18), 'text': ' Šesti planet je biv deset put veći. Na njin je bivav niki stari čovik ki je pisav vele knjige.'}]}
# ...

```



## Training Details

### Training Data

The model was fine-tuned on the `train` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ); see the dataset card for details on the audio and its transcription.

### Training Procedure


#### Preprocessing

The model was trained on the `normalized_text` attribute of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ).
The data therefore retained capital letters and punctuation, with the exception of bullet points, newlines, and quotation marks, which were removed.
Special characters that appear in the dialect but not in standard Croatian were substituted.

Only the `train` split was used in training.
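
For illustration, a normalization step along these lines could look like the sketch below. The substitution table is hypothetical: the exact character mapping used for the dataset is not documented in this card.

```python
import re

# Hypothetical mapping from dialect-specific characters to their closest
# standard Croatian counterparts; the actual table is not documented here.
DIALECT_SUBSTITUTIONS = {
    "ś": "š",
    "ź": "ž",
}

def normalize(text: str) -> str:
    text = text.replace("\n", " ")            # drop newlines
    text = re.sub(r'[•"“”„«»]', "", text)     # drop bullet points and quotation marks
    for src, dst in DIALECT_SUBSTITUTIONS.items():
        text = text.replace(src, dst)         # substitute dialect characters
    return re.sub(r"\s+", " ", text).strip()  # collapse repeated whitespace
```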

#### Training Hyperparameters

```
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=309 * 10,
    gradient_checkpointing=True,
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=309,
```
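
Judging by `predict_with_generate` and `generation_max_length`, these are 🤗 `Seq2SeqTrainingArguments` for use with `Seq2SeqTrainer`. A minimal sketch of how they would be wired up (the `output_dir` value and anything not listed above are assumptions):

```python
from transformers import Seq2SeqTrainingArguments

training_args = Seq2SeqTrainingArguments(
    output_dir="whisper-large-v3-mici-princ",  # assumed
    per_device_train_batch_size=4,
    gradient_accumulation_steps=4,   # effective batch size of 16
    learning_rate=1e-5,
    warmup_steps=100,
    max_steps=309 * 10,              # presumably 10 epochs of 309 steps
    gradient_checkpointing=True,     # trade compute for memory
    predict_with_generate=True,
    generation_max_length=225,
    save_steps=309,                  # checkpoint every 309 steps
)
```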

## Evaluation

For evaluation, the `test` split of the [Mići Princ dataset](https://huggingface.co/datasets/classla/Mici_Princ) was used. 

### Metrics

* WER: 0.16248
* CER: 0.04422
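
The scores can presumably be reproduced along the following lines with the 🤗 `evaluate` library, reusing `pipe` and `ds` from the example above; the reference column follows the preprocessing section (`normalized_text`):

```python
import evaluate
from transformers.pipelines.pt_utils import KeyDataset

wer_metric = evaluate.load("wer")
cer_metric = evaluate.load("cer")

# Transcribe the test split and compare against the normalized references
outputs = pipe(KeyDataset(ds, "audio"), generate_kwargs={"language": "croatian"})
predictions = [out["text"].strip() for out in outputs]
references = ds["normalized_text"]

print("WER:", wer_metric.compute(predictions=predictions, references=references))
print("CER:", cer_metric.compute(predictions=predictions, references=references))
```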


## Citation 

Coming soon.

## Model Card Authors

Peter Rupnik

## Model Card Contact

[https://huggingface.co/5roop](https://huggingface.co/5roop)