---
library_name: transformers
language:
- de
license: mit
base_model: openai/whisper-large-v3-turbo
tags:
- generated_from_trainer
datasets:
- MR-Eder/GER-STT-50-Conversations
metrics:
- wer
model-index:
- name: Whisper Large-v3 Turbo German - GRAG
  results:
  - task:
      name: Automatic Speech Recognition
      type: automatic-speech-recognition
    dataset:
      name: GER-STT-50-Conversations
      type: MR-Eder/GER-STT-50-Conversations
      config: default
      split: None
      args: 'config: de, split: test'
    metrics:
    - name: WER
      type: wer
      value: 15.16864233785768
pipeline_tag: automatic-speech-recognition
---


# GRAG-WHISPER-LARGE-v3-TURBO-HESSIAN-AI

This model is a fine-tuned version of [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) on a carefully curated 13-hour dataset of spoken German conversations.


## Evaluations - Word error rate

| Test dataset             | openai-whisper-large-v3-turbo | **GRAG-WHISPER-LARGE-v3-TURBO** | primeline-whisper-large-v3-turbo-german |
|--------------------------|-------------------------------|---------------------------------|-----------------------------------------|
| Tuda-De                  | 8.195                         | **6.360**                       | 6.441                                   |
| common_voice_19_0        | 3.839                         | 3.249                           | **3.217**                               |
| multilingual librispeech | 3.202                         | 2.071                           | **2.067**                               |
| All                      | 3.641                         | 2.633                           | **2.630**                               |

All values are word error rates in percent; lower is better.

The data and code for the evaluations are available [here](https://huggingface.co/datasets/avemio/ASR-GERMAN-MIXED-EVALS-GRAG).
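
For reference, the sketch below shows one way such WER scores can be computed with the Hugging Face `evaluate` library; the `references` and `predictions` lists are placeholders standing in for the ground-truth transcripts and the model outputs from an evaluation set.

```python
import evaluate

# load the WER metric from the Hugging Face evaluate library
wer_metric = evaluate.load("wer")

# placeholder transcripts - in practice these come from the evaluation dataset
references = ["guten morgen wie geht es ihnen"]
predictions = ["guten morgen wie geht es ihnen"]

# compute() returns the word error rate as a fraction; multiply by 100 for percent
wer = 100 * wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.3f}%")
```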

### Training data
The training data for this model consists of spoken German conversations mixed with English business phrases. The data was carefully selected and processed to optimize recognition performance. The dataset will not be published because it is unclear whether it could be misused for voice cloning; the rights to the collected data cover only the intended use of training speech-to-text models.

### How to use

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# run on GPU in half precision if available, otherwise on CPU in full precision
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "avemio/GRAG-WHISPER-LARGE-v3-TURBO"

# load the fine-tuned model and its processor (tokenizer + feature extractor)
model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)
processor = AutoProcessor.from_pretrained(model_id)

# build an ASR pipeline that chunks long audio into 30-second windows
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# transcribe a sample audio file from an example dataset
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]
result = pipe(sample)
print(result["text"])
```
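
Whisper detects the spoken language automatically. For German-only audio it can help to pin the decoder to German via `generate_kwargs`, which the transformers ASR pipeline forwards to the model's `generate` call:

```python
# force German transcription instead of relying on automatic language detection
result = pipe(sample, generate_kwargs={"language": "german", "task": "transcribe"})
print(result["text"])
```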


### Framework versions

- Transformers 4.47.1
- Pytorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0