---
library_name: transformers
language:
- de
license: mit
base_model: openai/whisper-large-v3-turbo
tags:
- generated_from_trainer
pipeline_tag: automatic-speech-recognition
---

# GRAG-WHISPER-LARGE-v3-TURBO-HESSIAN-AI

This model is a fine-tune of openai/whisper-large-v3-turbo on a carefully curated 13-hour dataset.

## Evaluations - Word error rate

Lower is better; the best result in each row is in bold.

| Test dataset             | openai-whisper-large-v3-turbo | GRAG-WHISPER-LARGE-v3-TURBO | primeline-whisper-large-v3-turbo-german |
|--------------------------|-------------------------------|-----------------------------|-----------------------------------------|
| Tuda-De                  | 8.195                         | **6.360**                   | 6.441                                   |
| common_voice_19_0        | 3.839                         | 3.249                       | **3.217**                               |
| multilingual librispeech | 3.202                         | 2.071                       | **2.067**                               |
| All                      | 3.641                         | 2.633                       | **2.630**                               |

The data and code for the evaluations are available [here](https://huggingface.co/datasets/avemio/ASR-GERMAN-MIXED-EVALS-GRAG).

### Training data

The training data for this model consists of spoken German conversations mixed with English business phrases. The data was carefully selected and processed to optimize recognition performance.

The dataset will not be published, because it is unclear whether the data could be misused for voice cloning. The rights to the collected data cover only the intended use of training speech-to-text models.
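
For readers who want to sanity-check figures like those in the evaluation table, the sketch below shows the conventional definition of word error rate: the word-level edit distance (substitutions, deletions, insertions) divided by the number of reference words. It is a minimal illustration, not the evaluation code linked above. The function name `word_error_rate`, the lowercase-and-split normalization, and the scaling to percent are assumptions; the actual evaluation may normalize text differently.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Return the word error rate in percent (scaling to percent is an assumption)."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr

    return 100.0 * prev[-1] / len(ref)


if __name__ == "__main__":
    # One deleted word out of four reference words.
    print(word_error_rate("das ist ein test", "das ist test"))
```

Because insertions count as errors, the value can exceed 100 when the hypothesis is much longer than the reference.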
### How to use

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

# Use the GPU with half precision when available; otherwise fall back to CPU and float32.
device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model_id = "avemio/GRAG-WHISPER-LARGE-v3-TURBO"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(model_id)

# Build an ASR pipeline that splits long audio into 30-second chunks and batches them.
pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    max_new_tokens=128,
    chunk_length_s=30,
    batch_size=16,
    return_timestamps=True,
    torch_dtype=torch_dtype,
    device=device,
)

# Demo sample from an English long-form dataset; replace it with your own German audio.
dataset = load_dataset("distil-whisper/librispeech_long", "clean", split="validation")
sample = dataset[0]["audio"]

result = pipe(sample)
print(result["text"])
```

### Framework versions

- Transformers 4.47.1
- PyTorch 2.5.1+cu121
- Datasets 3.2.0
- Tokenizers 0.21.0