Val123val/ru_whisper_small

@@ -21,17 +21,115 @@ This model is a fine-tuned version of [openai/whisper-small](https://huggingface
 ## Model description
-More information needed
 ## Intended uses & limitations
-More information needed
-## Training and evaluation data
-More information needed
-## Training procedure
 ### Training hyperparameters

 ## Model description
+Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. It was trained on 680k hours of labelled speech data annotated using large-scale weak supervision. Russian accounts for only 5k hours of that total.
+ru_whisper_small is a fine-tuned version of [openai/whisper-small](https://huggingface.co/openai/whisper-small) on the Sberdevices_golos_10h_crowd dataset. It is potentially quite useful as an ASR solution for developers, especially for Russian speech recognition, and it may exhibit additional capabilities if fine-tuned further on specific tasks.
 ## Intended uses & limitations
+from transformers import WhisperProcessor, WhisperForConditionalGeneration
+from datasets import load_dataset
+# load model and processor
+processor = WhisperProcessor.from_pretrained("Val123val/ru_whisper_small")
+model = WhisperForConditionalGeneration.from_pretrained("Val123val/ru_whisper_small")
+model.config.forced_decoder_ids = None
+# load dataset and read audio files
+ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
+sample = ds[0]["audio"]
+input_features = processor(sample["array"], sampling_rate=sample["sampling_rate"], return_tensors="pt").input_features
+# generate token ids
+predicted_ids = model.generate(input_features)
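+# decode token ids to text, keeping special tokens such as <|startoftranscript|>
+transcription_with_special_tokens = processor.batch_decode(predicted_ids, skip_special_tokens=False)
+# decode again without special tokens to get the plain transcription
+transcription = processor.batch_decode(predicted_ids, skip_special_tokens=True)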
+## Long-Form Transcription
+The Whisper model is intrinsically designed to work on audio samples of up to 30s in duration. However, with a chunking algorithm it can transcribe audio of arbitrary length. This is available through the Transformers `pipeline` method: chunking is enabled by setting `chunk_length_s=30` when instantiating the pipeline. With chunking enabled, the pipeline can be run with batched inference. It can also predict sequence-level timestamps when `return_timestamps=True` is passed:
+import torch
+from transformers import pipeline
+from datasets import load_dataset
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+pipe = pipeline(
+  "automatic-speech-recognition",
+  model="Val123val/ru_whisper_small",
+  chunk_length_s=30,
+  device=device,
+)
+ds = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
+sample = ds[0]["audio"]
+prediction = pipe(sample.copy(), batch_size=8)["text"]
+# we can also return timestamps for the predictions
+prediction = pipe(sample.copy(), batch_size=8, return_timestamps=True)["chunks"]
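The chunking idea described above can be illustrated with a small, self-contained sketch. This is not the pipeline's actual implementation: the helper name `chunk_windows` is hypothetical, and the 5-second overlap on each side is an assumption (one sixth of a 30s chunk). It only shows how a long recording can be covered by overlapping windows of at most 30s, so that each window fits Whisper's input length and neighbouring transcriptions can be stitched together on the overlap.

```python
def chunk_windows(num_samples, sampling_rate=16_000,
                  chunk_length_s=30.0, stride_length_s=5.0):
    """Return (start, end) sample indices of overlapping windows covering the audio.

    stride_length_s is the overlap kept on each side of a window (assumed value).
    """
    chunk_len = int(round(chunk_length_s * sampling_rate))
    stride = int(round(stride_length_s * sampling_rate))
    # Each new window starts `step` samples after the previous one,
    # so consecutive windows share 2 * stride samples.
    step = chunk_len - 2 * stride
    if step <= 0:
        raise ValueError("chunk_length_s must be greater than 2 * stride_length_s")

    windows = []
    start = 0
    while True:
        end = min(start + chunk_len, num_samples)
        windows.append((start, end))
        if end >= num_samples:
            break
        start += step
    return windows


# Example: a 70-second recording sampled at 16 kHz
windows = chunk_windows(70 * 16_000)
for start, end in windows:
    print(f"{start / 16_000:.0f}s -> {end / 16_000:.0f}s")
```

Each window can then be transcribed independently, which is what makes batched inference over the chunks possible.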
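+## Faster inference with Speculative Decoding
+Speculative Decoding was proposed in "Fast Inference from Transformers via Speculative Decoding" by Yaniv Leviathan et al. from Google. It works on the premise that a faster assistant model very often generates the same tokens as a larger main model. The assistant model drafts candidate tokens and the main model verifies them, so the transcription is the same as the main model would produce on its own.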
+import torch
+from datasets import load_dataset
+from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor
+device = "cuda:0" if torch.cuda.is_available() else "cpu"
+torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32
+dataset = load_dataset("bond005/sberdevices_golos_10h_crowd", split="validation", token=True)
+model_id = "Val123val/ru_whisper_small"
+model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    model_id,
+    torch_dtype=torch_dtype,
+    low_cpu_mem_usage=True,
+    use_safetensors=True,
+    attn_implementation="sdpa",
+)
+model.to(device)
+processor = AutoProcessor.from_pretrained(model_id)
+assistant_model_id = "openai/whisper-tiny"
+assistant_model = AutoModelForSpeechSeq2Seq.from_pretrained(
+    assistant_model_id,
+    torch_dtype=torch_dtype,
+    low_cpu_mem_usage=True,
+    use_safetensors=True,
+    attn_implementation="sdpa",
+)
+assistant_model.to(device)
+from transformers import pipeline
+pipe = pipeline(
+    "automatic-speech-recognition",
+    model=model,
+    tokenizer=processor.tokenizer,
+    feature_extractor=processor.feature_extractor,
+    max_new_tokens=128,
+    chunk_length_s=15,
+    batch_size=4,
+    generate_kwargs={"assistant_model": assistant_model},
+    torch_dtype=torch_dtype,
+    device=device,
+)
+sample = dataset[0]["audio"]
+result = pipe(sample)
+print(result["text"])
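To make the draft-and-verify premise concrete, here is a toy sketch of greedy speculative decoding. It is an illustration under stated assumptions, not how Transformers implements the `assistant_model` option: the "models" are plain Python functions over integer tokens, and all names (`speculative_greedy`, `main_next`, `draft_next`, `greedy`) are hypothetical. A real implementation verifies all drafted tokens in a single forward pass of the main model and can emit one extra token when every draft is accepted; both details are omitted here for brevity.

```python
def speculative_greedy(main_next, draft_next, prompt, max_new_tokens, num_draft=4):
    """Greedy decoding where a cheap draft model proposes tokens and the main model verifies them."""
    tokens = list(prompt)
    target_len = len(prompt) + max_new_tokens
    while len(tokens) < target_len:
        # 1. The draft model proposes up to `num_draft` tokens autoregressively.
        draft = []
        for _ in range(min(num_draft, target_len - len(tokens))):
            draft.append(draft_next(tokens + draft))

        # 2. The main model checks each proposed position in order.
        accepted = []
        for i, tok in enumerate(draft):
            expected = main_next(tokens + draft[:i])
            if tok == expected:
                accepted.append(tok)       # draft agreed with the main model
            else:
                accepted.append(expected)  # replace the first mismatch and stop
                break
        tokens.extend(accepted)
    return tokens


def greedy(next_token, prompt, max_new_tokens):
    """Plain greedy decoding, used as the reference output."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        tokens.append(next_token(tokens))
    return tokens


# Toy "main" model: counts upward modulo 10.
def main_next(seq):
    return (seq[-1] + 1) % 10


# Toy "draft" model: usually agrees with the main model, but is wrong after a 4.
def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10


prompt = [0]
fast = speculative_greedy(main_next, draft_next, prompt, max_new_tokens=12)
reference = greedy(main_next, prompt, max_new_tokens=12)
print(fast == reference)
```

Because every accepted token is one the main model would have chosen anyway, the result matches plain greedy decoding with the main model; the speed-up comes from how often the draft model's guesses are accepted.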
 ### Training hyperparameters