---
license: apache-2.0
language:
- bn
base_model:
- openai/whisper-small
pipeline_tag: automatic-speech-recognition
---
# BengaliRegionalASR

BengaliRegionalASR is a fine-tune of `openai/whisper-small` trained on the Bengali regional dialect dataset `sha1779/Bengali_Regional_dataset`. This model is trained on the Barishal regional data only. The dataset is taken from the ভাষা-বিচিত্রা: ASR for Regional Dialects competition.
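To inspect the training data, the dataset can be loaded with the `datasets` library. This is a minimal sketch; the split name and column layout are assumptions, so check the dataset card for the actual schema.

```python
from datasets import load_dataset

# Load the regional dialect dataset from the Hugging Face Hub.
# The "train" split and the column names are assumptions; verify
# against the dataset card before relying on them.
ds = load_dataset("sha1779/Bengali_Regional_dataset", split="train")
print(ds)      # available columns and number of rows
print(ds[0])   # first example (audio + transcript)
```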
## Try the model
```bash
pip install librosa torch torchaudio transformers requests
```
```python
import io

import librosa
import requests
import torch
from transformers import (
    WhisperFeatureExtractor,
    WhisperForConditionalGeneration,
    WhisperProcessor,
    WhisperTokenizer,
)

model_path_ = "sha1779/BengaliRegionalASR"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

feature_extractor = WhisperFeatureExtractor.from_pretrained(model_path_)
tokenizer = WhisperTokenizer.from_pretrained(model_path_)
processor = WhisperProcessor.from_pretrained(model_path_)
model = WhisperForConditionalGeneration.from_pretrained(model_path_).to(device)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="bengali", task="transcribe")

audio_url = "https://huggingface.co/sha1779/BengaliRegionalASR/resolve/main/Mp3/valid_barishal%20(1).wav"

# librosa cannot open HTTP URLs directly, so download the file into memory first.
audio_bytes = io.BytesIO(requests.get(audio_url).content)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
speech_array, sampling_rate = librosa.load(audio_bytes, sr=16000)

input_features = feature_extractor(speech_array, sampling_rate=16000, return_tensors="pt").input_features

predicted_ids = model.generate(input_features.to(device))[0]
transcription = processor.decode(predicted_ids, skip_special_tokens=True)
print(transcription)
```
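Since `WhisperProcessor` wraps both the feature extractor and the tokenizer, the separate `WhisperFeatureExtractor` and `WhisperTokenizer` objects above are optional. Continuing from the snippet above, the same transcription using only the processor:

```python
# The processor handles feature extraction (audio -> log-mel features)
# and decoding (token ids -> text) in one object.
input_features = processor(speech_array, sampling_rate=16000, return_tensors="pt").input_features
predicted_ids = model.generate(input_features.to(device))[0]
print(processor.decode(predicted_ids, skip_special_tokens=True))
```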
## For audio longer than 30 seconds
```python
import io

import librosa
import requests
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_path_ = "sha1779/BengaliRegionalASR"
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

processor = WhisperProcessor.from_pretrained(model_path_)
model = WhisperForConditionalGeneration.from_pretrained(model_path_).to(device)
model.config.forced_decoder_ids = processor.get_decoder_prompt_ids(language="bengali", task="transcribe")

audio_url = "https://huggingface.co/sha1779/BengaliRegionalASR/resolve/main/Mp3/valid_barishal%20(1).wav"
audio_bytes = io.BytesIO(requests.get(audio_url).content)
speech_array, sampling_rate = librosa.load(audio_bytes, sr=16000)

# Split the audio into 30-second chunks with a 5-second overlap,
# since Whisper processes at most 30 seconds of audio per pass.
chunk_duration = 30  # seconds
overlap = 5  # seconds
chunk_size = int(chunk_duration * sampling_rate)
overlap_size = int(overlap * sampling_rate)

chunks = [
    speech_array[start : start + chunk_size]
    for start in range(0, len(speech_array), chunk_size - overlap_size)
]

# Transcribe each chunk and stitch the pieces together.
transcriptions = []
for chunk in chunks:
    input_features = processor(chunk, sampling_rate=16000, return_tensors="pt").input_features
    predicted_ids = model.generate(input_features.to(device))[0]
    transcriptions.append(processor.decode(predicted_ids, skip_special_tokens=True))

print(" ".join(transcriptions))
```
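Alternatively, the `transformers` ASR pipeline handles long-form audio itself via built-in chunking. A minimal sketch, where `chunk_length_s` mirrors the 30-second windows above and the file path is a placeholder (decoding non-WAV formats requires ffmpeg):

```python
from transformers import pipeline

# The pipeline downloads the model, chunks long audio internally,
# and merges the overlapping windows back into one transcript.
asr = pipeline(
    "automatic-speech-recognition",
    model="sha1779/BengaliRegionalASR",
    chunk_length_s=30,
    device=0,  # use -1 for CPU
)

result = asr("path/to/long_audio.wav")  # placeholder path to a local file
print(result["text"])
```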
## Evaluation

| Metric | Value |
| --- | --- |
| Word Error Rate | 0.65 % |
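For reference when reproducing this number, WER can be computed with the `evaluate` library (which uses `jiwer` as its backend, so `pip install evaluate jiwer`). A minimal sketch; the reference and hypothesis strings are placeholders, not data from the evaluation set:

```python
import evaluate

wer_metric = evaluate.load("wer")

# Placeholder strings; substitute the evaluation set's transcripts
# and the model's outputs.
references = ["reference transcript"]
predictions = ["model transcript"]

wer = wer_metric.compute(references=references, predictions=predictions)
print(f"WER: {wer:.2%}")
```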