---
license: cc-by-nc-4.0
library_name: transformers
datasets:
- ai4bharat/NPTEL
- ai4bharat/IndicVoices-ST
- ai4bharat/WordProject
- ai4bharat/Spoken-Tutorial
- ai4bharat/Mann-ki-Baat
- ai4bharat/Vanipedia
- ai4bharat/UGCE-Resources
pipeline_tag: automatic-speech-recognition
language:
- en
- as
- bn
- gu
- hi
- ta
- te
- ur
- kn
- ml
- mr
- sd
- ne
---

# IndicSeamless for Speech-to-Text Translation

## Model Overview

This repository hosts **IndicSeamless**, a SeamlessM4T-v2 model fine-tuned on the **BhasaAnuvaad** dataset for **speech-to-text translation (STT)** across **Indian languages**. The dataset was filtered using the following thresholds before training:

- **Alignment Score**: 0.8
- **Mining Score**: 0.6

### Performance Highlights

- The model **outperforms the base SeamlessM4T-v2 model** and all competing STT systems, including cascaded approaches.
- It **achieves a new SOTA on Fleurs and significantly surpasses all other systems on the BhasaAnuvaad test set**, which includes a diverse range of data from new domains.

## Model Usage

### Installation

Ensure you have the required dependencies installed:

```bash
pip install torch torchaudio transformers datasets
```

### Loading the Model

```python
import torchaudio
from transformers import (
    SeamlessM4Tv2ForSpeechToText,
    SeamlessM4TTokenizer,
    SeamlessM4TFeatureExtractor,
)

model = SeamlessM4Tv2ForSpeechToText.from_pretrained("ai4bharat/indic-seamless").to("cuda")
processor = SeamlessM4TFeatureExtractor.from_pretrained("ai4bharat/indic-seamless")
tokenizer = SeamlessM4TTokenizer.from_pretrained("ai4bharat/indic-seamless")
```

### Single Audio Inference

```python
audio, orig_freq = torchaudio.load("../10002398547238927970.wav")
# The model expects a 16 kHz waveform array
audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000)

audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")
text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
print(tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True))
```

### Inference on the Fleurs Dataset

```python
from datasets import load_dataset

dataset = load_dataset("google/fleurs", "hi_in", split="test")

def process_audio(example):
    audio = example["audio"]["array"]
    audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")
    text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
    return {"predicted_text": tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True)}

dataset = dataset.map(process_audio)
dataset = dataset.remove_columns(["audio"])
dataset.to_csv("fleurs_hi_predictions.csv")
```

### Batch Translation using Fleurs

```python
from datasets import load_dataset

def process_batch(batch, tgt_lang="hin"):
    audio_arrays = [audio["array"] for audio in batch["audio"]]
    audio_inputs = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding=True).to("cuda")
    text_outs = model.generate(**audio_inputs, tgt_lang=tgt_lang)
    batch["predicted_text"] = [
        tokenizer.decode(text_out.cpu().numpy().squeeze(), clean_up_tokenization_spaces=True, skip_special_tokens=True)
        for text_out in text_outs
    ]
    return batch

def batch_translate(language_code="hi_in", tgt_lang="hin"):
    dataset = load_dataset("google/fleurs", language_code, split="test")
    # Forward tgt_lang to process_batch so the argument is actually honored
    dataset = dataset.map(process_batch, batched=True, batch_size=8, fn_kwargs={"tgt_lang": tgt_lang})
    return dataset["predicted_text"]

# Example usage
target_language = "hi_in"
translations = batch_translate(target_language, tgt_lang="hin")
print(translations)
```
## Citation

If you use BhasaAnuvaad in your work, please cite us:

```bibtex
@misc{jain2024bhasaanuvaadspeechtranslationdataset,
      title={BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages},
      author={Sparsh Jain and Ashwin Sankar and Devilal Choudhary and Dhairya Suman and Nikhil Narasimhan and Mohammed Safi Ur Rahman Khan and Anoop Kunchukuttan and Mitesh M Khapra and Raj Dabre},
      year={2024},
      eprint={2411.04699},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2411.04699},
}
```

## License

This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.