	---
	license: cc-by-nc-4.0
	library_name: transformers
	datasets:
	- ai4bharat/NPTEL
	- ai4bharat/IndicVoices-ST
	- ai4bharat/WordProject
	- ai4bharat/Spoken-Tutorial
	- ai4bharat/Mann-ki-Baat
	- ai4bharat/Vanipedia
	- ai4bharat/UGCE-Resources
	pipeline_tag: automatic-speech-recognition
	language:
	- en
	- as
	- bn
	- gu
	- hi
	- ta
	- te
	- ur
	- kn
	- ml
	- mr
	- sd
	- ne
	---
	# IndicSeamless for Speech-to-Text Translation

	<a target="_blank" href="https://huggingface.co/spaces/ai4bharat/indic-seamless">
	<img src="https://huggingface.co/datasets/huggingface/badges/raw/main/open-in-hf-spaces-sm.svg" alt="Open in HuggingFace"/>
	</a>

	## Model Overview
	This repository hosts IndicSeamless, a SeamlessM4T-v2 model fine-tuned on the BhasaAnuvaad dataset for speech-to-text translation (STT) across Indian languages. Before training, the dataset was filtered using the following thresholds:

	- Alignment Score: 0.8
	- Mining Score: 0.6
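
The filtering step can be sketched roughly as follows. This is only an illustration under stated assumptions: both values are treated as lower bounds, the field names `alignment_score` and `mining_score` are hypothetical rather than the dataset's actual column names, and a threshold is applied only when an example carries that score.

```python
# Hypothetical field names; the actual BhasaAnuvaad columns may differ.
# Both thresholds are assumed to be minimum acceptable scores.
THRESHOLDS = {"alignment_score": 0.8, "mining_score": 0.6}

def keep_pair(example):
    """Return True unless a score present on the example falls below its threshold."""
    for field, minimum in THRESHOLDS.items():
        score = example.get(field)
        if score is not None and score < minimum:
            return False
    return True

samples = [
    {"id": "a", "alignment_score": 0.91, "mining_score": 0.72},
    {"id": "b", "alignment_score": 0.75},  # aligned pair below the threshold
    {"id": "c", "mining_score": 0.55},     # mined pair below the threshold
    {"id": "d", "mining_score": 0.60},     # exactly at the threshold, so kept
]
kept = [s["id"] for s in samples if keep_pair(s)]
print(kept)
```

With a Hugging Face `datasets.Dataset`, the same predicate could be passed to `dataset.filter(keep_pair)`.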

	### Performance Highlights
	- The model outperforms the base SeamlessM4T-v2 model and all competing STT systems, including cascaded approaches.
	- It achieves a new state of the art (SOTA) on FLEURS and significantly surpasses all other systems on the BhasaAnuvaad test set, which covers diverse data from new domains.

	## Model Usage
	### Installation
	Ensure you have the required dependencies installed:
	```bash
	pip install torch torchaudio transformers datasets
	```

	### Loading the Model
	```python
	import torchaudio
	from transformers import SeamlessM4Tv2ForSpeechToText
	from transformers import SeamlessM4TTokenizer, SeamlessM4TFeatureExtractor

	model = SeamlessM4Tv2ForSpeechToText.from_pretrained("ai4bharat/indic-seamless").to("cuda")
	processor = SeamlessM4TFeatureExtractor.from_pretrained("ai4bharat/indic-seamless")
	tokenizer = SeamlessM4TTokenizer.from_pretrained("ai4bharat/indic-seamless")
	```

	### Single Audio Inference
	```python
	audio, orig_freq = torchaudio.load("../10002398547238927970.wav")
	audio = torchaudio.functional.resample(audio, orig_freq=orig_freq, new_freq=16_000) # must be a 16 kHz waveform array
	audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")

	text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
	print(tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True))
	```

	### Inference on Fleurs Dataset
	```python
	from datasets import load_dataset

	dataset = load_dataset("google/fleurs", "hi_in", split="test")

	def process_audio(example):
	    audio = example["audio"]["array"]
	    audio_inputs = processor(audio, sampling_rate=16_000, return_tensors="pt").to("cuda")
	    text_out = model.generate(**audio_inputs, tgt_lang="hin")[0].cpu().numpy().squeeze()
	    return {"predicted_text": tokenizer.decode(text_out, clean_up_tokenization_spaces=True, skip_special_tokens=True)}

	dataset = dataset.map(process_audio)
	dataset = dataset.remove_columns(["audio"])
	dataset.to_csv("fleurs_hi_predictions.csv")
	```

	### Batch Translation using Fleurs
	```python
	from datasets import load_dataset

	def process_batch(batch, tgt_lang="hin"):
	    # Gather the raw waveforms for every example in the batch
	    audio_arrays = [audio["array"] for audio in batch["audio"]]
	    # padding=True pads the shorter clips so the batch forms one tensor
	    audio_inputs = processor(audio_arrays, sampling_rate=16_000, return_tensors="pt", padding=True).to("cuda")
	    text_outs = model.generate(**audio_inputs, tgt_lang=tgt_lang)
	    batch["predicted_text"] = [tokenizer.decode(text_out.cpu().numpy().squeeze(), clean_up_tokenization_spaces=True, skip_special_tokens=True) for text_out in text_outs]
	    return batch

	def batch_translate(language_code="hi_in", tgt_lang="hin"):
	    # language_code selects the Fleurs config (source audio); tgt_lang is the output language
	    dataset = load_dataset("google/fleurs", language_code, split="test")
	    # fn_kwargs forwards tgt_lang to process_batch so it is no longer hard-coded
	    dataset = dataset.map(process_batch, batched=True, batch_size=8, fn_kwargs={"tgt_lang": tgt_lang})
	    return dataset["predicted_text"]

	# Example usage
	source_language_code = "hi_in"
	translations = batch_translate(source_language_code, tgt_lang="hin")
	print(translations)
	```
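
The examples above use two different code styles: Fleurs config names such as `hi_in` for the source audio and three-letter codes such as `hin` for `tgt_lang`. A small lookup table can keep the two in sync. The sketch below is an assumption-laden illustration, not part of the model's API: the table covers the languages listed in this model card, the helper name `seamless_code` is hypothetical, and the codes should be checked against the model's tokenizer before relying on them.

```python
# Assumed mapping: Fleurs config name -> three-letter target-language code.
# Verify each code against the model's tokenizer before use.
FLEURS_TO_SEAMLESS = {
    "en_us": "eng",  # English
    "as_in": "asm",  # Assamese
    "bn_in": "ben",  # Bengali
    "gu_in": "guj",  # Gujarati
    "hi_in": "hin",  # Hindi
    "ta_in": "tam",  # Tamil
    "te_in": "tel",  # Telugu
    "ur_pk": "urd",  # Urdu
    "kn_in": "kan",  # Kannada
    "ml_in": "mal",  # Malayalam
    "mr_in": "mar",  # Marathi
    "sd_in": "snd",  # Sindhi
    "ne_np": "npi",  # Nepali
}

def seamless_code(fleurs_config):
    """Look up the target-language code for a Fleurs config name (hypothetical helper)."""
    try:
        return FLEURS_TO_SEAMLESS[fleurs_config]
    except KeyError:
        raise ValueError(f"No known target code for Fleurs config {fleurs_config!r}") from None

print(seamless_code("hi_in"))
```

For instance, `batch_translate("en_us", tgt_lang=seamless_code("hi_in"))` would request Hindi text for the English Fleurs test split.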

	## Citation

	If you use BhasaAnuvaad in your work, please cite us:

	```bibtex
	@misc{jain2024bhasaanuvaadspeechtranslationdataset,
	title={BhasaAnuvaad: A Speech Translation Dataset for 13 Indian Languages},
	author={Sparsh Jain and Ashwin Sankar and Devilal Choudhary and Dhairya Suman and Nikhil Narasimhan and Mohammed Safi Ur Rahman Khan and Anoop Kunchukuttan and Mitesh M Khapra and Raj Dabre},
	year={2024},
	eprint={2411.04699},
	archivePrefix={arXiv},
	primaryClass={cs.CL},
	url={https://arxiv.org/abs/2411.04699},
	}
	```


	## License

	This model is released under the Creative Commons Attribution-NonCommercial 4.0 International (CC BY-NC 4.0) license.