Vietnamese to Lao Translation Model

Machine translation for low-resource languages remains a hard problem in natural language processing (NLP), largely because parallel training data is scarce. This model addresses one such language pair: it translates from Vietnamese to Lao.

Lao, spoken primarily in Laos and parts of Thailand, is a low-resource language for machine translation: parallel corpora and other linguistic resources are limited. Vietnamese, spoken by millions worldwide, shares some linguistic similarities with Lao and is the source language of this model.

The model is built on T5, a Transformer-based encoder-decoder model used across a wide range of NLP tasks. It was fine-tuned on a curated dataset of Lao-Vietnamese parallel texts, with the goal of improving translation accuracy and fluency for the Vietnamese-Lao language pair.

By applying a pretrained Transformer model to the Lao-Vietnamese language pair and its specific linguistic characteristics, this work aims to provide a useful resource for cross-linguistic communication and cultural exchange involving a low-resource language.

How to use

On GPU

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model, then move the model to the GPU
tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-lao")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-lao")
model.cuda()
model.eval()  # inference mode (disables dropout)

src = "Tôi muốn mua một cuốn sách"
# Encode the source sentence and move the input IDs to the GPU
tokenized_text = tokenizer.encode(src, return_tensors="pt").cuda()
translate_ids = model.generate(tokenized_text, max_length=200)
# Decode the generated token IDs back into text
output = tokenizer.decode(translate_ids[0], skip_special_tokens=True)
output

'ຂ້ອຍຢາກຊື້ປຶ້ມ'

On CPU

from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

# Load the tokenizer and model (both stay on the CPU by default)
tokenizer = AutoTokenizer.from_pretrained("minhtoan/t5-translate-vietnamese-lao")
model = AutoModelForSeq2SeqLM.from_pretrained("minhtoan/t5-translate-vietnamese-lao")

src = "Tôi muốn mua một cuốn sách"
# A single sentence needs no padding; truncate inputs longer than 200 tokens
inputs = tokenizer(src, max_length=200, return_tensors="pt", truncation=True)
# Pass the attention mask along with the input IDs
outputs = model.generate(input_ids=inputs.input_ids, attention_mask=inputs.attention_mask, max_new_tokens=200)
output = tokenizer.batch_decode(outputs, skip_special_tokens=True)[0]
output

'ຂ້ອຍຢາກຊື້ປຶ້ມ'
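
Translating several sentences

The two snippets above each translate a single sentence on a fixed device. The following is a minimal sketch, not part of the original model card, of how the same calls might be wrapped to pick the device automatically and translate a list of sentences in batches. The helper names `chunked`, `load_translator` and `translate_all`, and the default batch size of 8, are assumptions made for illustration; only the model name and the `transformers` calls come from the examples above.

```python
from typing import Callable, Iterator, List, Sequence

MODEL_NAME = "minhtoan/t5-translate-vietnamese-lao"


def chunked(items: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `items` with at most `size` elements."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield list(items[start:start + size])


def load_translator(model_name: str = MODEL_NAME) -> Callable[[List[str]], List[str]]:
    """Hypothetical helper: load the model once and return a batch translate function."""
    # Heavy imports stay local so the pure-Python helpers work without them.
    import torch
    from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

    # Use the GPU when one is available, otherwise fall back to the CPU.
    device = "cuda" if torch.cuda.is_available() else "cpu"
    tokenizer = AutoTokenizer.from_pretrained(model_name)
    model = AutoModelForSeq2SeqLM.from_pretrained(model_name).to(device)
    model.eval()

    def translate(batch: List[str]) -> List[str]:
        # Pad to the longest sentence in the batch; the attention mask
        # tells the model which positions are padding.
        enc = tokenizer(batch, return_tensors="pt", padding=True,
                        truncation=True, max_length=200).to(device)
        out = model.generate(input_ids=enc["input_ids"],
                             attention_mask=enc["attention_mask"],
                             max_new_tokens=200)
        return tokenizer.batch_decode(out, skip_special_tokens=True)

    return translate


def translate_all(sentences: Sequence[str],
                  translate: Callable[[List[str]], List[str]],
                  batch_size: int = 8) -> List[str]:
    """Translate `sentences` batch by batch, preserving input order."""
    results: List[str] = []
    for batch in chunked(sentences, batch_size):
        results.extend(translate(batch))
    return results


if __name__ == "__main__":
    translate = load_translator()  # downloads the model on first use
    for line in translate_all(["Tôi muốn mua một cuốn sách"], translate):
        print(line)
```

Passing the translate function in as an argument keeps the batching logic independent of the model, so it can be checked with a stand-in function before the real weights are downloaded.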

Author

Phan Minh Toan
