M2M100 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset

Model Overview

This is an M2M100 model fine-tuned on the Parenthetical Terminology Translation (PTT) dataset. The PTT dataset targets accurate translation of technical terms: each Korean translation of a term is followed by the original English term in parentheses, which improves clarity and precision in specialized fields. This fine-tuned model is optimized for technical terminology in the Artificial Intelligence (AI) domain.

Example Usage

Here's how to use this fine-tuned model with the Hugging Face transformers library:

Note: M2M100Tokenizer depends on sentencepiece, so install it before running the example: pip install sentencepiece
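If it is unclear whether the environment has everything the example needs, a small pre-flight check can help. The sketch below is only an illustration: the helper name missing_packages is hypothetical, and the package list (transformers, sentencepiece, torch) is an assumption based on the imports and the return_tensors="pt" argument used in the example.

```python
import importlib.util

def missing_packages(names):
    """Return the names from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

# Assumed requirements for the usage example; adjust to your setup.
required = ["transformers", "sentencepiece", "torch"]
missing = missing_packages(required)
if missing:
    print("Install before running the example: pip install " + " ".join(missing))
else:
    print("All required packages were found.")
```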

from transformers import M2M100ForConditionalGeneration, M2M100Tokenizer

model_name = "PrompTartLAB/m2m100_418M_PTT_en_ko"
tokenizer = M2M100Tokenizer.from_pretrained(model_name)
model = M2M100ForConditionalGeneration.from_pretrained(model_name)

# Example sentences: one string per sentence, so each is translated as its own batch item
sentences = [
    "The model was fine-tuned using knowledge distillation techniques.",
    "The training dataset was created using a collaborative multi-agent framework powered by large language models.",
]

# Tokenize and generate translation
tokenizer.src_lang = "en"
encoded = tokenizer(sentences, return_tensors="pt", padding=True)
generated_tokens = model.generate(**encoded, forced_bos_token_id=tokenizer.get_lang_id("ko"))
outputs = tokenizer.batch_decode(generated_tokens, skip_special_tokens=True)
print('\n'.join(outputs))
# => "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다.
# 훈련 데이터셋(training dataset)은 대형 언어 모델(large language models)을 기반으로 한 협업 다중 에이전트 프레임워크(collaborative multi-agent framework)를 사용하여 생성되었습니다."
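The example passes the tokenizer one sentence per batch item. For longer passages, the text has to be split into sentences first. The following is a minimal sketch of one way to do that; split_sentences is a hypothetical helper, not part of transformers, and the regex is deliberately naive (it will also split after abbreviations such as "e.g." or "Fig.").

```python
import re

def split_sentences(text):
    """Split on whitespace that follows '.', '!' or '?', keeping the punctuation."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]

paragraph = (
    "The model was fine-tuned using knowledge distillation techniques. "
    "The training dataset was created using a collaborative multi-agent framework."
)
sentences = split_sentences(paragraph)
print(sentences)  # a list with two sentences, each still ending in a period
```

The resulting list can be passed to the tokenizer in place of a hand-written list of sentences.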

Limitations

  • Out-of-Domain Accuracy: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set.
  • Incomplete Parenthetical Annotation: Not all technical terms are consistently displayed in parentheses; in some cases, terms may be omitted or not annotated as expected.
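Because annotations can be missing, it may be worth checking translations against a list of terms you expect to see in parentheses. The sketch below is one possible check under simple assumptions: the helper names and the expected-term list are hypothetical, matching is case-insensitive and exact, and nested parentheses are not handled.

```python
import re

def parenthesized_terms(translation):
    """Return the lower-cased contents of every (...) group in the translation."""
    return {m.strip().lower() for m in re.findall(r"\(([^()]*)\)", translation)}

def missing_annotations(expected_terms, translation):
    """List the expected English terms that do not appear inside parentheses."""
    found = parenthesized_terms(translation)
    return [term for term in expected_terms if term.lower() not in found]

# First sentence of the sample output from the usage example.
output = "이 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 미세 조정되었습니다."
# Hypothetical list of terms the caller wants annotated.
expected = ["knowledge distillation techniques", "fine-tuning"]
print(missing_annotations(expected, output))  # prints ['fine-tuning']
```

Terms reported as missing can then be reviewed manually or post-edited.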

Citation

If you use this model in your research, please cite the original dataset and paper:

@misc{myung2024efficienttechnicaltermtranslation,
      title={Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation}, 
      author={Jiyoon Myung and Jihyeon Park and Jungki Son and Kyungro Lee and Joohyung Han},
      year={2024},
      eprint={2410.00683},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2410.00683}, 
}

Contact

For questions or feedback, please contact [email protected].

Model size: 484M params (Safetensors, tensor type F32)