---
datasets:
- PrompTart/PTT_advanced_en_ko
language:
- en
- ko
base_model:
- google/gemma-2-2b
library_name: transformers
---

# Gemma 2 Fine-Tuned on Parenthetical Terminology Translation (PTT) Dataset

## Model Overview

This is a **gemma-2-2b** model fine-tuned on the [**Parenthetical Terminology Translation (PTT)**](https://aclanthology.org/2024.wmt-1.129/) dataset. [The PTT dataset](https://huggingface.co/datasets/PrompTart/PTT_advanced_en_ko) focuses on translating technical terms accurately by placing the original English term in parentheses alongside its Korean translation, enhancing clarity and precision in specialized fields. This fine-tuned model is optimized for handling technical terminology in the **Artificial Intelligence (AI)** domain.

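To make the parenthetical format concrete, here is a minimal sketch that pulls the English terms out of a PTT-style translation. The helper name `extract_english_terms`, the regular expression, and the sample sentence are illustrative assumptions, not part of the dataset or the model's tooling.

```python
import re

# Assumption: an annotated term looks like "<Korean term>(<English term>)",
# and the English term starts with an ASCII letter.
PAREN_TERM = re.compile(r"\(([A-Za-z][^()]*)\)")


def extract_english_terms(translation: str) -> list[str]:
    """Return the English terms that appear in parentheses, in order."""
    return [term.strip() for term in PAREN_TERM.findall(translation)]


# Hypothetical PTT-style sentence, written for illustration only.
sample = "이 모델은 지식 증류(knowledge distillation)로 학습한 대형 언어 모델(large language model)입니다."
terms = extract_english_terms(sample)
print(terms)
```

Parentheses that contain only Korean text are skipped because the pattern requires an ASCII letter right after the opening parenthesis.
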
## Example Usage

Here's how to use this fine-tuned model with the Hugging Face `transformers` library:

```python
from transformers import AutoTokenizer, AutoModelForCausalLM

# Load the model and tokenizer
model_name = "PrompTartLAB/gemma2_2B_PTT_en_ko"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Example sentence
text = "The model was fine-tuned using knowledge distillation techniques. The training dataset was created using a collaborative multi-agent framework powered by large language models."
prompt = f"Translate input sentence to Korean \n### Input: {text} \n### Translated:"

# Tokenize the prompt and generate the translation
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=1024)

# Decode only the newly generated tokens, skipping the prompt
prompt_length = inputs["input_ids"].shape[1]
out_message = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
print(out_message)

# Example output:
# " 모델은 지식 증류 기법(knowledge distillation techniques)을 사용하여 훈련되었습니다. 훈련 데이터셋은 대형 언어 모델(large language models)에 의해 구동되는 협력형 다중 에이전트 프레임워크(collaborative multi-agent framework)를 사용하여 생성되었습니다."
```
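When translating more than one sentence, it can help to wrap the prompt template and the decoding step in small functions. The sketch below does that; `build_prompt` and `translate` are hypothetical helper names, and the template is copied from the usage example rather than taken from any official API. `translate` expects a model and tokenizer that have already been loaded with `transformers`.

```python
# Prompt template copied from the usage example; the helper names are assumptions.
PROMPT_TEMPLATE = "Translate input sentence to Korean \n### Input: {text} \n### Translated:"


def build_prompt(text: str) -> str:
    """Fill the prompt template with a single English input text."""
    return PROMPT_TEMPLATE.format(text=text)


def translate(text, model, tokenizer, max_new_tokens=1024):
    """Generate a translation and return only the newly generated text."""
    inputs = tokenizer(build_prompt(text), return_tensors="pt").to(model.device)
    outputs = model.generate(**inputs, max_new_tokens=max_new_tokens)
    # Drop the prompt tokens so that only the translation is decoded.
    prompt_length = inputs["input_ids"].shape[1]
    decoded = tokenizer.decode(outputs[0][prompt_length:], skip_special_tokens=True)
    # The raw output may start with a space, so trim surrounding whitespace.
    return decoded.strip()
```

With a loaded model, a call would look like `translate("Attention is a core mechanism.", model, tokenizer)`.
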

## Limitations

- **Out-of-Domain Accuracy**: While the model generalizes to some extent, accuracy may vary in domains that were not part of the training set.
- **Incomplete Parenthetical Annotation**: Not all technical terms are consistently displayed in parentheses; in some cases, terms may be omitted or not annotated as expected.

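Because annotations can be missing, a simple post-check may be useful when specific terms must appear in parentheses. The following is a minimal sketch under stated assumptions: `missing_annotations` is a hypothetical helper, the caller supplies the list of expected English terms, and matching is a case-insensitive substring test against the parenthesized text.

```python
import re


def missing_annotations(translation: str, expected_terms: list[str]) -> list[str]:
    """Return the expected terms that do not appear inside any parentheses."""
    annotated = [m.strip().lower() for m in re.findall(r"\(([^()]*)\)", translation)]
    missing = []
    for term in expected_terms:
        needle = term.strip().lower()
        # Substring match, so "knowledge distillation" also matches
        # "(knowledge distillation techniques)".
        if not any(needle in group for group in annotated):
            missing.append(term)
    return missing


# Hypothetical output in which one term was translated but not annotated.
output = "지식 증류 기법(knowledge distillation techniques)은 대형 언어 모델을 압축합니다."
gaps = missing_annotations(output, ["knowledge distillation", "large language models"])
print(gaps)
```

Terms reported as missing can then be reviewed by hand or the sentence can be re-translated.
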
## Citation

If you use this model in your research, please cite the original dataset and paper:

```tex
@inproceedings{jiyoon-etal-2024-efficient,
    title = "Efficient Technical Term Translation: A Knowledge Distillation Approach for Parenthetical Terminology Translation",
    author = "Jiyoon, Myung  and
      Park, Jihyeon  and
      Son, Jungki  and
      Lee, Kyungro  and
      Han, Joohyung",
    editor = "Haddow, Barry  and
      Kocmi, Tom  and
      Koehn, Philipp  and
      Monz, Christof",
    booktitle = "Proceedings of the Ninth Conference on Machine Translation",
    month = nov,
    year = "2024",
    address = "Miami, Florida, USA",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2024.wmt-1.129",
    doi = "10.18653/v1/2024.wmt-1.129",
    pages = "1410--1427",
    abstract = "This paper addresses the challenge of accurately translating technical terms, which are crucial for clear communication in specialized fields. We introduce the Parenthetical Terminology Translation (PTT) task, designed to mitigate potential inaccuracies by displaying the original term in parentheses alongside its translation. To implement this approach, we generated a representative PTT dataset using a collaborative approach with large language models and applied knowledge distillation to fine-tune traditional Neural Machine Translation (NMT) models and small-sized Large Language Models (sLMs). Additionally, we developed a novel evaluation metric to assess both overall translation accuracy and the correct parenthetical presentation of terms. Our findings indicate that sLMs did not consistently outperform NMT models, with fine-tuning proving more effective than few-shot prompting, particularly in models with continued pre-training in the target language. These insights contribute to the advancement of more reliable terminology translation methodologies.",
}
```

## Contact

For questions or feedback, please contact [[email protected]](mailto:[email protected]).