---
language:
- ja
- en
base_model:
- google/gemma-2-2b-jpn-it
tags:
- translation
---
# Model Card for gemma-2-2b-jpn-it-translate

gemma-2-2b-jpn-it-translateは、日英・英日翻訳タスクに特化したSLM（Small Language Model）です。パラメーター数は20億（2B）ですが、従来の70億（7B）モデルに迫るレベルの翻訳品質を提供します。ファイルサイズが約5GBと比較的小さいため、高速な実行が可能です。  

gemma-2-2b-jpn-it-translate is an SLM (Small Language Model) specialized for Japanese-English and English-Japanese translation tasks. Despite having only 2 billion parameters (2B), it provides translation quality approaching that of conventional 7 billion (7B) parameter models. With a relatively small file size of about 5GB, it enables fast execution.  

## モデル詳細 Model Details

### モデル説明 Model Description

このモデルは、Googleが公開した日本語専用モデル「gemma-2-2b-jpn-it」をファインチューニングしたものです。  
特徴として、高速、且つ無限長の文章の翻訳ができる事を目指しました。  

具体的には、最初にシステムプロンプト相当の文章（日本語/英語）を与えると、以降はユーザーが入力した文章を翻訳した文章（日本語/英語）を出力するようにトレーニングされています。  
また、apply_chat_templateを使用しているため、ミスを誘発しやすいプロンプトテンプレートの手書き作業が不要となっています。  

This model is fine-tuned from "gemma-2-2b-jpn-it", a Japanese-specific model released by Google.  
Our goal is to translate texts of unlimited length at high speed.  
Specifically, it is trained so that, once given an initial system-prompt-like text (Japanese/English), it outputs the translation (Japanese/English) of each subsequent user input.  
In addition, because it uses `apply_chat_template`, there is no need to hand-write prompt templates, a step that is prone to errors.
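
The conversation layout described above can be sketched without loading the model. The snippet below is only an illustration: `build_messages` is a hypothetical helper (it is not part of this model or of transformers) that assembles the message list which would then be handed to `tokenizer.apply_chat_template`. As in the sample scripts, the system-prompt-like text is sent as the first user turn and answered with a short assistant acknowledgement.

```python
def build_messages(system_text, history, sentence):
    """Return the chat messages for translating one sentence.

    system_text: instruction text sent as the first user turn
    history: list of (source, translation) pairs already translated
    sentence: the next sentence to translate
    """
    messages = [
        {"role": "user", "content": system_text},
        {"role": "assistant", "content": "OK"},
    ]
    for source, translation in history:
        messages.append({"role": "user", "content": source})
        messages.append({"role": "assistant", "content": translation})
    # The sentence to translate is always the final user turn.
    messages.append({"role": "user", "content": sentence})
    return messages


messages = build_messages(
    "Translate Japanese to English.",
    [("こんにちは。", "Hello.")],
    "私は田中です。",
)
for m in messages:
    print(m["role"], "|", m["content"])
```

Earlier sentence pairs are replayed as alternating user and assistant turns, so the model sees its own previous translations as context for the next sentence.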

### 日英翻訳用サンプルコード Japanese-English Translation sample script.

```
pip install -U transformers
```

```
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_torch_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        prop = torch.cuda.get_device_properties(device)
        # bfloat16 requires Ampere or newer (compute capability 8.0+); e.g. L4 supports it, T4 does not.
        if prop.major >= 8:
            return torch.bfloat16
    return torch.float16


model_name = "webbigdata/gemma-2-2b-jpn-it-translate"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=get_torch_dtype(),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token

system_prompt = "You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.\n\n"
instruct = """Translate Japanese to English.\nWhen translating, please use the following hints:\n[writing_style: casual]"""

# Split Japanese text into sentences
def split_sentences(text):
    sentences = []
    last = 0
    # Split after sentence-ending punctuation
    for match in re.finditer(r'[。！？…]', text):
        end = match.end()
        # Include any newlines that immediately follow the punctuation
        while end < len(text) and text[end] == '\n':
            end += 1
        sentence = text[last:end]
        sentences.append(sentence)
        last = end
    # Append any remaining text
    if last < len(text):
        remaining = text[last:]
        sentences.append(remaining)
    # Further split each sentence at embedded newlines
    final_sentences = []
    for s in sentences:
        if '\n' in s:
            parts = s.split('\n')
            for i, part in enumerate(parts):
                if part:
                    # Re-attach the newline unless this is the last part
                    if i < len(parts) - 1:
                        final_sentences.append(part + '\n')
                    else:
                        final_sentences.append(part)
                # Keep the newline itself as a separate element
                if i < len(parts) - 1:
                    final_sentences.append('\n')
        else:
            final_sentences.append(s)
    return final_sentences

# Translate a single sentence
def translate_sentence(sentence, previous_context):
    # Return blank input unchanged without calling the model
    if sentence.strip() == '':
        return sentence

    # Append the new sentence to the preceding conversation
    messages = previous_context + [
        {"role": "user", "content": sentence}
    ]

    # Build the prompt with apply_chat_template
    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    translation = ""
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs,
            num_beams=3, max_new_tokens=1200, do_sample=True, temperature=0.5, top_p=0.3
        )
        full_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        translation = full_output.split('\nmodel\n')[-1].strip()
    return translation

from collections import deque

# Main processing function
def main(text):
    sentences = split_sentences(text)
    translated_sentences = []

    # Initialize context with system prompt
    context = deque([
        {"role": "user", "content": system_prompt + instruct},
        {"role": "assistant", "content": "OK"}
    ], maxlen=6)  # Keep at most 6 messages (3 user/assistant pairs)

    for i, sentence in enumerate(sentences):
        # For the first sentence, use the full context including system prompt
        if i == 0:
            translation_context = list(context)
        else:
            # For subsequent sentences, drop the first two messages (initially the system prompt and its "OK" reply)
            translation_context = list(context)[2:]

        translated_sentence = translate_sentence(sentence, translation_context)
        translated_sentences.append(translated_sentence)

        # Add the new exchange to the context (blank sentences included)
        context.append({"role": "user", "content": sentence})
        context.append({"role": "assistant", "content": translated_sentence})

    return translated_sentences


text = """こんにちは。私は田中です。今日はとても良い天気ですね。朝ごはんはパンとコーヒーを食べました。そのあとに散歩に行きました。公園にはたくさんの人がいました。子供たちは遊んでいました。
犬を連れている人もいました。私はベンチに座って本を読みました。風がとても気持ちよかったです。その後、友達とカフェに行きました。
カフェではコーヒーを飲みながらおしゃべりをしました。友達は最近引っ越したばかりだと言いました。新しい家の写真を見せてくれました。
とてもきれいな家でした。時間が経つのがあっという間でした。夕方になり、私は家に帰りました。夕食にはカレーを作りました。カレーはとても美味しかったです。今日一日、とても楽しかったです。"""
translated = main(text)
print(translated)
```
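
The context handling in `main` relies on `collections.deque` with `maxlen=6`: once more than six messages have been appended, the oldest ones, starting with the system prompt pair, are dropped. The standalone sketch below uses placeholder strings in place of real sentences and translations (no model is loaded) to show which messages remain in the window after each step.

```python
from collections import deque

context = deque(
    [
        {"role": "user", "content": "<system prompt + instruct>"},
        {"role": "assistant", "content": "OK"},
    ],
    maxlen=6,
)

snapshots = []
for n in range(1, 5):
    # Placeholders standing in for a source sentence and its translation.
    context.append({"role": "user", "content": f"source {n}"})
    context.append({"role": "assistant", "content": f"translation {n}"})
    snapshots.append([m["content"] for m in context])

for n, contents in enumerate(snapshots, start=1):
    print(n, contents)
```

With `maxlen=6`, the system prompt pair is evicted when the third sentence pair is appended; from then on the window holds the three most recent pairs. Because `main` additionally skips the first two entries for every sentence after the first, at most two earlier sentence pairs accompany each new sentence, which keeps the prompt length bounded regardless of document length.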


### 英日翻訳用サンプルコード English-Japanese Translation sample script.

```
import re
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def get_torch_dtype():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        prop = torch.cuda.get_device_properties(device)
        # bfloat16 requires Ampere or newer (compute capability 8.0+); e.g. L4 supports it, T4 does not.
        if prop.major >= 8:
            return torch.bfloat16
    return torch.float16

model_name = "webbigdata/gemma-2-2b-jpn-it-translate"
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=get_torch_dtype(),
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = tokenizer.unk_token

system_prompt = "You are a highly skilled professional Japanese-English and English-Japanese translator. Translate the given text accurately, taking into account the context and specific instructions provided. Steps may include hints enclosed in square brackets [] with the key and value separated by a colon:. Only when the subject is specified in the Japanese sentence, the subject will be added when translating into English. If no additional instructions or context are provided, use your expertise to consider what the most appropriate context is and provide a natural translation that aligns with that context. When translating, strive to faithfully reflect the meaning and tone of the original text, pay attention to cultural nuances and differences in language usage, and ensure that the translation is grammatically correct and easy to read. After completing the translation, review it once more to check for errors or unnatural expressions. For technical terms and proper nouns, either leave them in the original language or use appropriate translations as necessary. Take a deep breath, calm down, and start translating.\n\n"
instruct = """Translate English to Japanese.\nWhen translating, please use the following hints:\n[writing_style: business]"""

# Function to split English sentences
def split_sentences(text):
    sentences = []
    # Split by newlines, periods, exclamation marks, question marks, or two or more consecutive spaces
    pattern = r'(?:\r?\n|\.|\!|\?|(?:\s{2,}))'
    splits = re.split(pattern, text)

    for split in splits:
        split = split.strip()
        if split:
            sentences.append(split)

    return sentences

# Function to translate a sentence
def translate_sentence(sentence, previous_context):
    if sentence.strip() == '':
        return sentence

    messages = previous_context + [
        {"role": "user", "content": sentence}
    ]

    inputs = tokenizer.apply_chat_template(
        messages,
        tokenize=True,
        add_generation_prompt=True,
        return_tensors="pt",
    ).to(model.device)

    translation = ""
    with torch.no_grad():
        generated_ids = model.generate(
            input_ids=inputs,
            num_beams=3, max_new_tokens=1200, do_sample=True, temperature=0.5, top_p=0.3
        )
        full_output = tokenizer.batch_decode(generated_ids, skip_special_tokens=True)[0].strip()
        translation = full_output.split('\nmodel\n')[-1].strip()
    return translation

from collections import deque

# Main processing function
def main(text):
    sentences = split_sentences(text)
    translated_sentences = []

    context = deque([
        {"role": "user", "content": system_prompt + instruct},
        {"role": "assistant", "content": "OK"}
    ], maxlen=6)

    for i, sentence in enumerate(sentences):
        if i == 0:
            translation_context = list(context)
        else:
            translation_context = list(context)[2:]

        translated_sentence = translate_sentence(sentence, translation_context)
        translated_sentences.append(translated_sentence)

        # Add the new exchange to the context
        context.append({"role": "user", "content": sentence})
        context.append({"role": "assistant", "content": translated_sentence})

    return translated_sentences

# Sample English text for translation (business context)
text = """Dear valued clients and partners,

I hope this email finds you well. I am writing to provide you with an important update regarding our company's recent developments and future plans.

Firstly, I am pleased to announce that our Q3 financial results have exceeded expectations, with a 15% increase in revenue compared to the same period last year. This success is largely attributed to the launch of our new product line and the expansion of our services into emerging markets.

In light of this growth, we are planning to implement several strategic initiatives in the coming months:

1. Expansion of our R&D department: We will be investing significantly in research and development to maintain our competitive edge in the market.

2. Sustainability efforts: We are committed to reducing our carbon footprint by 30% over the next five years. This includes transitioning to renewable energy sources and implementing eco-friendly practices across all our operations.

3. Digital transformation: We will be upgrading our IT infrastructure to enhance efficiency and provide better service to our clients.

Additionally, we are excited to announce our upcoming annual conference, which will be held virtually this year due to ongoing global health concerns. The conference will take place on November 15-16, 2024, and will feature keynote speeches from industry leaders, interactive workshops, and networking opportunities.

We value your continued support and partnership. If you have any questions or would like further information about any of these initiatives, please don't hesitate to reach out to your account manager or contact our customer support team.

Thank you for your trust in our company. We look forward to achieving new milestones together.

Best regards,
John Smith
CEO, XYZ Corporation"""

translated = main(text)
print(translated)
```

結果 Result
```
['貴社にご愛顧いただき、誠にありがとうございます。', 'このメールがご健在であることを心よりお祈り申し上げます。', '弊社の最近の進展と今後の計画について、重要なお知らせをご提供いたします。', 'まず、第3四半期の収益が予想を上回ったことをお知らせいたします。昨年の同時期と比較して、売上高が15％増加しました。', 'この成功は、新製品ラインの発売と、新興市場へのサービスの拡大が大きく貢献しています。', 'この成長を踏まえ、今後の数ヶ月にわたって、いくつかの戦略的イニシアティブを実施する予定です。', '1', 'R＆D部門の拡大：市場での競争力を維持するために、大幅に研究開発に投資する予定です。', '2', 'サステナビリティの取り組み：次の5年間で、炭素排出量を30%削減することを目指しています。', 'これは、再生可能エネルギー源への移行と、すべての事業活動における環境にやさしい実践の導入を含むものです。', '3', 'デジタルトランスフォーメーション: 私たちのITインフラを強化し、効率を向上させ、より良いサービスを提供する', 'さらに、今年は新型コロナウイルス感染症の懸念が続くため、オンラインで開催されますが、毎年恒例の年次カンファレンスをお知らせいたします。', 'カンファレンスは2024年11月15日～16日に開催され、業界のリーダーによるキーノートスピーチ、インタラクティブワークショップ、ネットワークングの機会が盛りだくさんです。', '引き続きご支援とご協力を賜りますようお願い申し上げます。', 'これらのイニシアチブについてご質問がある場合や、さらに詳しい情報をご希望の場合は、ご担当マネジャーにご連絡するか、弊社のカスタマーサポートチームにご連絡ください。', '弊社の信頼を賜り、誠にありがとうございます。', '共に新たな目標を達成できることを楽しみにしています。', 'ご清栄のこととお慶び申し上げます。', 'ジョン・スミス', 'XYZ株式会社のCEO']
```
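
The single-digit entries `'1'`, `'2'` and `'3'` in the result above arise because `split_sentences` splits at every period, including the one in list numbering such as `1.`. As one possible alternative, the sketch below keeps a leading list marker attached to its sentence and splits only where `.`, `!` or `?` is followed by whitespace. `split_english_sentences` is a hypothetical name, and this variant has not been evaluated with the model.

```python
import re


def split_english_sentences(text):
    sentences = []
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        # Keep a leading list marker such as "1." attached to its sentence.
        marker = re.match(r'\d+\.\s+', line)
        prefix = marker.group(0) if marker else ''
        body = line[len(prefix):]
        # Split after ., ! or ? only when whitespace follows,
        # so decimals such as "1.5" stay intact.
        parts = re.split(r'(?<=[.!?])\s+', body)
        parts[0] = prefix + parts[0]
        sentences.extend(p for p in parts if p)
    return sentences


sample = """1. Expansion of our R&D department: We will invest. Revenue grew 1.5%!

Best regards,"""
print(split_english_sentences(sample))
```

Unlike the original function, this version keeps the sentence-ending punctuation and does not split on runs of spaces. Abbreviations such as "Mr." would still be treated as sentence ends.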


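### 制限 Limitations

空行を渡すと翻訳をせずに元文をそのまま出力してしまう現象が確認されています。  

When blank lines are passed in, the model has been observed to output the source text as-is without translating it.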
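
Because blank lines can cause the source text to be echoed back untranslated, one caller-side workaround is to keep blank segments away from the model entirely: copy them straight to the output and leave them out of the conversation history. The sketch below is an untested assumption about how this could be done. `translate_segments` and `translate_fn` are hypothetical names, and an upper-casing stub stands in for the model call so the snippet runs without loading the model.

```python
def translate_segments(segments, translate_fn):
    """Translate non-blank segments; pass blank ones through untouched.

    translate_fn(segment, history) is a hypothetical callable standing in
    for the model call; history is a list of (source, translation) pairs.
    """
    results = []
    history = []
    for segment in segments:
        if segment.strip() == "":
            # Blank segments are neither sent to the model nor stored in history.
            results.append(segment)
            continue
        translation = translate_fn(segment, history)
        history.append((segment, translation))
        results.append(translation)
    return results


def fake_translate(segment, history):
    # Stub used only for this demonstration.
    return segment.upper()


output = translate_segments(["hello.", "\n", "", "bye."], fake_translate)
print(output)
```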
