---
license: apache-2.0
language:
- ja
- en
base_model:
- llm-jp/llm-jp-3-13b
pipeline_tag: text-generation
library_name: transformers
---
# Enhanced LLM-JP Model with Extended Tokenizer and Chat Template

This is an enhanced version of [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b) with an extended tokenizer that adds special tokens for structured conversations and advanced prompting, together with a chat template.

![image/jpg](tengentoppa.jpg)

## Model Information

- Base Model: [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b)
- Added Features: Extended tokenizer with special tokens for structured conversations and chat template
- Vocabulary Size: Extended from the base model

## Special Tokens

### Basic Tokens
- UNK Token: `{token_config.unk_token}`
- BOS Token: `{token_config.bos_token}`
- EOS Token: `{token_config.eos_token}`
- PAD Token: `{token_config.pad_token}`
- CLS Token: `{token_config.cls_token}`
- SEP Token: `{token_config.sep_token}`
- MASK Token: `{token_config.mask_token}`

### Conversation Structure Tokens
- System: `{token_config.system_token}` and `{token_config.system_end_token}`
- User: `{token_config.user_token}` and `{token_config.user_end_token}`
- Assistant: `{token_config.assistant_token}` and `{token_config.assistant_end_token}`

### Reasoning Process Tokens
- Reasoning: `{token_config.reasoning_token}` and `{token_config.reasoning_end_token}`
- Solution: `{token_config.solution_token}` and `{token_config.solution_end_token}`
- Response: `{token_config.response_token}` and `{token_config.response_end_token}`

### Hint and Supplementary Information Tokens
- Hint: `{token_config.hint_token}` and `{token_config.hint_end_token}`
- Note: `{token_config.note_token}` and `{token_config.note_end_token}`
- Context: `{token_config.context_token}` and `{token_config.context_end_token}`
- Reference: `{token_config.reference_token}` and `{token_config.reference_end_token}`
- Example: `{token_config.example_token}` and `{token_config.example_end_token}`

### Control Tokens
- Important: `{token_config.important_token}` and `{token_config.important_end_token}`
- Warning: `{token_config.warning_token}` and `{token_config.warning_end_token}`
- Error: `{token_config.error_token}` and `{token_config.error_end_token}`

## Chat Template Usage

This model supports the following roles:
- system: the system prompt
- user: user input
- hint: hints and guidance
- reasoning: the reasoning process
- assistant: the assistant's response
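Because the role set is fixed, it can help to check a message list before passing it to the chat template. The following is a minimal sketch: the helper name `find_unsupported_roles` is hypothetical, and how the actual template treats an unknown role is not documented here, so this is only a defensive pre-check.

```python
# The five roles listed above; anything else is flagged.
SUPPORTED_ROLES = {"system", "user", "hint", "reasoning", "assistant"}


def find_unsupported_roles(messages):
    """Return (index, role) pairs for messages whose role is not supported."""
    return [
        (i, m.get("role"))
        for i, m in enumerate(messages)
        if m.get("role") not in SUPPORTED_ROLES
    ]


messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Solve 2x + 3 = 7."},
    {"role": "tool", "content": "calculator output"},  # not a supported role
]
problems = find_unsupported_roles(messages)
print(problems)  # [(2, 'tool')]
```

An empty result means every message uses one of the five supported roles.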

### Basic Usage:

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained("{model_name}")
tokenizer = AutoTokenizer.from_pretrained("{model_name}")

# Example of chat-style usage
messages = [
    {
        "role": "system",
        "content": "You are a kind and capable AI assistant."
    },
    {
        "role": "user",
        "content": "Please solve the following math problem: 2x + 3 = 7"
    },
    {
        "role": "hint",
        "content": "When solving an equation, first consider moving the constant terms to the other side."
    },
    {
        "role": "reasoning",
        "content": "To solve this equation, I will proceed in these steps:\n1. Subtract 3 from both sides\n2. Divide both sides by 2"
    },
    {
        "role": "assistant",
        "content": "x = 2 is the solution to the equation."
    }
]

# Format the messages with the chat template
prompt = tokenizer.apply_chat_template(messages, tokenize=False)
print("\nGenerated prompt:\n", prompt)

# Tokenize and run inference
inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True)
# do_sample=True is needed for temperature to take effect
outputs = model.generate(**inputs, max_length=2048, do_sample=True, temperature=0.7)
response = tokenizer.decode(outputs[0])
print("\nModel response:\n", response)
```

### Advanced Usage:

```python
# Assumes `tokenizer` has been loaded as in Basic Usage

# Use a custom system message
messages = [
    {
        "role": "system",
        "content": "You are a mathematics expert."
    },
    {
        "role": "user",
        "content": "Please solve the quadratic equation x² - 4x + 4 = 0."
    }
]

# Apply the template without adding a generation prompt
prompt = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=False
)

# Manually append a hint
prompt += "\n<|HINT|>Factoring may make this easy to solve.</|HINT|>"

# Manually append the reasoning process
prompt += "\n<|REASONING|>1. This expression resembles (x-2)²\n2. Expanding it gives exactly the same expression</|REASONING|>"

# Append the prompt for the assistant's response
prompt += "\n<|ASSISTANT|>"

# Continue processing as usual
inputs = tokenizer(prompt, return_tensors="pt", max_length=2048, truncation=True)
```
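When the decoded output keeps its special tokens (the default for `tokenizer.decode`), the tagged spans can be pulled back out with a regular expression. This is a minimal sketch: the tag shape `<|NAME|>…</|NAME|>` is assumed from the hint and reasoning strings above, and `extract_sections` is a hypothetical helper, not part of the model or of `transformers`.

```python
import re

# Assumed tag shape, inferred from strings such as <|HINT|>...</|HINT|>.
SECTION_RE = re.compile(r"<\|([A-Z_]+)\|>(.*?)</\|\1\|>", re.DOTALL)


def extract_sections(text):
    """Return (tag, content) pairs for each closed <|TAG|>...</|TAG|> span, in order."""
    return [
        (m.group(1).lower(), m.group(2).strip())
        for m in SECTION_RE.finditer(text)
    ]


sample = (
    "<|HINT|>Try factoring.</|HINT|>\n"
    "<|REASONING|>1. It looks like (x-2)^2\n2. Expanding confirms it</|REASONING|>\n"
    "<|ASSISTANT|>x = 2"
)
sections = extract_sections(sample)
for tag, content in sections:
    print(tag, "->", content)
```

Spans without a closing tag, such as the trailing `<|ASSISTANT|>` in the sample, are skipped by this pattern and would need separate handling.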

## Chat Template Specification

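The model's chat template includes the following elements:
- Five distinct roles (system, user, hint, reasoning, assistant)
- Special tokens corresponding to each role
- A default system message
- A flexible template structure

Features:
- Message order is preserved
- Each role is clearly distinguished
- The system message is optional
- Hints and reasoning can be added as needed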
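
To make these properties concrete, here is a rough pure-Python imitation of such a template. It is not the template shipped with the tokenizer: the tag strings follow the `<|HINT|>…</|HINT|>` pattern seen in Advanced Usage, the newline separator is an assumption, the default system message is left out because its text is not given here, and `render_chat` is a hypothetical name.

```python
def render_chat(messages, add_generation_prompt=False):
    """Render messages as <|ROLE|>content</|ROLE|> blocks, one per line, in input order."""
    parts = [
        f"<|{m['role'].upper()}|>{m['content']}</|{m['role'].upper()}|>"
        for m in messages
    ]
    prompt = "\n".join(parts)
    if add_generation_prompt:
        # Open an assistant block for the model to complete (assumed format).
        prompt += "\n<|ASSISTANT|>"
    return prompt


# No system message is supplied: the sketch treats it as optional.
msgs = [
    {"role": "user", "content": "Solve 2x + 3 = 7."},
    {"role": "hint", "content": "Subtract 3 from both sides first."},
]
print(render_chat(msgs, add_generation_prompt=True))
```

For the authoritative format, print the output of `tokenizer.apply_chat_template(messages, tokenize=False)` and compare it with this sketch.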

## Additional Notes

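### About the Tokenizer Extension
- Retains all functionality of the original tokenizer
- Extends functionality by adding new special tokens
- Supports structured conversations through the chat template

### Usage Notes
- Use special tokens only when necessary
- The chat template can be flexibly adjusted
- The system message can be customized to suit the context of the conversation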