---
base_model: llm-jp/llm-jp-3-13b
tags:
- text-generation-inference
- transformers
- unsloth
- llama
- trl
license: apache-2.0
language:
- en
---

## Overview
A Japanese LLM fine-tuned from [LLM-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b) with LoRA (QLoRA), using [Unsloth](https://github.com/unslothai/unsloth) and Hugging Face [TRL](https://github.com/huggingface/trl) to speed up training. It was created and published as part of preparing a submission model for the competition in the Matsuo Lab Large Language Model Course 2024.


- Datasets:
  - Ichikara Instruction (several datasets combined)
  - elyza/ELYZA-tasks-100
  - izumi-lab/wikipedia-ja-20230720
    
[<img src="https://raw.githubusercontent.com/unslothai/unsloth/main/images/unsloth%20made%20with%20love.png" width="200"/>](https://github.com/unslothai/unsloth)


---


## Inference Environment Requirements

- Python 3.10 or later recommended
- GPU: 24 GB or more of VRAM (e.g., NVIDIA L4 / A5000)
- Required packages (example):
  - `transformers`
  - `torch`
  - `unsloth`
  - `bitsandbytes`
  - `accelerate`
  - `peft`

The packages can be installed in one go with a command like the following (adjust it for your environment):

```bash
pip install transformers torch unsloth bitsandbytes accelerate peft
```


On Google Colab, run the following commands in a notebook cell:
```bash
!pip uninstall unsloth -y
!pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"
```
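
After installing by either route, it can help to confirm that the packages listed above are actually present. The following is a minimal sketch using only the Python standard library; the helper name `installed_versions` is hypothetical, and the check only reports installed versions, it does not verify that they are mutually compatible.

```python
from importlib import metadata

# Distribution names taken from the package list above.
REQUIRED = ["transformers", "torch", "unsloth", "bitsandbytes", "accelerate", "peft"]


def installed_versions(packages):
    """Return {package: version string, or None if it is not installed}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


# Print a short report, one package per line.
for name, version in installed_versions(REQUIRED).items():
    print(f"{name:<14} {version or 'NOT INSTALLED'}")
```

Any package reported as `NOT INSTALLED` should be installed before moving on to the loading step.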


---

## Model Loading & Inference

### 1. Loading the model

```python
from unsloth import FastLanguageModel
import torch

# Example settings
max_seq_length = 2048
dtype = None         # None auto-detects the dtype (fp16 / bfloat16, depending on the GPU generation)
load_in_4bit = True  # Enable 4-bit quantization (saves memory)

HF_TOKEN = "your_token"  # Your Hugging Face access token

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name = "Toki-AI/llm-jp-3-13b-finetune-241202",
    max_seq_length = max_seq_length,
    dtype = dtype,
    load_in_4bit = load_in_4bit,
    token = HF_TOKEN,
)
```
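
Hard-coding an access token in a script or notebook makes it easy to leak by accident. As one possible alternative, the sketch below reads the token from an environment variable. The helper name `get_hf_token` is hypothetical, and the sketch assumes the token has been exported as `HF_TOKEN` beforehand.

```python
import os


def get_hf_token(env_var="HF_TOKEN"):
    """Return the token stored in `env_var`, or None if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or None


HF_TOKEN = get_hf_token()
if HF_TOKEN is None:
    print("HF_TOKEN is not set; export it before loading the model.")
```

The resulting `HF_TOKEN` can then be passed as `token = HF_TOKEN` in the `from_pretrained` call shown above.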

### 2. Example inference code

```python
from unsloth import FastLanguageModel
from tqdm import tqdm
import json

# Switch to inference mode
FastLanguageModel.for_inference(model)

# Example: load the tasks to run inference on from a JSONL file
datasets = []
with open("elyza-tasks-100-TV_0.jsonl", "r", encoding="utf-8") as f:
    for line in f:
        if line.strip():
            datasets.append(json.loads(line))

# Run inference
results = []
for dt in tqdm(datasets):
    input_text = dt["input"]
    # Example prompt (指示 = instruction, 回答 = answer)
    prompt = f"""### 指示
{input_text}
### 回答
"""

    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)
    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,
        repetition_penalty=1.2
    )

    # Extract the answer part from the decoded output
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### 回答')[-1]
    results.append({"task_id": dt["task_id"], "input": input_text, "output": prediction})

# Inspect the results (first 3 entries)
for res in results[:3]:
    print(res)
```
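
The loop above only keeps `results` in memory. If the outputs need to be saved, one option is to write them back out as JSONL, one record per line. The sketch below assumes that format; the helper name `save_jsonl`, the output file name, and the sample record are all hypothetical stand-ins.

```python
import json
import os
import tempfile


def save_jsonl(records, path):
    """Write one JSON object per line, keeping Japanese text unescaped."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")


# Stand-in for the `results` list built by the inference loop.
example_results = [{"task_id": 0, "input": "こんにちは", "output": "こんにちは！"}]

# Hypothetical output location; any writable path works.
out_path = os.path.join(tempfile.gettempdir(), "example-outputs.jsonl")
save_jsonl(example_results, out_path)
print(f"Wrote {len(example_results)} record(s) to {out_path}")
```

Passing `ensure_ascii=False` keeps the Japanese text readable in the file instead of escaping it as `\uXXXX` sequences.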

Note: adjust the inference parameters (`max_new_tokens`, `do_sample`, `repetition_penalty`, `temperature`, `top_p`, etc.) to suit your task.
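
One way to organize such adjustments is to keep named sets of generation arguments and unpack them into `model.generate`. The sketch below is illustrative only: the greedy set mirrors the values used in the inference loop, while the `temperature` and `top_p` values in the sampling set are assumptions rather than tuned recommendations, and the helper name `generation_kwargs` is hypothetical.

```python
# Greedy decoding, matching the settings used in the inference loop.
GREEDY = dict(max_new_tokens=512, use_cache=True, do_sample=False, repetition_penalty=1.2)

# Sampling variant; the temperature and top_p values are illustrative assumptions.
SAMPLING = dict(GREEDY, do_sample=True, temperature=0.7, top_p=0.9)


def generation_kwargs(creative=False):
    """Return a fresh copy of the keyword arguments for the chosen decoding style."""
    return dict(SAMPLING if creative else GREEDY)


print(generation_kwargs())
print(generation_kwargs(creative=True))
```

The chosen set would be used as `model.generate(**inputs, **generation_kwargs(creative=True))`. Note that `temperature` and `top_p` only take effect when `do_sample=True`.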

---

## License

This model is distributed under the [Apache License 2.0](./LICENSE).
Please also check the terms of use and license that apply to the base model, [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b).


---