---
license: cc-by-nc-sa-4.0
language:
- ja
base_model:
- llm-jp/llm-jp-3-13b
---

# Fine-tuned Japanese Instruction Model

This model is a fine-tuned version of **[llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b)**, trained on the **ichikara-instruction** dataset for **Japanese instruction-following tasks**.

---

## Model Information

### **Base Model**
- **Model**: [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b)  
- **Architecture**: Causal Language Model  
- **Parameters**: 13 billion  

### **Fine-tuning Dataset**
- **Dataset**: [ichikara-instruction](https://liat-aip.sakura.ne.jp/wp/llmのための日本語インストラクションデータ作成/)
- **Authors**: 関根聡, 安藤まや, 後藤美知子, 鈴木久美, 河原大輔, 井之上直也, 乾健太郎  
- **License**: [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

The dataset consists of Japanese instruction-response pairs designed for training **instruction-following** models.

関根聡, 安藤まや, 後藤美知子, 鈴木久美, 河原大輔, 井之上直也, 乾健太郎. ichikara-instruction: LLMのための日本語インストラクションデータの構築. 言語処理学会第30回年次大会(2024)

---

## Usage


### 1. Install Required Libraries

```python
# Notebook syntax (Jupyter/Colab); drop the leading "!" when running in a shell
!pip install -U bitsandbytes
!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets
!pip install -U peft
```

### 2. Load the Model and Libraries

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import PeftModel
import torch
from tqdm import tqdm
import json
import os
import re

# Hugging Face token: read from the HF_TOKEN environment variable, with a placeholder fallback
HF_TOKEN = os.environ.get("HF_TOKEN", "YOUR_HF_ACCESS_TOKEN")

# Model and adapter IDs
# base_model_id = "models/models--llm-jp--llm-jp-3-13b/snapshots/cd3823f4c1fcbb0ad2e2af46036ab1b0ca13192a"
base_model_id = "llm-jp/llm-jp-3-13b"  # Base model
adapter_id = "sasakipeter/llm-jp-3-13b-finetune"

# QLoRA (4-bit quantization) configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```
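As a rough guide to why 4-bit loading is used here, the sketch below estimates weight storage only. It assumes the nominal 13 billion parameter count stated above and ignores quantization constants, any layers left unquantized, the LoRA adapter, activations, and the KV cache, so real GPU usage will be higher. The helper name `weight_memory_gb` is hypothetical and not part of any library.

```python
def weight_memory_gb(num_params, bits_per_param):
    """Approximate weight storage in gigabytes (10**9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NUM_PARAMS = 13e9  # nominal parameter count (assumption taken from the model card)

# Compare 16-bit (bf16) storage with 4-bit (NF4) storage for the weights alone
print(f"bf16 weights : ~{weight_memory_gb(NUM_PARAMS, 16):.1f} GB")
print(f"4-bit weights: ~{weight_memory_gb(NUM_PARAMS, 4):.1f} GB")
```

Under these assumptions the weights shrink from about 26 GB to about 6.5 GB, which is what makes a single consumer or Colab-class GPU plausible for inference.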

### 3. Load the Base Model and LoRA Adapter

```python
# Load base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id, 
    trust_remote_code=True, 
    token=HF_TOKEN
)

# Integrate LoRA adapter into the base model
model = PeftModel.from_pretrained(model, adapter_id, token=HF_TOKEN)
model.config.use_cache = False
```

### 4. Perform Inference on [elyza-tasks-100](https://huggingface.co/datasets/elyza/ELYZA-tasks-100)

```python
# Load the dataset (a record may span several lines, so accumulate until a closing brace)
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r", encoding="utf-8") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""

# execute inference
results = []
for data in tqdm(datasets):

    input_text = data["input"]

    # Build the prompt on one line so the loop's indentation does not leak into it
    prompt = f"### 指示\n{input_text}\n### 回答\n"

    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(tokenized_input)

    with torch.no_grad():
        outputs = model.generate(
          tokenized_input,
          attention_mask=attention_mask,
          max_new_tokens=100,
          do_sample=False,
          repetition_penalty=1.2,
          pad_token_id=tokenizer.eos_token_id
        )[0]
    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)

    results.append({"task_id": data["task_id"], "input": input_text, "output": output})

# Name the output file after the adapter repository (the part after the last "/")
jsonl_id = re.sub(".*/", "", adapter_id)
with open(f"./{jsonl_id}-outputs-validation.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)  # ensure_ascii=False for handling non-ASCII characters
        f.write('\n')
```
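The loading loop above decides that a record is complete when the accumulated text ends with `}`. That check can misfire if a record with a nested object happens to break its line right after the inner closing brace. A minimal alternative sketch, using only the standard library's `json.JSONDecoder.raw_decode`, is shown below; the function name `iter_json_records` and the sample data are hypothetical and are not taken from the dataset.

```python
import json


def iter_json_records(text):
    """Yield each JSON value from text that contains back-to-back JSON objects."""
    decoder = json.JSONDecoder()
    pos = 0
    while pos < len(text):
        # raw_decode does not skip leading whitespace, so step over it manually
        while pos < len(text) and text[pos].isspace():
            pos += 1
        if pos >= len(text):
            break
        # raw_decode returns the parsed value and the index where it stopped
        record, pos = decoder.raw_decode(text, pos)
        yield record


# Hypothetical sample: the second record breaks its line right after a nested "}"
sample = (
    '{"task_id": 0, "input": "a"}\n'
    '{"task_id": 1, "meta": {"k": 1}\n'
    ', "input": "b"}\n'
)
records = list(iter_json_records(sample))
print(len(records), records[1]["input"])
```

To apply it to the file, read the whole content with `open(path, encoding="utf-8").read()` and pass the resulting string to `iter_json_records`.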

---

## License

This model is released under the **CC-BY-NC-SA 4.0** license.

- **Base Model**: [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b) (Apache License 2.0)
- **Fine-Tuning Dataset**: ichikara-instruction (CC-BY-NC-SA 4.0)

**Fine-tuned Model License**:  
Due to the Share-Alike (SA) condition of the ichikara-instruction dataset, the fine-tuned model is licensed under **CC-BY-NC-SA 4.0**.  
This means the model can only be used for **non-commercial purposes**, and any derivative works must adopt the same license.