---
license: cc-by-nc-sa-4.0
language:
- ja
base_model:
- llm-jp/llm-jp-3-13b
---

# Fine-tuned Japanese Instruction Model

This is a fine-tuned version of the base model **[llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b)**, trained on the **ichikara-instruction** dataset.  
The model is intended for **Japanese instruction-following tasks**.

---

## Model Information

### **Base Model**
- **Model**: [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b)  
- **Architecture**: Causal Language Model  
- **Parameters**: 13 billion  

### **Fine-tuning Dataset**
- **Dataset**: [ichikara-instruction](https://liat-aip.sakura.ne.jp/wp/llmのための日本語インストラクションデータ作成/)
- **Authors**: 関根聡, 安藤まや, 後藤美知子, 鈴木久美, 河原大輔, 井之上直也, 乾健太郎  
- **License**: [CC-BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/)

The dataset consists of Japanese instruction-response pairs created for **instruction-following tasks**.

関根聡, 安藤まや, 後藤美知子, 鈴木久美, 河原大輔, 井之上直也, 乾健太郎. ichikara-instruction: Construction of Japanese Instruction Data for LLMs. The 30th Annual Meeting of the Association for Natural Language Processing (2024).

---

## Usage


### 1. Install Required Libraries

```python
!pip install -U bitsandbytes
!pip install -U transformers
!pip install -U accelerate
!pip install -U datasets
!pip install -U peft
```
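
The 4-bit (QLoRA) setup below assumes a CUDA-capable GPU. Before loading the 13B model, a quick environment check can save time; this is a minimal sketch using only standard `torch` calls:

```python
import torch

# Confirm that a CUDA GPU is visible before attempting 4-bit loading
print("torch:", torch.__version__)
print("CUDA available:", torch.cuda.is_available())
if torch.cuda.is_available():
    print("GPU:", torch.cuda.get_device_name(0))
```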

### 2. Load the Model and Libraries

```python
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
)
from peft import PeftModel
import torch
from tqdm import tqdm
import json
import re

# Hugging Face Token (recommended to set via environment variable)
HF_TOKEN = "YOUR_HF_ACCESS_TOKEN"
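# Alternatively (assumption: the token was exported as the HF_TOKEN environment variable):
#   import os
#   HF_TOKEN = os.environ.get("HF_TOKEN", "")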

# Model and adapter IDs
# base_model_id = "models/models--llm-jp--llm-jp-3-13b/snapshots/cd3823f4c1fcbb0ad2e2af46036ab1b0ca13192a"
base_model_id = "llm-jp/llm-jp-3-13b"  # Base model
adapter_id = "sasakipeter/llm-jp-3-13b-finetune"

# QLoRA (4-bit quantization) configuration
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
```

### 3. Load the Base Model and LoRA Adapter

```python
# Load base model with 4-bit quantization
model = AutoModelForCausalLM.from_pretrained(
    base_model_id,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN
)

# Load tokenizer
tokenizer = AutoTokenizer.from_pretrained(
    base_model_id, 
    trust_remote_code=True, 
    token=HF_TOKEN
)

# Integrate LoRA adapter into the base model
model = PeftModel.from_pretrained(model, adapter_id, token=HF_TOKEN)
model.config.use_cache = False
```
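
Before running the full evaluation loop in the next step, a short single-prompt check can confirm that the adapter was applied correctly. This is only a sketch: the example instruction is illustrative, and the prompt format mirrors the one used in the inference loop below.

```python
# Minimal sanity check (illustrative instruction: "What is the highest mountain in Japan?")
test_prompt = "### 指示\n日本で一番高い山は何ですか?\n### 回答\n"
test_input = tokenizer.encode(
    test_prompt, add_special_tokens=False, return_tensors="pt"
).to(model.device)

with torch.no_grad():
    test_output = model.generate(
        test_input,
        attention_mask=torch.ones_like(test_input),
        max_new_tokens=50,
        do_sample=False,
        pad_token_id=tokenizer.eos_token_id,
    )[0]

print(tokenizer.decode(test_output[test_input.size(1):], skip_special_tokens=True))
```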

### 4. Perform Inference on [elyza-tasks-100](https://huggingface.co/datasets/elyza/ELYZA-tasks-100)

```python
# Load the evaluation tasks (each JSON object may span multiple lines)
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""

# Run inference on each task
results = []
for data in tqdm(datasets):

    input_text = data["input"]

    prompt = f"""### 指示
    {input_text}
    ### 回答
    """

    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    attention_mask = torch.ones_like(tokenized_input)

    with torch.no_grad():
        outputs = model.generate(
          tokenized_input,
          attention_mask=attention_mask,
          max_new_tokens=100,
          do_sample=False,
          repetition_penalty=1.2,
          pad_token_id=tokenizer.eos_token_id
        )[0]
    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)

    results.append({"task_id": data["task_id"], "input": input_text, "output": output})

jsonl_id = re.sub(".*/", "", adapter_id)
with open(f"./{jsonl_id}-outputs-validation.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)  # ensure_ascii=False for handling non-ASCII characters
        f.write('\n')
```
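
To spot-check the generated file, the output JSONL can be read back with the same `json` module (reusing `jsonl_id` from above):

```python
# Print the first saved result for a quick spot check
with open(f"./{jsonl_id}-outputs-validation.jsonl", "r", encoding="utf-8") as f:
    first_result = json.loads(f.readline())

print(first_result["task_id"])
print(first_result["output"])
```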

---

## License

This model is released under the **CC-BY-NC-SA 4.0** license.

- **Base Model**: [llm-jp/llm-jp-3-13b](https://huggingface.co/llm-jp/llm-jp-3-13b) (Apache License 2.0)
- **Fine-Tuning Dataset**: ichikara-instruction (CC-BY-NC-SA 4.0)

**This Model License**:  
Due to the Share-Alike (SA) condition of the ichikara-instruction dataset, the fine-tuned model is licensed under **CC-BY-NC-SA 4.0**.  
This means the model can only be used for **non-commercial purposes**, and any derivative works must adopt the same license.