# Uploaded model
- Developed by: kmagai
- License: apache-2.0
- Finetuned from model: llm-jp/llm-jp-3-13b
This model was trained 2x faster with Unsloth and Hugging Face's TRL library.
## JSONL Output Process
### Model Inference Setup

The model is loaded with 4-bit quantization (QLoRA-style NF4 via bitsandbytes):
```python
import json
import os

import torch
from tqdm import tqdm
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# Model repository and access token (the environment-variable lookup is an assumption;
# model_name and HF_TOKEN were left undefined in the original snippet)
model_name = "kmagai/llm-jp-3-13b-finetune-2"
HF_TOKEN = os.environ["HF_TOKEN"]

# QLoRA config for 4-bit quantization
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=False,
)

# Load model and tokenizer
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    quantization_config=bnb_config,
    device_map="auto",
    token=HF_TOKEN,
)
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True, token=HF_TOKEN)
```
### Input Data Processing

The script reads input data from a JSONL file (`elyza-tasks-100-TV_0.jsonl`). Each line contains a JSON object with task information:
```python
datasets = []
with open("./elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        # Accumulate lines until a complete JSON object is closed
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""
```
### Generation Process

For each input in the dataset, the script:
- Formats the prompt with the instruction template
- Tokenizes the input
- Generates a response using the model
- Decodes the output (stripping the prompt tokens)
- Creates a result object with `task_id`, `input`, and `output`
```python
results = []
for data in tqdm(datasets):
    input = data["input"]
    prompt = f"""### Instruction
{input}
### Response:
"""
    tokenized_input = tokenizer.encode(prompt, add_special_tokens=False, return_tensors="pt").to(model.device)
    with torch.no_grad():
        outputs = model.generate(
            tokenized_input,
            max_new_tokens=100,
            do_sample=False,
            repetition_penalty=1.2,
        )[0]
    # Slice off the prompt tokens so only the newly generated text is kept
    output = tokenizer.decode(outputs[tokenized_input.size(1):], skip_special_tokens=True)
    results.append({"task_id": data["task_id"], "input": input, "output": output})
```
### Generation Parameters

- `max_new_tokens=100`: Maximum number of new tokens to generate
- `do_sample=False`: Deterministic (greedy) generation; the same input always yields the same output
- `repetition_penalty=1.2`: Penalizes repetition in the generated text
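Because `do_sample=False` performs greedy decoding, repeated runs produce identical output. A sampled alternative is sketched below; the `temperature` and `top_p` values are assumptions, not settings taken from this card:

```python
# Hypothetical sampled decoding; temperature/top_p values are assumptions
with torch.no_grad():
    outputs = model.generate(
        tokenized_input,
        max_new_tokens=100,
        do_sample=True,        # sample from the token distribution instead of greedy decoding
        temperature=0.7,
        top_p=0.9,
        repetition_penalty=1.2,
    )[0]
```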
## Output Format

The generated responses are saved in a JSONL file with the following format:

```json
{"task_id": "task_1", "input": "input text", "output": "generated response"}
```
Required fields:
- `task_id`: Unique identifier for the task
- `output`: Response generated by the model

Optional fields:
- `input`: Input text (can be omitted in the submission)
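The card does not show the writing step itself; a minimal sketch, assuming the `results` list from the generation loop and an output file named `results.jsonl` (the file name is an assumption):

```python
# Write one JSON object per line; ensure_ascii=False preserves Japanese text
with open("results.jsonl", "w", encoding="utf-8") as f:
    for result in results:
        f.write(json.dumps(result, ensure_ascii=False) + "\n")
```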
## Training Data Format

The training data should be provided in JSONL (JSON Lines) format, where each line is a single JSON object containing the following fields:
```json
{
  "instruction": "Task instruction text",
  "input": "Input text (optional)",
  "output": "Expected output text"
}
```
### Fields Description

- `instruction`: Task instruction that tells the model what to do
- `input`: (Optional) Input text that provides specific context for the instruction
- `output`: Expected output that represents the ideal response
### Example

```
{"instruction": "以下の文章を要約してください。", "input": "人工知能(AI)は、人間の知能を模倣し、学習、推論、判断などを行うコンピュータシステムです。近年、機械学習や深層学習の発展により、画像認識、自然言語処理、ゲームなど様々な分野で人間に匹敵する、あるいは人間を超える性能を示しています。", "output": "AIは人間の知能を模倣するコンピュータシステムで、機械学習の発展により多くの分野で高い性能を示している。"}
{"instruction": "次の英文を日本語に翻訳してください。", "input": "Artificial Intelligence is transforming the way we live and work.", "output": "人工知能は私たちの生活と仕事の仕方を変革しています。"}
```