LLMJP3-13B-IT2

Overview

LLMJP3-13B-IT2 is a fine-tuned language model built on the "llm-jp/llm-jp-3-13b" base model. It is optimized for Japanese text generation and understanding tasks, was trained with Unsloth for accelerated fine-tuning, and can be loaded in 4-bit for memory-efficient inference.

Key Features

  • Base Model: llm-jp/llm-jp-3-13b
  • Fine-tuned Dataset: DeL-TaiseiOzaki/Tengentoppa-sft-v1.0
  • Training Acceleration: Fine-tuned with Unsloth and Hugging Face's TRL library, roughly doubling training speed (see the training sketch in the Dataset section below).
  • Developer: tshyk
  • License: Apache-2.0

Dataset

The model was fine-tuned using the Tengentoppa-sft-v1.0 dataset, which is licensed under Creative Commons Attribution 4.0 International (CC BY 4.0).
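
The exact training script and hyperparameters have not been published. The following is a minimal sketch of how such a fine-tune is typically set up with Unsloth and TRL's SFTTrainer; the LoRA configuration, hyperparameters, and the dataset text field name are illustrative assumptions, and API details vary across TRL versions.

# Hypothetical fine-tuning sketch with Unsloth + TRL. All hyperparameters,
# the LoRA settings, and the dataset field name are illustrative assumptions.
from unsloth import FastLanguageModel
from trl import SFTTrainer
from transformers import TrainingArguments
from datasets import load_dataset

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name="llm-jp/llm-jp-3-13b",  # base model
    max_seq_length=2048,
    load_in_4bit=True,
)

# Attach LoRA adapters so only a small set of weights is trained
model = FastLanguageModel.get_peft_model(
    model,
    r=16,
    lora_alpha=16,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj",
                    "gate_proj", "up_proj", "down_proj"],
)

dataset = load_dataset("DeL-TaiseiOzaki/Tengentoppa-sft-v1.0", split="train")

trainer = SFTTrainer(
    model=model,
    tokenizer=tokenizer,
    train_dataset=dataset,
    dataset_text_field="text",  # assumed field name; adjust to the dataset schema
    max_seq_length=2048,
    args=TrainingArguments(
        per_device_train_batch_size=2,
        gradient_accumulation_steps=4,
        num_train_epochs=1,
        learning_rate=2e-4,
        logging_steps=10,
        output_dir="outputs",
    ),
)
trainer.train()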

License

This project is distributed under the Apache License 2.0. Please review the license terms before using the model.

How to Use

Installation and Setup

The Colab notebook below installs the dependencies, loads the model, and runs inference over a JSONL task file.

Colab Setup

# -*- coding: utf-8 -*-
"""myModel_Inference_Template_unsloth.ipynb"""

# Install dependencies: install Unsloth first to pull in its requirements,
# then replace it with the latest Colab build from GitHub
!pip install unsloth
!pip uninstall unsloth -y && pip install --upgrade --no-cache-dir "unsloth[colab-new] @ git+https://github.com/unslothai/unsloth.git"

from unsloth import FastLanguageModel
import torch
import json

model_name = "tshyk/llmjp3-13b-it2"

# Model configuration
max_seq_length = 2048  # maximum context length for inference
dtype = None           # None lets Unsloth auto-detect the best dtype for the GPU
load_in_4bit = True    # 4-bit quantization so the 13B model fits in Colab GPU memory

# Load the model and tokenizer
model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token="your_huggingface_token",  # replace with your Hugging Face access token
)
FastLanguageModel.for_inference(model)  # enable Unsloth's optimized inference mode

# Mount Google Drive to access the evaluation dataset
from google.colab import drive
drive.mount('/content/drive')

# Load the evaluation tasks (JSONL; a record may span multiple lines,
# so accumulate lines until a complete JSON object closes with "}")
datasets = []
with open("/content/drive/MyDrive/elyza100_assignment/elyza-tasks-100-TV_0.jsonl", "r") as f:
    item = ""
    for line in f:
        line = line.strip()
        item += line
        if item.endswith("}"):
            datasets.append(json.loads(item))
            item = ""

from tqdm import tqdm

# Run deterministic (greedy) inference over every task
results = []
for dt in tqdm(datasets):
    input_text = dt["input"]  # renamed from "input" to avoid shadowing the built-in

    # Prompt template used at fine-tuning time ("指示" = instruction, "回答" = answer)
    prompt = f"""### 指示
{input_text}
### 回答
"""

    inputs = tokenizer([prompt], return_tensors="pt").to(model.device)

    outputs = model.generate(
        **inputs,
        max_new_tokens=512,
        use_cache=True,
        do_sample=False,         # greedy decoding for reproducible outputs
        repetition_penalty=1.2,  # discourage repeated phrases
    )
    # Keep only the text generated after the response header
    prediction = tokenizer.decode(outputs[0], skip_special_tokens=True).split('\n### 回答')[-1]

    results.append({
        "task_id": dt["task_id"],
        "input": input_text,
        "output": prediction
    })

with open(f"/content/model_output.jsonl", 'w', encoding='utf-8') as f:
    for result in results:
        json.dump(result, f, ensure_ascii=False)
        f.write('\n')
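
To sanity-check the output file, the records can be read back line by line. A minimal sketch:

# Quick sanity check: read the results back and inspect the first record
with open("/content/model_output.jsonl", "r", encoding="utf-8") as f:
    records = [json.loads(line) for line in f]

print(f"{len(records)} records written")
print(records[0]["task_id"], records[0]["output"][:100])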

Notes

  • Replace your_huggingface_token with your Hugging Face access token (a safer pattern using Colab secrets is sketched below).
  • Ensure the dataset file is available on Google Drive, or adjust the path for your local environment.
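
Rather than hard-coding the token, you can store it in Colab's Secrets panel. A minimal sketch, assuming a secret named HF_TOKEN has been created and granted notebook access:

# Read the Hugging Face token from Colab's Secrets panel
# (assumes a secret named HF_TOKEN exists; the name is an assumption)
from google.colab import userdata

hf_token = userdata.get('HF_TOKEN')

model, tokenizer = FastLanguageModel.from_pretrained(
    model_name=model_name,
    max_seq_length=max_seq_length,
    dtype=dtype,
    load_in_4bit=load_in_4bit,
    token=hf_token,
)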

Citation

If you use this model, please cite as follows:

@misc{tshyk2024llmjp,
  author = {tshyk},
  title = {LLMJP3-13B-IT2},
  year = {2024},
  url = {https://huggingface.co/tshyk/llmjp3-13b-it2},
  note = {Fine-tuned using the Tengentoppa-sft-v1.0 dataset.}
}

For further inquiries or contributions, feel free to contact tshyk via Hugging Face.
