How to use

We write our prompts in the ChatML format.

With vLLM (recommended for much faster inference)

Install vLLM

  pip install vllm

from vllm import LLM, SamplingParams
model_name = "lightblue/jod"
llm = LLM(model=model_name)

SYSTEM_MESSAGE = "You are a helpful assistant."
def process_chat_history(next_user_msg, text_chat_history = []):
    prompt_text = "<|im_start|>system\n"
    prompt_text += SYSTEM_MESSAGE
    prompt_text += "<|im_end|>\n\n"

    for user_msg, ai_msg in text_chat_history:
        prompt_text += "<|im_start|>user\n"
        prompt_text += user_msg
        prompt_text += "<|im_end|>\n\n"
        prompt_text += "<|im_start|>assistant\n"
        prompt_text += ai_msg
        prompt_text += "<|im_end|>\n\n"

    prompt_text += "<|im_start|>user\n"
    prompt_text += next_user_msg
    prompt_text += "<|im_end|>\n\n"
    prompt_text += "<|im_start|>assistant\n"
    return prompt_text

user_prompt = "日本の一番高い山は？"
prompt = process_chat_history(user_prompt)
sampling_params = SamplingParams(temperature=0, max_tokens=528)
outputs = llm.generate(prompt, sampling_params)
bot_message = outputs[0].outputs[0].text.strip()
print(bot_message)

With Huggingface

from transformers import AutoTokenizer, AutoModelForCausalLM, pipeline

model_name = "lightblue/jod"

tokenizer = AutoTokenizer.from_pretrained(model_dir)
model = AutoModelForCausalLM.from_pretrained(
    model_dir, torch_dtype=torch.bfloat16, device_map='auto', load_in_4bit=True,
)

pipe = pipeline("text-generation", model=model, tokenizer=tokenizer)

SYSTEM_MESSAGE = "You are a helpful assistant."
def process_chat_history(next_user_msg, text_chat_history = []):
    prompt_text = "<|im_start|>system\n"
    prompt_text += SYSTEM_MESSAGE
    prompt_text += "<|im_end|>\n\n"

    for user_msg, ai_msg in text_chat_history:
        prompt_text += "<|im_start|>user\n"
        prompt_text += user_msg
        prompt_text += "<|im_end|>\n\n"
        prompt_text += "<|im_start|>assistant\n"
        prompt_text += ai_msg
        prompt_text += "<|im_end|>\n\n"

    prompt_text += "<|im_start|>user\n"
    prompt_text += next_user_msg
    prompt_text += "<|im_end|>\n\n"
    prompt_text += "<|im_start|>assistant\n"
    return prompt_text

user_prompt = "日本の一番高い山は？"
prompt = process_chat_history(user_prompt)
bot_message = pipe(do_closed_qa(test_article, question), max_new_tokens=128, temperature=0)[0]["generated_text"]
print(bot_message)

Training details

We trained on the following 3 datasets:

using the (Open-Orca/Mistral-7B-SlimOrca) model as our base checkpoint.

This model was trained using the ChatML format, so it should be used for inference using the ChatML chatbot format. We chose this format as the base model (Open-Orca/Mistral-7B-SlimOrca) was trained with this format, and we find the chatbot format more compelling for practical use compared to the Alpaca style instruction format.

We trained for 1 epoch using the following Axolotl config. (Early stopping was not performed during our training.)

Axolotl config .yaml

base_model: Open-Orca/Mistral-7B-SlimOrca
base_model_config: Open-Orca/Mistral-7B-SlimOrca
model_type: MistralForCausalLM
tokenizer_type: LlamaTokenizer
is_mistral_derived_model: true

load_in_8bit: false
load_in_4bit: true
strict: false

datasets:
- path: ./data/jaster_plus.jsonl
  ds_type: json # see other options below
  type: sharegpt
  conversation: chatml
dataset_prepared_path: false
val_set_size: 0.002
output_dir: ./train_output/openorca-mistral-jaster-1epoch

use_wandb: true
wandb_project: \<HIDDEN\>
wandb_entity: \<HIDDEN\>

debug: 

adapter: qlora
lora_model_dir:

sequence_len: 4096
sample_packing: true
pad_to_sequence_len: true

lora_r: 32
lora_alpha: 16
lora_dropout: 0.05
lora_target_linear: true
lora_fan_in_fan_out:
lora_target_modules:
- gate_proj
- down_proj
- up_proj
- q_proj
- v_proj
- k_proj
- o_proj

gradient_accumulation_steps: 1
micro_batch_size: 10
eval_batch_size: 4
num_epochs: 1
optimizer: adamw_bnb_8bit
lr_scheduler: cosine
learning_rate: 0.0002

train_on_inputs: false
group_by_length: false
bf16: true
fp16: false
tf32: false

gradient_checkpointing: true
early_stopping_patience: 10
resume_from_checkpoint:
local_rank:
logging_steps: 1
xformers_attention:
flash_attention: true

warmup_steps: 10
eval_steps: 10
eval_table_size: 5
eval_table_max_new_tokens: 128
save_steps: 10
debug:
deepspeed:
weight_decay: 0.0
fsdp:
fsdp_config:
special_tokens:
bos_token: "<s>"
eos_token: "</s>"
unk_token: "<unk>"

lightblue
/

jod

How to use

With vLLM (recommended for much faster inference)

With Huggingface

Training details

Datasets used to train lightblue/jod