Uploaded model

  • Developed by: UsernameJustAnother
  • License: apache-2.0
  • Finetuned from model : unsloth/Mistral-Nemo-Instruct-2407

I am a terrible liar. I came across another dataset I had to use, and this is the result. Still experimental, as I made these to teach myself the basics of fine-tuning, with notes extensively borrowed from https://huggingface.co/nothingiisreal/MN-12B-Celeste-V1.9

It is an RP finetune trained in ChatML format on 10,801 human-generated conversations of varying lengths, drawn from a variety of sources and curated by me.

The big difference from Celeste is the LoRA scaling factor. Celeste uses 8; I did several tests with this data before concluding I got lower training loss with 2.
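As a quick check on where the scaling factor of 2 comes from, here is a minimal sketch using the `r = 256`, `lora_alpha = 32` and `use_rslora = True` values from the settings on this card. The helper name `lora_scale` is made up for illustration and is not part of Unsloth or PEFT.

```python
import math

def lora_scale(lora_alpha: float, r: int, use_rslora: bool) -> float:
    """Multiplier applied to the low-rank update (hypothetical helper)."""
    if use_rslora:
        # Rank-stabilized LoRA divides by sqrt(r) instead of r.
        return lora_alpha / math.sqrt(r)
    return lora_alpha / r

print(lora_scale(32, 256, use_rslora=True))   # 2.0
print(lora_scale(32, 256, use_rslora=False))  # 0.125
```

With the same alpha and rank, turning rsLoRA off would drop the factor from 2 to 0.125, so the two settings are not interchangeable.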

Training took around 5 hours on a single Colab A100 (but I didn't do an eval loop). Neat that I could get it all to fit into 40GB of VRAM thanks to Unsloth.

It was trained with the following settings:

==((====))==  Unsloth - 2x faster free finetuning | Num GPUs = 1
   \\   /|    Num examples = 10,801 | Num Epochs = 2
O^O/ \_/ \    Batch size per device = 2 | Gradient Accumulation steps = 4
\        /    Total batch size = 8 | Total steps = 2,700
 "-____-"     Number of trainable parameters = 912,261,120

[ 14/2700 01:20 < 4:59:21, 0.15 it/s, Epoch 0.01/2] 
[2040/2040 3:35:30, Epoch 2/2] 
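The numbers in the Unsloth banner can be reproduced with a little arithmetic. This is a rough sketch, not the trainer's actual code: the rounding (leftover micro-batches are not counted as an extra update step) is an assumption chosen because it matches the 2,700 in the banner, and 0.15 it/s is the rate from the first progress line.

```python
import math

num_examples = 10_801
per_device_batch = 2
grad_accum = 4
num_gpus = 1
num_epochs = 2

# Effective batch size per optimizer update.
total_batch = per_device_batch * grad_accum * num_gpus

# Micro-batches per epoch, then optimizer updates per epoch.
batches_per_epoch = math.ceil(num_examples / (per_device_batch * num_gpus))
updates_per_epoch = batches_per_epoch // grad_accum  # assumed rounding, see above
total_steps = updates_per_epoch * num_epochs

# Rough wall-clock estimate at the reported 0.15 it/s.
est_hours = total_steps / 0.15 / 3600

print(total_batch, total_steps, round(est_hours, 1))  # 8 2700 5.0
```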

model = FastLanguageModel.get_peft_model(
    model,
    r = 256,
    target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
                      "gate_proj", "up_proj", "down_proj",],
    lora_alpha = 32,  #   32 / sqrt(256) gives a scaling factor of 2
    lora_dropout = 0, # Supports any, but = 0 is optimized
    bias = "none",    # Supports any, but = "none" is optimized
    # [NEW] "unsloth" uses 30% less VRAM, fits 2x larger batch sizes!
    use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
    random_state = 3407,
    use_rslora = True,  # setting the adapter scaling factor to lora_alpha/math.sqrt(r) instead of lora_alpha/r
    loftq_config = None, # And LoftQ
)
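The trainable parameter count in the banner follows from the rank and the target modules. In the sketch below, the layer shapes (hidden size 5120, 40 layers, 4096-wide query projection, 1024-wide key/value projections, 14336-wide MLP) are assumptions taken from the published Mistral-Nemo config, not from this card. Each LoRA adapter adds `r * (in_features + out_features)` weights.

```python
r = 256
num_layers = 40  # assumed from the base model config

# Assumed (in_features, out_features) for each targeted projection.
hidden, q_out, kv_out, mlp = 5120, 4096, 1024, 14336
shapes = {
    "q_proj": (hidden, q_out),
    "k_proj": (hidden, kv_out),
    "v_proj": (hidden, kv_out),
    "o_proj": (q_out, hidden),
    "gate_proj": (hidden, mlp),
    "up_proj": (hidden, mlp),
    "down_proj": (mlp, hidden),
}

# A is (r x in), B is (out x r), so each module adds r * (in + out) weights.
per_layer = sum(r * (n_in + n_out) for n_in, n_out in shapes.values())
total = per_layer * num_layers

print(f"{total:,}")  # 912,261,120
```

That works out to the same 912,261,120 the banner reports, or roughly 7.5% of the 12.2B base parameters.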

lr_scheduler_kwargs = {
    'min_lr': 0.0000024  # Adjust this value as needed
}

trainer = SFTTrainer(
    model = model,
    tokenizer = tokenizer,
    train_dataset = train_ds,
    compute_metrics = compute_metrics,
    dataset_text_field = "text",
    max_seq_length = max_seq_length,
    dataset_num_proc = 2,
    packing = False, # Can make training 5x faster for short sequences.
    args = TrainingArguments(
        per_device_train_batch_size = 2,
        per_device_eval_batch_size = 2, # defaults to 8!
        gradient_accumulation_steps = 4,
        warmup_steps = 5,
        num_train_epochs = 2,
        learning_rate = 8e-5,
        fp16 = not is_bfloat16_supported(),
        bf16 = is_bfloat16_supported(),
        fp16_full_eval = True, # stops eval from trying to use fp32
        eval_strategy = "no", # 'no', 'steps', 'epoch'. Don't use this without an eval dataset etc
        eval_steps = 1, # if eval_strategy is set to 'steps', evaluate every N steps.
        logging_steps = 1, # so eval and logging happen on the same schedule
        optim = "adamw_8bit", 
        weight_decay = 0.01,
        lr_scheduler_type = "cosine_with_min_lr", # linear, cosine, cosine_with_min_lr, default linear
        lr_scheduler_kwargs = lr_scheduler_kwargs, # needed for cosine_with_min_lr
        seed = 3407,
        output_dir = "outputs",
    ),
)
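To give a feel for what `cosine_with_min_lr` does with these settings, here is a plain-Python approximation of the learning rate curve: linear warmup for 5 steps to 8e-5, then cosine decay down to the `min_lr` floor of 2.4e-6. The function `lr_at` is a hypothetical helper for illustration, and the 2,700 total steps come from the banner; this is not the scheduler's source code.

```python
import math

def lr_at(step, total_steps=2700, warmup_steps=5,
          peak_lr=8e-5, min_lr=2.4e-6):
    """Approximate learning rate at a given optimizer step (illustrative only)."""
    if step < warmup_steps:
        # Linear warmup from 0 to the peak learning rate.
        return peak_lr * step / max(1, warmup_steps)
    # Fraction of the post-warmup schedule completed, in [0, 1].
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    cosine = 0.5 * (1.0 + math.cos(math.pi * progress))
    # Rescale so the curve bottoms out at min_lr instead of 0.
    min_rate = min_lr / peak_lr
    return peak_lr * (cosine * (1.0 - min_rate) + min_rate)

for step in (0, 5, 1350, 2700):
    print(step, f"{lr_at(step):.2e}")
```

The floor means the last steps still train at about 3% of the peak rate rather than decaying to zero.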

This Mistral model was trained 2x faster with Unsloth and Hugging Face's TRL library.

Safetensors · Model size: 12.2B params · Tensor type: BF16

Model tree for UsernameJustAnother/Nemo-12B-Marlin-v5: 5 merges, 2 quantizations.