dnoever's picture
Update README.md
8fa651b
---
license: apache-2.0
language:
- en
pipeline_tag: text-generation
dtype: bfloat16
---
# Results:
T: 🟦
Model: CultriX/MistralTrix-v1 πŸ“‘
Average: 73.39
ARC: 72.27
HellaSwag: 88.33
MMLU: 65.24
TruthfulQA: 70.73
Winogrande: 80.98
GSM8K: 62.77
# Edit/Disclaimer:
Currently the #1 ranked 7B LLM on the LLM Leaderboards, converted with exl2 quantization
# Description:
Model: CultriX/MistralTrix-v1
MistralTrix-v1 is an zyh3826/GML-Mistral-merged-v1 model that has been further fine-tuned with Direct Preference Optimization (DPO) using Intel's dataset for neural-chat-7b-v3-1.
It surpasses the original model on several benchmarks (see results).
It is directly inspired by the RLHF process described by Intel/neural-chat-7b-v3-1's authors to improve performance.
# TRAINING SPECIFICATIONS
> LoRA configuration
peft_config = LoraConfig(
r=16,
lora_alpha=16,
lora_dropout=0.05,
bias="none",
task_type="CAUSAL_LM",
target_modules=['k_proj', 'gate_proj', 'v_proj', 'up_proj', 'q_proj', 'o_proj', 'down_proj']
)
> Model to fine-tune
model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
load_in_4bit=True
)
model.config.use_cache = False
> Reference model
ref_model = AutoModelForCausalLM.from_pretrained(
model_name,
torch_dtype=torch.float16,
load_in_4bit=True
)
> Training arguments
training_args = TrainingArguments(
per_device_train_batch_size=4,
gradient_accumulation_steps=4,
gradient_checkpointing=True,
learning_rate=5e-5,
lr_scheduler_type="cosine",
max_steps=200,
save_strategy="no",
logging_steps=1,
output_dir=new_model,
optim="paged_adamw_32bit",
warmup_steps=100,
bf16=True,
report_to="wandb",
)
> Create DPO trainer
dpo_trainer = DPOTrainer(
model,
ref_model,
args=training_args,
train_dataset=dataset,
tokenizer=tokenizer,
peft_config=peft_config,
beta=0.1,
max_prompt_length=1024,
max_length=1536,
)