metadata
base_model: mistralai/Mistral-7B-Instruct-v0.1
datasets:
- generator
- Anthropic/hh-rlhf
library_name: peft
license: apache-2.0
tags:
- trl
- sft
- generated_from_trainer
model-index:
- name: Mistral-7B-text-to-RLHF
  results: []
Mistral-7B-text-to-RLHF
This model is a fine-tuned version of mistralai/Mistral-7B-Instruct-v0.1, trained on the Anthropic/hh-rlhf dataset (listed together with `generator` in the metadata). It achieves the following results on the evaluation set:
- Loss: 0.7952
Model description
Human-in-the-Loop Fine-tuning of Mistral-7B for Enhanced Text Generation and Text-to-SQL
Training data
Full code: Fine-Tuning with Supervised Fine-Tuning (SFT), on GitHub
Evaluation data
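Human-in-the-Loop Fine-tuning of Mistral-7B for Enhanced Text Generation and Text-to-SQL
The script below loads the model as a sequence classifier with a single output and checks, for a few prompts, whether it scores the chosen answer above the rejected one.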
import torch
from accelerate import Accelerator
from transformers import AutoTokenizer, AutoModelForSequenceClassification, BitsAndBytesConfig
# Initialize the accelerator
accelerator = Accelerator()
# Model from my Hugging Face repository
model_id = 'frankmorales2020/Mistral-7B-text-to-RLHF'
# BitsAndBytesConfig int-4 config (if used for your reward model)
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
# Load the reward model and tokenizer
model = AutoModelForSequenceClassification.from_pretrained(
    model_id,
    num_labels=1,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    quantization_config=bnb_config,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, use_fast=True)
tokenizer.padding_side = "right"
model.config.pad_token_id = tokenizer.pad_token_id
# Test cases
test_cases = [
    ("What is the capital of France?", "Paris", "London"),
    ("Who painted the Mona Lisa?", "Leonardo da Vinci", "Michelangelo"),
    ("What is the largest planet in our solar system?", "Jupiter", "Mars"),
    ("What would you do if you saw someone drop their wallet?", "Pick it up and return it to them.", "Ignore it."),
    ("What color is the sky?", "Blue", "Green"),
    ("What is the chemical symbol for water?", "H2O", "CO2"),
    # Add more test cases here...
]
def evaluate_example(prompt, chosen, rejected):
    # Score both candidate answers in one batch: row 0 is chosen, row 1 is rejected
    inputs = tokenizer(
        [f"{prompt} {chosen}", f"{prompt} {rejected}"],
        return_tensors="pt",
        padding=True,
    ).to(accelerator.device)
    # Inference only, so gradients are not needed
    with torch.no_grad():
        outputs = model(**inputs)
    chosen_score = outputs.logits[0].item()
    rejected_score = outputs.logits[1].item()
    print(f"Chosen score: {chosen_score}, Rejected score: {rejected_score}")
    return chosen_score > rejected_score
correct_predictions = 0
total_reciprocal_rank = 0
for prompt, chosen, rejected in test_cases:
    print("\n")
    print(f"Prompt: {prompt}, Chosen: {chosen}, Rejected: {rejected}")
    print("\n")
    if evaluate_example(prompt, chosen, rejected):
        print("Test passed!")
        correct_predictions += 1
        total_reciprocal_rank += 1  # Chosen answer ranked first: reciprocal rank 1/1
    else:
        print("Test failed.")
        total_reciprocal_rank += 0.5  # Chosen answer ranked second of two: reciprocal rank 1/2
accuracy = correct_predictions / len(test_cases)
mrr = total_reciprocal_rank / len(test_cases)
print(f"\nOverall accuracy: {accuracy:.2f}")
print(f"Mean Reciprocal Rank (MRR): {mrr:.2f}")
print(f"Number of questions used for MRR calculation: {len(test_cases)}")
Prompt: What is the capital of France?, Chosen: Paris, Rejected: London
Chosen score: 3.890625, Rejected score: -15.375
Test passed!
Prompt: Who painted the Mona Lisa?, Chosen: Leonardo da Vinci, Rejected: Michelangelo
Chosen score: 6.0625, Rejected score: 4.1875
Test passed!
Prompt: What is the largest planet in our solar system?, Chosen: Jupiter, Rejected: Mars
Chosen score: 10.6875, Rejected score: 10.0625
Test passed!
Prompt: What would you do if you saw someone drop their wallet?, Chosen: Pick it up and return it to them., Rejected: Ignore it.
Chosen score: 3.140625, Rejected score: 0.13671875
Test passed!
Prompt: What color is the sky?, Chosen: Blue, Rejected: Green
Chosen score: 11.0625, Rejected score: 4.46875
Test passed!
Prompt: What is the chemical symbol for water?, Chosen: H2O, Rejected: CO2
Chosen score: 0.42578125, Rejected score: -0.68359375
Test passed!
Overall accuracy: 1.00
Mean Reciprocal Rank (MRR): 1.00
Number of questions used for MRR calculation: 6
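The pass/fail result hides how close some of these comparisons are. The sketch below recomputes the accuracy from the scores printed above and adds the score margin for each pair. It also converts each margin into a preference probability with a sigmoid, which assumes a Bradley-Terry reading of the scores; the card does not state how the scores should be interpreted, so treat that column as illustrative. The names `reported_scores`, `preference_probability`, and `summarize` are made up for this example.

```python
import math

# (prompt, chosen_score, rejected_score), copied from the output above.
reported_scores = [
    ("What is the capital of France?", 3.890625, -15.375),
    ("Who painted the Mona Lisa?", 6.0625, 4.1875),
    ("What is the largest planet in our solar system?", 10.6875, 10.0625),
    ("What would you do if you saw someone drop their wallet?", 3.140625, 0.13671875),
    ("What color is the sky?", 11.0625, 4.46875),
    ("What is the chemical symbol for water?", 0.42578125, -0.68359375),
]


def preference_probability(chosen_score, rejected_score):
    """Sigmoid of the score margin (Bradley-Terry reading; an assumption)."""
    return 1.0 / (1.0 + math.exp(-(chosen_score - rejected_score)))


def summarize(pairs):
    """Return pairwise accuracy and the per-pair score margins."""
    margins = [chosen - rejected for _, chosen, rejected in pairs]
    accuracy = sum(m > 0 for m in margins) / len(pairs)
    return {"accuracy": accuracy, "min_margin": min(margins), "margins": margins}


summary = summarize(reported_scores)
for (prompt, chosen, rejected), margin in zip(reported_scores, summary["margins"]):
    prob = preference_probability(chosen, rejected)
    print(f"{prompt}: margin={margin:.3f}, P(chosen preferred)={prob:.3f}")
print(f"Accuracy: {summary['accuracy']:.2f}, smallest margin: {summary['min_margin']:.3f}")
```

The smallest margin belongs to the Jupiter/Mars pair, so that comparison is the one most likely to flip under small changes to the model or prompt format.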
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 0.0002
- train_batch_size: 3
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 6
- optimizer: adamw_torch_fused with betas=(0.9, 0.999) and epsilon=1e-08 (no additional optimizer arguments)
- lr_scheduler_type: constant
- lr_scheduler_warmup_ratio: 0.03
- num_epochs: 3
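As a quick check on how these values relate, the sketch below recomputes the effective batch size and the total number of steps. It assumes a single training device and that each reported step is one optimizer update; neither is stated in this card, and `num_devices` is a hypothetical variable introduced for the example.

```python
# Values copied from the hyperparameter list above.
train_batch_size = 3              # per-device batch size
gradient_accumulation_steps = 2
num_epochs = 3

# Assumption: a single training device (not stated in the card).
num_devices = 1

total_train_batch_size = train_batch_size * gradient_accumulation_steps * num_devices
print(f"total_train_batch_size: {total_train_batch_size}")

# Steps per epoch as reported in the training results table of this card.
steps_per_epoch = 507

# Assumption: each reported step is one optimizer update, so this is only an
# approximate count of examples seen per epoch (the last batch may be partial).
approx_examples_per_epoch = steps_per_epoch * total_train_batch_size
total_steps = steps_per_epoch * num_epochs
print(f"approx. examples per epoch: {approx_examples_per_epoch}")
print(f"total optimizer steps: {total_steps}")
```

Under these assumptions the computed batch size matches the reported total_train_batch_size of 6, and the total step count matches the final step in the results table.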
Training results
Training Loss | Epoch | Step | Validation Loss |
---|---|---|---|
1.7876 | 1.0 | 507 | 0.9024 |
1.0272 | 2.0 | 1014 | 0.7952 |
0.638 | 3.0 | 1521 | 0.8579 |
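The table can be read programmatically to find the epoch with the lowest validation loss. The sketch below is a minimal example using the rows above; the variable names are illustrative only.

```python
# (training_loss, epoch, step, validation_loss), copied from the table above.
results = [
    (1.7876, 1.0, 507, 0.9024),
    (1.0272, 2.0, 1014, 0.7952),
    (0.638, 3.0, 1521, 0.8579),
]

# Pick the row with the lowest validation loss.
best = min(results, key=lambda row: row[3])
print(f"Lowest validation loss: {best[3]} at epoch {best[1]} (step {best[2]})")

# Gap between validation and training loss per epoch.
for train_loss, epoch, step, val_loss in results:
    print(f"epoch {epoch}: val - train = {val_loss - train_loss:+.4f}")
```

The lowest validation loss (0.7952, epoch 2) is the value reported as the evaluation loss at the top of this card. In epoch 3 the training loss keeps falling while the validation loss rises, which is often a sign that the model is starting to overfit, although three data points are not enough to be sure.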
Framework versions
- PEFT 0.13.2
- Transformers 4.46.1
- Pytorch 2.5.0+cu121
- Datasets 3.0.2
- Tokenizers 0.20.1