QLoRA Instruction Tuned Models

| Paper | Code | Demo |

The QLoRA Instruction Tuned Models are open-source models obtained through 4-bit QLoRA tuning of LLaMA base models on various instruction tuning datasets. They are available in 7B, 13B, 33B, and 65B parameter sizes.

Note: The best performing chatbot models are named Guanaco and finetuned on OASST1. This model card is for the other models finetuned on other instruction tuning datasets.

⚠️ These models are purely intended for research purposes and could produce problematic outputs.

What are QLoRA Instruction Tuned Models and why use them?

Strong performance on MMLU following the QLoRA instruction tuning.
Replicable and efficient instruction tuning procedure that can be extended to new use cases. QLoRA training scripts are available in the QLoRA repo.
Rigorous comparison to 16-bit methods (both 16-bit full-finetuning and LoRA) in our paper demonstrates the effectiveness of 4-bit QLoRA finetuning.
Lightweight checkpoints which only contain adapter weights.

License and Intended Use

QLoRA Instruction Tuned adapter weights are available under Apache 2 license. Note the use of these adapter weights, requires access to the LLaMA model weighs and therefore should be used according to the LLaMA license.

Usage

Here is an example of how you would load Flan v2 7B in 4-bits:

import torch
from peft import PeftModel    
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_name = "huggyllama/llama-7b"
adapters_name = 'timdettmers/qlora-flan-7b'

model = AutoModelForCausalLM.from_pretrained(
    model_name,
    load_in_4bit=True,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory= {i: '24000MB' for i in range(torch.cuda.device_count())},
    quantization_config=BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_type='nf4'
    ),
)
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Inference can then be performed as usual with HF models as follows:

prompt = "Introduce yourself"
formatted_prompt = (
    f"A chat between a curious human and an artificial intelligence assistant."
    f"The assistant gives helpful, detailed, and polite answers to the user's questions.\n"
    f"### Human: {prompt} ### Assistant:"
)
inputs = tokenizer(formatted_prompt, return_tensors="pt").to("cuda:0")
outputs = model.generate(inputs=inputs.input_ids, max_new_tokens=20)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Expected output similar to the following:

A chat between a curious human and an artificial intelligence assistant. The assistant gives helpful, detailed, and polite answers to the user's questions.
### Human: Introduce yourself ### Assistant: I am an artificial intelligence assistant. I am here to help you with any questions you may have.

Current Inference Limitations

Currently, 4-bit inference is slow. We recommend loading in 16 bits if inference speed is a concern. We are actively working on releasing efficient 4-bit inference kernels.

Below is how you would load the model in 16 bits:

model_name = "huggyllama/llama-7b"
adapters_name = 'timdettmers/qlora-flan-7b'
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    max_memory= {i: '24000MB' for i in range(torch.cuda.device_count())},
)
model = PeftModel.from_pretrained(model, adapters_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

Model Card

Architecture: The models released here are LoRA adapters to be used on top of LLaMA models. They are added to all layers. For all model sizes, we use $r=64$.

Base Model: These models use LLaMA as base model with sizes 7B, 13B, 33B, 65B. LLaMA is a causal language model pretrained on a large corpus of text. See LLaMA paper for more details. Note that these models can inherit biases and limitations of the base model.

Finetuning Data: These models are finetuned on various instruction tuning datasets. The datasets used are: Alpaca, HH-RLHF, Unnatural Instr., Chip2, Longform, Self-Instruct, FLAN v2.

Languages: The different datasets cover different languages. We direct to the various papers and resources describing the datasets for more details.

Next, we describe Training and Evaluation details.

Training

QLoRA Instruction Tuned Models are the result of 4-bit QLoRA supervised finetuning on different instruction tuning datasets.

All models use NormalFloat4 datatype for the base model and LoRA adapters on all linear layers with BFloat16 as computation datatype. We set LoRA $r=64$, $\alpha=16$. We also use Adam beta2 of 0.999, max grad norm of 0.3 and LoRA dropout of 0.1 for models up to 13B and 0.05 for 33B and 65B models. For the finetuning process, we use constant learning rate schedule and paged AdamW optimizer.

Training hyperparameters

Parameters	Dataset	Batch size	LR	Steps	Source Length	Target Length
7B	All	16	2e-4	10000	384	128
7B	OASST1	16	2e-4	1875	-	512
7B	HH-RLHF	16	2e-4	10000	-	768
7B	Longform	16	2e-4	4000	512	1024
13B	All	16	2e-4	10000	384	128
13B	OASST1	16	2e-4	1875	-	512
13B	HH-RLHF	16	2e-4	10000	-	768
13B	Longform	16	2e-4	4000	512	1024
33B	All	32	1e-4	5000	384	128
33B	OASST1	16	1e-4	1875	-	512
33B	HH-RLHF	32	1e-4	5000	-	768
33B	Longform	32	1e-4	2343	512	1024
65B	All	64	1e-4	2500	384	128
65B	OASST1	16	1e-4	1875	-	512
65B	HH-RLHF	64	1e-4	2500	-	768
65B	Longform	32	1e-4	2343	512	1024

Evaluation

We use the MMLU benchmark to measure performance on a range of language understanding tasks. This is a multiple-choice benchmark covering 57 tasks including elementary mathematics, US history, computer science, law, and more. We report 5-shot test accuracy.

Dataset	7B	13B	33B	65B
LLaMA no tuning	35.1	46.9	57.8	63.4
Self-Instruct	36.4	33.3	53.0	56.7
Longform	32.1	43.2	56.6	59.7
Chip2	34.5	41.6	53.6	59.8
HH-RLHF	34.9	44.6	55.8	60.1
Unnatural Instruct	41.9	48.1	57.3	61.3
OASST1 (Guanaco)	36.6	46.4	57.0	62.2
Alpaca	38.8	47.8	57.3	62.5
FLAN v2	44.5	51.4	59.2	63.9

We evaluate the generative language capabilities through automated evaluations on the Vicuna benchmark. We report the score of the QLoRA Instruction Finetuned Models relative to the score obtained by ChatGPT. The rater in this case is GPT-4 which is tasked to assign a score out of 10 to both ChatGPT and the model outputs for each prompt. We report scores for models ranging 7B to 65B and compare them to both academic and commercial baselilnes.

Model / Dataset	Params	Model bits	Memory	ChatGPT vs Sys	Sys vs ChatGPT	Mean	95% CI
GPT-4	-	-	-	119.4%	110.1%	114.5%	2.6%
Bard	-	-	-	93.2%	96.4%	94.8%	4.1%
Guanaco	65B	4-bit	41 GB	96.7%	101.9%	99.3%	4.4%
Alpaca	65B	4-bit	41 GB	63.0%	77.9%	70.7%	4.3%
FLAN v2	65B	4-bit	41 GB	37.0%	59.6%	48.4%	4.6%
Guanaco	33B	4-bit	21 GB	96.5%	99.2%	97.8%	4.4%
Open Assistant	33B	16-bit	66 GB	73.4%	85.7%	78.1%	5.3%
Alpaca	33B	4-bit	21 GB	67.2%	79.7%	73.6%	4.2%
FLAN v2	33B	4-bit	21 GB	26.3%	49.7%	38.0%	3.9%
Vicuna	13B	16-bit	26 GB	91.2%	98.7%	94.9%	4.5%
Guanaco	13B	4-bit	10 GB	87.3%	93.4%	90.4%	5.2%
Alpaca	13B	4-bit	10 GB	63.8%	76.7%	69.4%	4.2%
HH-RLHF	13B	4-bit	10 GB	55.5%	69.1%	62.5%	4.7%
Unnatural Instr.	13B	4-bit	10 GB	50.6%	69.8%	60.5%	4.2%
Chip2	13B	4-bit	10 GB	49.2%	69.3%	59.5%	4.7%
Longform	13B	4-bit	10 GB	44.9%	62.0%	53.6%	5.2%
Self-Instruct	13B	4-bit	10 GB	38.0%	60.5%	49.1%	4.6%
FLAN v2	13B	4-bit	10 GB	32.4%	61.2%	47.0%	3.6%
Guanaco	7B	4-bit	5 GB	84.1%	89.8%	87.0%	5.4%
Alpaca	7B	4-bit	5 GB	57.3%	71.2%	64.4%	5.0%
FLAN v2	7B	4-bit	5 GB	33.3%	56.1%	44.8%	4.0%

Citation

@article{dettmers2023qlora,
  title={QLoRA: Efficient Finetuning of Quantized LLMs},
  author={Dettmers, Tim and Pagnoni, Artidoro and Holtzman, Ari and Zettlemoyer, Luke},
  journal={arXiv preprint arXiv:2305.14314},
  year={2023}
}