This is a 3bit AutoRound GPTQ version of Mistral-Large-Instruct-2407. This conversion used model-*.safetensors.
This quantized model needs at least ~ 50GB + context (~5GB) VRAM. I quantized it so that it could fit 64GB VRAM.
Quantization script (it takes around 520 GB RAM and A40 GPU 48GB around 20 hours to convert):
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model_name = "mistralai/Mistral-Large-Instruct-2407"
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float16)
tokenizer = AutoTokenizer.from_pretrained(model_name)
from auto_round import AutoRound
bits, group_size, sym = 3, 128, True
autoround = AutoRound(model, tokenizer, nsamples=256, iters=512, low_gpu_mem_usage=True, batch_size=4, bits=bits, group_size=group_size, sym=sym,
device='cuda')
autoround.quantize()
output_dir = "./Mistral-Large-Instruct-2407-3bit"
autoround.save_quantized(output_dir, format='auto_gptq', inplace=True)
Evals using lm-eval-harness.
example command:
# !pip install git+https://github.com/EleutherAI/lm-evaluation-harness.git auto-gptq optimum
m="VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft"
!lm_eval --model hf --model_args pretrained={m},dtype=auto --tasks wikitext --num_fewshot 0 --batch_size 1 --output_path ./eval/
hf (pretrained=MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 2
Tasks | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
wikitext | 2 | none | 0 | bits_per_byte | ↓ | 0.4103 | ± | N/A |
none | 0 | byte_perplexity | ↓ | 1.3290 | ± | N/A | ||
none | 0 | word_perplexity | ↓ | 4.5765 | ± | N/A |
vs 3bit VPTQ hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-256-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1
Tasks | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
wikitext | 2 | none | 0 | bits_per_byte | ↓ | 0.4017 | ± | N/A |
none | 0 | byte_perplexity | ↓ | 1.3211 | ± | N/A | ||
none | 0 | word_perplexity | ↓ | 4.4324 | ± | N/A |
vs 4bit GPTQ: hf (pretrained=ModelCloud/Mistral-Large-Instruct-2407-gptq-4bit,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1:
Tasks | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
wikitext | 2 | none | 0 | bits_per_byte | ↓ | 0.3536 | ± | N/A |
none | 0 | byte_perplexity | ↓ | 1.2777 | ± | N/A | ||
none | 0 | word_perplexity | ↓ | 3.7082 | ± | N/A |
vs 4bit VPTQ hf (pretrained=VPTQ-community/Mistral-Large-Instruct-2407-v8-k65536-65536-woft,dtype=auto), gen_kwargs: (None), limit: None, num_fewshot: 0, batch_size: 1
Tasks | Version | Filter | n-shot | Metric | Value | Stderr | ||
---|---|---|---|---|---|---|---|---|
wikitext | 2 | none | 0 | bits_per_byte | ↓ | 0.3415 | ± | N/A |
none | 0 | byte_perplexity | ↓ | 1.2671 | ± | N/A | ||
none | 0 | word_perplexity | ↓ | 3.5463 | ± | N/A |
vs exl2 4bpw (I think the tests are different)
Wikitext | C4 | FineWeb | Max VRAM | |
---|---|---|---|---|
EXL2 4.00 bpw | 2.885 | 6.484 | 6.246 | 60.07 GB |
- Downloads last month
- 56
Model tree for MLDataScientist/Mistral-Large-Instruct-2407-GPTQ-3bit
Base model
mistralai/Mistral-Large-Instruct-2407