base_model:
- mistralai/Mistral-Small-Instruct-2409
Mistral-Small-Instruct CTranslate2 Model
This repository contains a CTranslate2 version of the Mistral-Small-Instruct model. The conversion process involved AWQ quantization followed by CTranslate2 format conversion.
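The second step, converting an AWQ checkpoint to CTranslate2 format, can be sketched roughly as below. This is a minimal sketch and not necessarily the exact command used for this repository: the directory names are hypothetical, it assumes the `ct2-transformers-converter` CLI that ships with CTranslate2 is on the PATH, and it deliberately passes no `--quantization` flag on the assumption that the weights are already quantized by AWQ.

```python
import shutil
import subprocess


def build_convert_command(awq_model_dir, output_dir, copy_files=(), force=False):
    """Build the ct2-transformers-converter invocation as an argument list."""
    cmd = [
        "ct2-transformers-converter",
        "--model", awq_model_dir,
        "--output_dir", output_dir,
    ]
    if copy_files:
        # Copy tokenizer files so the tokenizer can be loaded from output_dir
        cmd += ["--copy_files", *copy_files]
    if force:
        cmd.append("--force")  # overwrite an existing output directory
    return cmd


def convert(awq_model_dir, output_dir, **kwargs):
    """Run the converter; raises if the CLI is missing or the conversion fails."""
    if shutil.which("ct2-transformers-converter") is None:
        raise RuntimeError("ct2-transformers-converter not found; install ctranslate2")
    subprocess.run(build_convert_command(awq_model_dir, output_dir, **kwargs), check=True)


# Hypothetical directory names, shown only to illustrate the resulting command
print(" ".join(build_convert_command(
    "Mistral-Small-Instruct-2409-AWQ",
    "Mistral-Small-Instruct-2409-AWQ-ct2",
    copy_files=("tokenizer.json", "tokenizer_config.json"),
)))
```

Copying the tokenizer files into the output directory matters for the sample script, which loads the tokenizer from the same folder as the converted model.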
Quantization Parameters
The following AWQ parameters were used:
zero_point=true
q_group_size=128
w_bit=4
version=gemv
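To make these parameters concrete, the pure-Python sketch below shows plain group-wise asymmetric quantization: `w_bit` sets the integer range, `q_group_size` sets how many weights share one scale, and `zero_point` means each group stores an integer offset that represents 0.0. It illustrates what the parameters control and is not AutoAWQ's implementation, which additionally rescales channels based on activations and packs the integers for its kernels. The function names are made up for this example, and the zero point is left unclamped for simplicity.

```python
def quantize_group(weights, w_bit=4):
    """Quantize one group of floats to unsigned w_bit integers with a zero point."""
    q_max = 2 ** w_bit - 1                    # 15 for 4-bit weights
    w_min, w_max = min(weights), max(weights)
    scale = (w_max - w_min) / q_max or 1.0    # fall back to 1.0 for constant groups
    zero = round(-w_min / scale)              # integer code that maps back to 0.0
    q = [min(max(round(w / scale) + zero, 0), q_max) for w in weights]
    return q, scale, zero


def dequantize_group(q, scale, zero):
    """Map integer codes back to approximate float weights."""
    return [(v - zero) * scale for v in q]


def quantize_row(row, q_group_size=128, w_bit=4):
    """Split a weight row into groups and quantize each group independently."""
    groups = [row[i:i + q_group_size] for i in range(0, len(row), q_group_size)]
    return [quantize_group(g, w_bit) for g in groups]


# A 256-element row yields two groups of 128, each with its own scale and zero point
row = [((i * 37) % 101 - 50) / 100.0 for i in range(256)]
for q, scale, zero in quantize_row(row):
    print(len(q), scale, zero)
```

Smaller groups track the local weight range more closely at the cost of storing more scales and zero points.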
Quantization Process
The quantization was performed using the AutoAWQ library. AutoAWQ supports two quantization approaches:
Without calibration data:
- Quick process (a few minutes)
- Uses standard quantization schema
- Suitable for general use cases
With calibration data:
- Longer process (3-4 hours on RTX 4090)
- Protects the weights that matter most for the calibration data from quantization error
- Slightly better performance for targeted tasks
Calibration Details
This model was quantized with calibration data. Specifically, the cosmopedia-100k dataset was used, which suits general question answering and instruction following.
Key parameters:
- max_calib_seq_len: 8192 (enables long-form responses)
- text_token_length: 2048 (minimum input token length during quantization)
These parameters do not change the model's architecture, but they tune the quantized model's behavior toward the input and output lengths and topic domains represented in the calibration data.
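A calibrated run along these lines might look like the sketch below. It is an assumption-laden outline rather than the exact script used here: the helper names and paths are hypothetical, the dataset ID `HuggingFaceTB/cosmopedia-100k` and its `text_token_length` column are assumed to match the dataset named above, and the keyword arguments to `quantize` may differ between AutoAWQ versions. The heavy imports are deferred so the filtering helper can be tried on its own.

```python
def select_calibration_texts(rows, min_token_length=2048):
    """Keep only texts whose recorded token length meets the minimum."""
    return [r["text"] for r in rows if r["text_token_length"] >= min_token_length]


def quantize_with_calibration(model_path, output_path):
    """Sketch of a calibrated AutoAWQ run; needs autoawq, datasets and a CUDA GPU."""
    # Imported lazily so select_calibration_texts works without these packages
    from awq import AutoAWQForCausalLM
    from datasets import load_dataset
    from transformers import AutoTokenizer

    # Same values as listed under Quantization Parameters
    quant_config = {"zero_point": True, "q_group_size": 128, "w_bit": 4, "version": "gemv"}

    data = load_dataset("HuggingFaceTB/cosmopedia-100k", split="train")
    calib_texts = select_calibration_texts(data, min_token_length=2048)

    model = AutoAWQForCausalLM.from_pretrained(model_path)
    tokenizer = AutoTokenizer.from_pretrained(model_path)

    # max_calib_seq_len caps the length of each calibration sample
    model.quantize(
        tokenizer,
        quant_config=quant_config,
        calib_data=calib_texts,
        max_calib_seq_len=8192,
    )
    model.save_quantized(output_path)
    tokenizer.save_pretrained(output_path)
```

Filtering on the dataset's own token-length column avoids tokenizing every sample just to discard the short ones.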
Requirements
torch 2.2.2
ctranslate2 4.4.0
- NOTE: The soon-to-be-released ctranslate2 4.5.0 will support torch versions greater than 2.2.2. These instructions will be updated when that occurs.
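Since the pinned versions matter, a quick check before loading the model can save a confusing failure later. The snippet below is a small sketch using only the standard library; the function names are illustrative, and it assumes that a local build suffix such as `+cu121` on the torch version should be ignored when comparing.

```python
from importlib import metadata

REQUIRED = {"torch": "2.2.2", "ctranslate2": "4.4.0"}


def base_version(version):
    """Drop a local build suffix such as '+cu121' from a version string."""
    return version.split("+", 1)[0]


def find_mismatches(required=REQUIRED, get_version=metadata.version):
    """Return {package: installed version or None} for packages that differ from the pins."""
    mismatches = {}
    for name, wanted in required.items():
        try:
            installed = base_version(get_version(name))
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches


for name, installed in find_mismatches().items():
    print(f"{name}: expected {REQUIRED[name]}, found {installed}")
```

An empty result means both packages match the versions listed above.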
Sample Script
import gc
import os

import ctranslate2
import torch
from transformers import AutoTokenizer

system_message = "You are a helpful person who answers questions."
user_message = "Hello, how are you today? I'd like you to write me a funny poem that is a parody of Milton's Paradise Lost if you are familiar with that famous epic poem?"

model_dir = r"D:\Scripts\bench_chat\models\mistralai--Mistral-Small-Instruct-2409-AWQ-ct2-awq" # uses ~13.8 GB


def build_prompt_mistral_small():
    # The system and user messages share a single [INST] block
    prompt = f"""<s>
[INST] {system_message}
{user_message}[/INST]"""
    return prompt


def main():
    model_name = os.path.basename(model_dir)
    beam_size = 1  # greedy decoding; also used in the log line below

    print(f"\033[32mLoading the model: {model_name}...\033[0m")

    # Leave a few cores free, but never use fewer than 4 threads
    intra_threads = max((os.cpu_count() or 4) - 4, 4)

    generator = ctranslate2.Generator(
        model_dir,
        device="cuda",
        # compute_type="int8_bfloat16", # NOTE...YOU DO NOT USE THIS AT ALL WHEN USING AWQ/CTRANSLATE2 MODELS
        intra_threads=intra_threads
    )
    tokenizer = AutoTokenizer.from_pretrained(model_dir, add_prefix_space=None)

    prompt = build_prompt_mistral_small()
    # CTranslate2 expects token strings rather than token IDs
    tokens = tokenizer.convert_ids_to_tokens(tokenizer.encode(prompt))

    print(f"\nRun 1 (Beam Size: {beam_size}):")
    results_batch = generator.generate_batch(
        [tokens],
        include_prompt_in_result=False,
        max_batch_size=4096,
        batch_type="tokens",
        beam_size=beam_size,
        num_hypotheses=1,
        max_length=512,
        sampling_temperature=0.0,
    )

    output = tokenizer.decode(results_batch[0].sequences_ids[0])
    print("\nGenerated response:")
    print(output)

    # Drop the references first, then collect and release cached GPU memory
    del generator
    del tokenizer
    gc.collect()
    torch.cuda.empty_cache()


if __name__ == "__main__":
    main()