aya-expanse-8b-awq-4bit

This repository contains a quantized version of the CohereForAI/aya-expanse-8b model using the AWQ method in 4-bit precision.
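To give a rough sense of what 4-bit precision means for weight storage, the sketch below compares 16-bit and 4-bit weights. It assumes a nominal 8 billion parameters (read off the "8b" in the model name, not an exact count) and ignores AWQ scales, zero points, and any layers left unquantized, so the real checkpoint will be somewhat larger than the 4-bit figure.

```python
def weight_bytes(num_params: int, bits_per_weight: int) -> int:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight // 8


# Assumption: nominal parameter count taken from the "8b" in the model name.
NOMINAL_PARAMS = 8_000_000_000

fp16_bytes = weight_bytes(NOMINAL_PARAMS, 16)
int4_bytes = weight_bytes(NOMINAL_PARAMS, 4)

print(f"16-bit weights: {fp16_bytes / 1e9:.1f} GB")
print(f" 4-bit weights: {int4_bytes / 1e9:.1f} GB")
print(f"Reduction factor: {fp16_bytes // int4_bytes}x")
```

This is a back-of-the-envelope estimate only; actual GPU memory use also depends on activations, the KV cache, and the inference kernels.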

Model Summary

How to Use the Quantized Model

1. Install the necessary packages

Before using the quantized model, please ensure your environment has:
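As a quick sanity check, the sketch below reports which packages are installed in the current environment. The package names (`torch`, `transformers`, `autoawq`) are assumptions inferred from the imports in the inference example and from the AWQ method; they are not a list confirmed by this model card, so adjust them to your setup.

```python
from importlib import metadata

# Assumed package names (not confirmed by the model card); adjust as needed.
ASSUMED_PACKAGES = ("torch", "transformers", "autoawq")


def report_installed(packages=ASSUMED_PACKAGES):
    """Return {package name: installed version, or None if missing}."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in report_installed().items():
        print(f"{name}: {version or 'NOT INSTALLED'}")
```

Any package reported as missing can then be installed with your usual package manager before moving on to inference.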

2. Run inference

Load and use the quantized model as shown below in Python:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, AwqConfig

# Set up device
device = torch.device('cuda:1')  # Change to the GPU index you want to use, e.g. 'cuda:0'

# Load model and tokenizer
model_name = "kevinbazira/aya-expanse-8b-awq-4bit"
tokenizer = AutoTokenizer.from_pretrained(model_name)
quantization_config = AwqConfig(version="exllama")
model = AutoModelForCausalLM.from_pretrained(
    model_name, 
    device_map={"": device.index},
    quantization_config=quantization_config
)

# Prepare input
# https://huggingface.co/docs/transformers/en/pad_truncation
input_text = "Add your prompt here."
inputs = tokenizer(input_text, return_tensors="pt", truncation=True, padding="max_length", max_length=64)
inputs = {key: value.to(device) for key, value in inputs.items()}

# Perform text generation 
# https://huggingface.co/docs/transformers/en/main_classes/text_generation
outputs = model.generate(
    **inputs,
    num_return_sequences=1,
    min_new_tokens=64,
    max_new_tokens=64,
    do_sample=False,
    use_cache=True,
    num_beams=1
)

# Decode and print the output
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

Benchmark Results

To evaluate the performance of the quantized model, we ran benchmarks using the Hugging Face Optimum Benchmark tool on an AMD MI200 GPU with ROCm 6.1. The results are below:

Unquantized Model Results:

[Image: Unquantized Model Results]

AWQ Quantized Model Results:

[Image: AWQ Quantized Model Results]

These results show that the AWQ quantized model is significantly faster during the critical inference stages (decode and per-token), which outweighs its higher latency during the load and prefill phases. For deployment scenarios where inference speed is paramount, you can preload the quantized model so that the load latency is paid only once, before any requests arrive.
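The trade-off between higher fixed costs (load, prefill) and lower per-token cost can be made concrete with a break-even calculation. The sketch below is illustrative only: the latency figures are hypothetical placeholders, not the benchmark numbers from this model card, and the function and field names are made up for this example.

```python
def total_latency_ms(profile, new_tokens, include_load=True):
    """Total latency: fixed costs plus per-token decode cost."""
    fixed = profile["prefill_ms"] + (profile["load_ms"] if include_load else 0)
    return fixed + profile["per_token_ms"] * new_tokens


def break_even_tokens(base, quant, include_load=True):
    """Smallest token count at which `quant` is at least as fast as `base`.

    Returns 0 if quant has no extra fixed cost, None if it never catches up.
    """
    fixed_gap = total_latency_ms(quant, 0, include_load) - total_latency_ms(base, 0, include_load)
    per_token_gain = base["per_token_ms"] - quant["per_token_ms"]
    if fixed_gap <= 0:
        return 0
    if per_token_gain <= 0:
        return None
    return -(-fixed_gap // per_token_gain)  # integer ceiling division


# Hypothetical numbers for illustration only (milliseconds).
UNQUANTIZED = {"load_ms": 10_000, "prefill_ms": 50, "per_token_ms": 30}
AWQ_4BIT = {"load_ms": 14_000, "prefill_ms": 100, "per_token_ms": 20}

print("Break-even incl. load:", break_even_tokens(UNQUANTIZED, AWQ_4BIT))
print("Break-even preloaded: ", break_even_tokens(UNQUANTIZED, AWQ_4BIT, include_load=False))
```

With these placeholder values, excluding the load time (as when the model is preloaded) moves the break-even point from hundreds of generated tokens down to a handful. Substitute your own measured latencies to see where the crossover falls for your workload.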

More Information

  • Original Model: For details about the original model's architecture, training dataset, and performance, please visit the CohereForAI aya-expanse-8b model card.
  • Support or inquiries: If you run into any issues or have questions about the quantized model, feel free to reach me via email: [email protected]. I'll be happy to help!