AWS Trainium & Inferentia documentation

Deploy Mixtral 8x7B on AWS Inferentia2

Hugging Face's logo
Join the Hugging Face community

and get access to the augmented documentation experience

to get started

Deploy Mixtral 8x7B on AWS Inferentia2

Mixtral 8x7B is an open-source LLM from Mistral AI. It is a Sparse Mixture of Experts model with an architecture similar to Mistral 7B, but it comes with a twist: it is actually 8 “expert” models in one. If you want to learn more about MoEs, check out Mixture of Experts Explained.

In this tutorial you will learn how to deploy the mistralai/Mixtral-8x7B-Instruct-v0.1 model on AWS Inferentia2 with Hugging Face Optimum Neuron on Amazon SageMaker. We are going to use the Hugging Face TGI Neuron Container, a purpose-built inference container powered by Text Generation Inference and Optimum Neuron that makes it easy to deploy LLMs on AWS Inferentia2.

We will cover how to:

  1. Set up a development environment
  2. Retrieve the latest Hugging Face TGI Neuron DLC
  3. Deploy Mixtral 8x7B to Inferentia2
  4. Clean up

Let's get started! 🚀

AWS Inferentia2 (Inf2) instances are purpose-built EC2 instances for deep learning (DL) inference workloads. Here are the different instance sizes of the Inferentia2 family:

| Instance size | Accelerators | Neuron Cores | Accelerator memory (GB) | vCPUs | CPU memory (GB) | On-demand price ($/h) |
|---|---|---|---|---|---|---|
| inf2.xlarge | 1 | 2 | 32 | 4 | 16 | 0.76 |
| inf2.8xlarge | 1 | 2 | 32 | 32 | 128 | 1.97 |
| inf2.24xlarge | 6 | 12 | 192 | 96 | 384 | 6.49 |
| inf2.48xlarge | 12 | 24 | 384 | 192 | 768 | 12.98 |

1. Set up the development environment

For this tutorial, we are going to use a Notebook Instance in Amazon SageMaker with the Python 3 (ipykernel) kernel and the sagemaker Python SDK to deploy Mixtral 8x7B to a SageMaker inference endpoint.

Make sure you have the latest version of the SageMaker SDK installed.

!pip install sagemaker --upgrade --quiet

Then, instantiate the sagemaker role and session.

import sagemaker
import boto3

sess = sagemaker.Session()
# sagemaker session bucket -> used for uploading data, models and logs
# sagemaker will automatically create this bucket if it does not exist
sagemaker_session_bucket = None
if sagemaker_session_bucket is None and sess is not None:
    # set to default bucket if a bucket name is not given
    sagemaker_session_bucket = sess.default_bucket()

try:
    role = sagemaker.get_execution_role()
except ValueError:
    iam = boto3.client("iam")
    role = iam.get_role(RoleName="sagemaker_execution_role")["Role"]["Arn"]

sess = sagemaker.Session(default_bucket=sagemaker_session_bucket)

print(f"sagemaker role arn: {role}")
print(f"sagemaker session region: {sess.boto_region_name}")

2. Retrieve the latest Hugging Face TGI Neuron DLC

The latest Hugging Face TGI Neuron DLCs can be used to run inference on AWS Inferentia2. You can use the get_huggingface_llm_image_uri method of the sagemaker SDK to retrieve the appropriate Hugging Face TGI Neuron DLC URI based on your desired backend, session, region, and version. If the latest container version has not yet been added to the SageMaker SDK, you can find it here.

from sagemaker.huggingface import get_huggingface_llm_image_uri

# retrieve the llm image uri
llm_image = get_huggingface_llm_image_uri(
    "huggingface-neuronx",
)

# print ecr image uri
print(f"llm image uri: {llm_image}")

3. Deploy Mixtral 8x7B to Inferentia2

At the time of writing, AWS Inferentia2 does not support dynamic shapes for inference, which means that we need to specify our sequence length and batch size ahead of time. To make it easier for customers to utilize the full power of Inferentia2, we created a neuron model cache, which contains pre-compiled configurations for the most popular LLMs, including Mixtral 8x7B.

This means we don’t need to compile the model ourselves, but we can use the pre-compiled model from the cache. You can find compiled/cached configurations on the Hugging Face Hub. If your desired configuration is not yet cached, you can compile it yourself using the Optimum CLI or open a request at the Cache repository.

Let’s check the different configurations available in the cache. To do so, you first need to log in to the Hugging Face Hub using a User Access Token with read access.

Make sure you have the necessary permissions to access the model. You can request access to the model here.

from huggingface_hub import notebook_login

notebook_login()

Then, we need to install the latest version of Optimum Neuron.

!pip install optimum-neuron --upgrade --quiet

Finally, we can query the cache and retrieve the existing set of configurations for which a compiled version of the model is maintained.

HF_MODEL_ID = "mistralai/Mixtral-8x7B-Instruct-v0.1"

!optimum-cli neuron cache lookup $HF_MODEL_ID

You should retrieve two entries in the cache:

*** 2 entrie(s) found in cache for mistralai/Mixtral-8x7B-Instruct-v0.1 for inference.***

auto_cast_type: bf16
batch_size: 1
checkpoint_id: mistralai/Mixtral-8x7B-Instruct-v0.1
checkpoint_revision: 41bd4c9e7e4fb318ca40e721131d4933966c2cc1
compiler_type: neuronx-cc
compiler_version: 2.16.372.0+4a9b2326
num_cores: 24
sequence_length: 4096
task: text-generation

auto_cast_type: bf16
batch_size: 4
checkpoint_id: mistralai/Mixtral-8x7B-Instruct-v0.1
checkpoint_revision: 41bd4c9e7e4fb318ca40e721131d4933966c2cc1
compiler_type: neuronx-cc
compiler_version: 2.16.372.0+4a9b2326
num_cores: 24
sequence_length: 4096
task: text-generation
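If the configuration you need is not among these cached entries, you can compile the model yourself with the Optimum CLI before deploying. Below is a minimal sketch of such an export, reusing the parameters of the cached configurations above; the output directory name is just an example, the exact flags may vary with your Optimum Neuron version (check optimum-cli export neuron --help), and compiling Mixtral 8x7B requires an Inferentia2 instance with enough Neuron Cores (e.g. inf2.48xlarge).

!optimum-cli export neuron --model mistralai/Mixtral-8x7B-Instruct-v0.1 --batch_size 4 --sequence_length 4096 --num_cores 24 --auto_cast_type bf16 ./mixtral-8x7b-neuron/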

Deploying Mixtral 8x7B to a SageMaker Endpoint

Before deploying the model to Amazon SageMaker, we must define the TGI Neuron endpoint configuration. We need to make sure the following additional parameters are defined:

  • HF_NUM_CORES: Number of Neuron Cores used for the compilation.
  • HF_BATCH_SIZE: The batch size that was used to compile the model.
  • HF_SEQUENCE_LENGTH: The sequence length that was used to compile the model.
  • HF_AUTO_CAST_TYPE: The auto cast type that was used to compile the model.

We also need to define the traditional TGI parameters:

  • HF_MODEL_ID: The Hugging Face model ID.
  • HF_TOKEN: The Hugging Face API token to access gated models.
  • MAX_BATCH_SIZE: The maximum batch size that the model can handle, equal to the batch size used for compilation.
  • MAX_INPUT_TOKENS: The maximum input length that the model can handle.
  • MAX_TOTAL_TOKENS: The maximum number of total tokens (input + generated) that the model can handle, equal to the sequence length used for compilation.

Optionally, you can configure the endpoint to support chat templates:

  • MESSAGES_API_ENABLED: Enable the Messages API

Select the right instance type

Mixtral 8x7B is a large model and requires a lot of memory. We are going to use the inf2.48xlarge instance type, which has 192 vCPUs and 384 GB of accelerator memory. The inf2.48xlarge instance comes with 12 Inferentia2 accelerators that include 24 Neuron Cores. In our case we will use a batch size of 4 and a sequence length of 4096.

After that we can create our endpoint configuration and deploy the model to Amazon SageMaker. We will deploy the endpoint with the Messages API enabled, so that it is fully compatible with the OpenAI Chat Completion API.

from sagemaker.huggingface import HuggingFaceModel

# sagemaker config
instance_type = "ml.inf2.48xlarge"
health_check_timeout = 2400  # additional time to load the model
volume_size = 512  # size in GB of the EBS volume

# Define Model and Endpoint configuration parameter
config = {
    "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",
    "HF_NUM_CORES": "24",  # number of neuron cores used for compilation
    "HF_BATCH_SIZE": "4",  # batch size used for compilation
    "HF_SEQUENCE_LENGTH": "4096",  # sequence length used for compilation
    "HF_AUTO_CAST_TYPE": "bf16",  # dtype used for compilation
    "MAX_BATCH_SIZE": "4",  # max batch size for the model, equal to HF_BATCH_SIZE
    "MAX_INPUT_TOKENS": "4000",  # max length of the input text
    "MAX_TOTAL_TOKENS": "4096",  # max total tokens (input + output), equal to HF_SEQUENCE_LENGTH
    "MESSAGES_API_ENABLED": "true",  # enable the Messages API
    "HF_TOKEN": "<REPLACE WITH YOUR TOKEN>",
}

assert (
    config["HF_TOKEN"] != "<REPLACE WITH YOUR TOKEN>"
), "Please replace '<REPLACE WITH YOUR TOKEN>' with your Hugging Face Hub API token"


# create HuggingFaceModel with the image uri
llm_model = HuggingFaceModel(role=role, image_uri=llm_image, env=config)

After we have created the HuggingFaceModel, we can deploy it to Amazon SageMaker using the deploy method. We will deploy the model with the ml.inf2.48xlarge instance type, and TGI will automatically distribute and shard the model across all Inferentia devices.

# Deploy model to an endpoint
# https://sagemaker.readthedocs.io/en/stable/api/inference/model.html#sagemaker.model.Model.deploy
llm_model._is_compiled_model = True

llm = llm_model.deploy(
    initial_instance_count=1,
    instance_type=instance_type,
    container_startup_health_check_timeout=health_check_timeout,
    volume_size=volume_size,
)

SageMaker will now create our endpoint and deploy the model to it. Deployment takes around 15 minutes.
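
The deploy call blocks until the endpoint is in service. If your notebook kernel is interrupted in the meantime, you do not need to redeploy: you can reattach a predictor to the existing endpoint by name. Below is a minimal sketch, assuming the endpoint name shown in the SageMaker console (the name used here is a placeholder).

from sagemaker.huggingface import HuggingFacePredictor

# Reattach to an already-running endpoint instead of redeploying
llm = HuggingFacePredictor(
    endpoint_name="<YOUR-ENDPOINT-NAME>",  # placeholder, replace with your endpoint name
    sagemaker_session=sess,
)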

After our endpoint is deployed we can run inference on it. We will use the predict method from the predictor to run inference on our endpoint.

The endpoint supports the Messages API, which is fully compatible with the OpenAI Chat Completion API. The Messages API allows us to interact with the model in a conversational way: we define the role of each message and its content. The role can be either system, assistant, or user. The system role is used to provide context to the model, and the user role is used to ask questions or provide input to the model.

Generation parameters, such as max_tokens, are passed at the top level of the payload, alongside the messages. Check out the chat completion documentation to find the supported parameters.

{
  "messages": [
    { "role": "system", "content": "You are a helpful assistant." },
    { "role": "user", "content": "What is deep learning?" }
  ]
}

# Prompt to generate
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is deep learning in one sentence?"},
]

# Generation arguments https://platform.openai.com/docs/api-reference/chat/create
parameters = {
    "max_tokens": 100,
}

Okay, let's test it.

chat = llm.predict({"messages": messages, **parameters})

print(chat["choices"][0]["message"]["content"].strip())
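
Since the Messages API is enabled, the endpoint is not tied to the sagemaker SDK: any client that can send a JSON request to the SageMaker runtime can use it. As an illustration, here is a minimal sketch using boto3's sagemaker-runtime client with the same payload; the endpoint name is taken from the predictor created above.

import json
import boto3

smr_client = boto3.client("sagemaker-runtime")

# Send the same Messages API payload directly through the SageMaker runtime
response = smr_client.invoke_endpoint(
    EndpointName=llm.endpoint_name,
    ContentType="application/json",
    Body=json.dumps({"messages": messages, "max_tokens": 100}),
)

result = json.loads(response["Body"].read())
print(result["choices"][0]["message"]["content"].strip())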

4. Clean up

To clean up, we can delete the model and endpoint.

llm.delete_model()
llm.delete_endpoint()