Amazon SageMaker deployment
Deploying a SageMaker endpoint on an ml.g5.2xlarge instance, as shown in the provided code sample, fails with a CUDA out-of-memory error. It appears the minimum configuration for this endpoint is an ml.g5.48xlarge instance, which has 8 GPUs.
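For context, here is a minimal sketch of how such an endpoint might be deployed with the SageMaker Python SDK and the Hugging Face TGI container. The model ID, container version, and role setup are assumptions for illustration, not taken from the original code sample:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Assumption: running inside a SageMaker environment with an execution role configured.
role = sagemaker.get_execution_role()

# Assumption: the Hugging Face TGI (text-generation-inference) container;
# the version string may differ in your region/account.
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.3.3")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed model ID
        "SM_NUM_GPUS": "8",  # g5.48xlarge has 8x A10G
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",  # smaller g5 sizes OOM per this thread
    container_startup_health_check_timeout=600,  # large model, slow startup
)

print(predictor.predict({"inputs": "Hello, Mixtral!"}))
```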
That shouldn't be the case; otherwise SageMaker is bad.
@teknium Why? ml.g5.2xlarge is a 1x A10 (24 GB) instance; it shouldn't fit. This is the unquantized model.
For what it's worth, deploying Mixtral 8x7B through vLLM on 4x A10 CUDA OOMs for me as well, so g5.24xlarge (4x A10) doesn't cut it either. It has to be g5.48xlarge.
The AWQ version runs fine on 4xA10 though.
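For anyone who wants to reproduce that 4x A10 setup, here is a minimal vLLM sketch. The AWQ checkpoint name is an assumption; swap in whichever AWQ export you actually use:

```python
from vllm import LLM, SamplingParams

# Assumption: a community AWQ export of Mixtral 8x7B.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    tensor_parallel_size=4,  # shard across the 4x A10 of a g5.24xlarge
    dtype="float16",         # AWQ kernels run in fp16
)

outputs = llm.generate(
    ["Explain mixture-of-experts in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```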
The provided example inference code, which jdmiwx mentioned, does actually quantize to 4-bit. But even in fp16, Mixtral should fit on his 8-GPU setup: with 8x 24 GB you have roughly 2x the VRAM needed to run it.
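A quick back-of-the-envelope check of that claim, using the commonly cited ~46.7B total parameter count for Mixtral 8x7B (weights only; KV cache and activations add more on top):

```python
# Rough VRAM estimate for Mixtral 8x7B weights in fp16.
params_b = 46.7       # total parameters, in billions (approximate)
bytes_per_param = 2   # fp16

weights_gb = params_b * bytes_per_param  # ~93 GB of weights
available_gb = 8 * 24                    # g5.48xlarge: 8x A10G, 24 GB each

print(f"fp16 weights: ~{weights_gb:.0f} GB, available: {available_gb} GB")
# -> fp16 weights: ~93 GB, available: 192 GB  (roughly 2x headroom)
```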