Amazon SageMaker deployment
Deploying a SageMaker endpoint on an ml.g5.2xlarge instance, as shown in the provided code sample, fails with a CUDA out-of-memory error. It appears the minimum configuration for this endpoint is an ml.g5.48xlarge instance, which has 8 GPUs.
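For context, here is a minimal sketch of how such an endpoint might be deployed with the SageMaker Python SDK and the Hugging Face TGI container. The model ID, container version, and role setup are assumptions for illustration, not taken from the original code sample:

```python
import sagemaker
from sagemaker.huggingface import HuggingFaceModel, get_huggingface_llm_image_uri

# Assumption: running inside a SageMaker environment with an execution role configured.
role = sagemaker.get_execution_role()

# Assumption: the Hugging Face TGI (text-generation-inference) container;
# the version string may differ in your region/account.
image_uri = get_huggingface_llm_image_uri("huggingface", version="1.3.3")

model = HuggingFaceModel(
    role=role,
    image_uri=image_uri,
    env={
        "HF_MODEL_ID": "mistralai/Mixtral-8x7B-Instruct-v0.1",  # assumed model ID
        "SM_NUM_GPUS": "8",  # g5.48xlarge has 8x A10G
    },
)

predictor = model.deploy(
    initial_instance_count=1,
    instance_type="ml.g5.48xlarge",  # smaller g5 sizes OOM per this thread
    container_startup_health_check_timeout=600,  # large model, slow startup
)

print(predictor.predict({"inputs": "Hello, Mixtral!"}))
```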
That shouldn't be the case; otherwise SageMaker is bad.
@teknium Why? ml.g5.2xlarge is a 1x A10 (24 GB) instance; it shouldn't fit. This is the unquantized model.
For what it's worth, deploying Mixtral 8x7B through vLLM on 4x A10 CUDA OOMs for me as well, so g5.24xlarge (4x A10) doesn't cut it either. It has to be g5.48xlarge.
The AWQ version runs fine on 4xA10 though.
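For anyone who wants to reproduce that 4x A10 setup, here is a minimal vLLM sketch. The AWQ checkpoint name is an assumption; swap in whichever AWQ export you actually use:

```python
from vllm import LLM, SamplingParams

# Assumption: a community AWQ export of Mixtral 8x7B.
llm = LLM(
    model="TheBloke/Mixtral-8x7B-Instruct-v0.1-AWQ",
    quantization="awq",
    tensor_parallel_size=4,  # shard across the 4x A10 of a g5.24xlarge
    dtype="float16",         # AWQ kernels run in fp16
)

outputs = llm.generate(
    ["Explain mixture-of-experts in one sentence."],
    SamplingParams(max_tokens=64),
)
print(outputs[0].outputs[0].text)
```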
The provided example inference code, which jdmiwx mentioned, does actually quantize to 4-bit. But even in fp16, Mixtral should fit on his 8-GPU setup: with 8x 24 GB you have roughly 2x the VRAM needed to run it.
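A quick back-of-the-envelope check of that claim, using the commonly cited ~46.7B total parameter count for Mixtral 8x7B (weights only; KV cache and activations add more on top):

```python
# Rough VRAM estimate for Mixtral 8x7B weights in fp16.
params_b = 46.7       # total parameters, in billions (approximate)
bytes_per_param = 2   # fp16

weights_gb = params_b * bytes_per_param  # ~93 GB of weights
available_gb = 8 * 24                    # g5.48xlarge: 8x A10G, 24 GB each

print(f"fp16 weights: ~{weights_gb:.0f} GB, available: {available_gb} GB")
# -> fp16 weights: ~93 GB, available: 192 GB  (roughly 2x headroom)
```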