Production-ready DeepSeek R1 GGUF deployment instructions (with CPU offloading) on AWS (10x cheaper than Bedrock imports)

#44
by samagra14 - opened

We have been deploying DeepSeek R1 GGUF quants for a lot of companies, ranging from startups to enterprises. Here is a guide that lets you tinker with the GPU type and the CPU-offloading memory parameters so you can tune the service for your own cost/latency trade-off.

Even without GPUs, we observe decent throughput: about 5 tokens/sec for a single request, and aggregate throughput goes up with concurrent requests.
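For a sense of the knobs involved, here is a minimal sketch of a llama.cpp server launch with partial GPU offloading. The model path, layer count, context size, and thread count are assumptions for illustration; see the full guide below for the values we actually deploy with.

```shell
# Hypothetical example: serve a GGUF quant with some layers on GPU,
# the rest in system RAM (pure CPU if -ngl is 0).
#
# -m   : path to the GGUF quant (assumed path, not from the guide)
# -ngl : number of transformer layers offloaded to the GPU
# -c   : context window size
# -t   : CPU threads for the layers kept in system RAM
llama-server \
  -m /models/DeepSeek-R1-Q4_K_M.gguf \
  -ngl 20 \
  -c 8192 \
  -t 32 \
  --host 0.0.0.0 \
  --port 8080
```

Once running, llama-server exposes an OpenAI-compatible endpoint, so you can point existing clients at `http://<host>:8080/v1/chat/completions`. Raising `-ngl` shifts cost toward the GPU and lowers latency; lowering it (or dropping the GPU entirely) trades latency for cheaper instances.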

Find the full guide here - https://tensorfuse.io/docs/guides/integrations/llama_cpp
