Production-ready DeepSeek R1 GGUF deployment instructions (with CPU offloading) on AWS (10x cheaper than Bedrock imports)
#44 by samagra14 - opened
We have been deploying DeepSeek R1 GGUF quants for a range of companies, from startups to enterprises. Here is a guide that lets you tune the GPU type and the CPU-offloading memory parameters so you can adapt the service to your own cost/latency trade-off.
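As a rough sketch of the kind of knob this tuning involves: llama.cpp's server lets you choose how many layers stay on the GPU while the remainder is offloaded to CPU RAM. The model path, layer count, and context size below are placeholders, not values from the guide — pick them based on your GPU's VRAM and your cost/latency target.

```shell
# Sketch: serve a DeepSeek R1 GGUF quant with partial GPU offload via llama.cpp.
# -ngl sets how many layers are kept on the GPU (0 = fully CPU); raising it
# shifts cost toward the GPU and lowers latency. Paths/values are placeholders.
llama-server \
  -m /models/DeepSeek-R1-Q4_K_M.gguf \
  -ngl 24 \
  --ctx-size 8192 \
  --host 0.0.0.0 --port 8080
```

Once running, the server exposes an OpenAI-compatible HTTP endpoint you can point existing clients at.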
Even without GPUs, we observe decent throughput: roughly 5 tokens/sec for a single request, which increases with concurrent requests.
Find the full guide here - https://tensorfuse.io/docs/guides/integrations/llama_cpp