Production-ready DeepSeek R1 GGUF deployment instructions (with CPU offloading) on AWS (10x cheaper than Bedrock imports)

#44
by samagra14 - opened

We have been deploying DeepSeek R1 GGUF quants for a lot of companies, ranging from startups to enterprises. Here is a guide that lets you tinker with the GPU type and the CPU-offloading memory parameters so you can tune the service for your own cost/latency trade-off.

Even without GPUs, we observe decent throughput: about 5 tokens/sec for a single request, and aggregate throughput goes up with concurrent requests.
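For a sense of the knobs involved, here is a minimal sketch of a llama.cpp server launch with partial GPU offloading. The model path, layer count, context size, and thread count are assumptions for illustration; see the full guide below for the values we actually deploy with.

```shell
# Hypothetical example: serve a GGUF quant with some layers on GPU,
# the rest in system RAM (pure CPU if -ngl is 0).
#
# -m   : path to the GGUF quant (assumed path, not from the guide)
# -ngl : number of transformer layers offloaded to the GPU
# -c   : context window size
# -t   : CPU threads for the layers kept in system RAM
llama-server \
  -m /models/DeepSeek-R1-Q4_K_M.gguf \
  -ngl 20 \
  -c 8192 \
  -t 32 \
  --host 0.0.0.0 \
  --port 8080
```

Once running, llama-server exposes an OpenAI-compatible endpoint, so you can point existing clients at `http://<host>:8080/v1/chat/completions`. Raising `-ngl` shifts cost toward the GPU and lowers latency; lowering it (or dropping the GPU entirely) trades latency for cheaper instances.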

Find the full guide here - https://tensorfuse.io/docs/guides/integrations/llama_cpp
