Post
2502
I was puzzled by the scope of 🐋DeepSeek🐋 projects, i.e. why they built (then open sourced) so many pieces which are all over their technology stack. Good engineers are minimalists. They build only when they have to.
Then I realized that FP8 should be the main driving force here. So your raw inter-GPU bandwidth is cut in half (H800). But if you compress your data presentation from 16 bits to 8 bits, then the effective throughput of your workload stays unchanged!
The idea is simple but lots of work had to be done. Their v3 technical report will give you a wholistic view (better than reading the code). To summarize, data structure is the foundation to any software. Since FP8 was new and untried, the ecosystem wasn't there. So DeepSeek became the trailblazer. Before cooking your meals, you need to till the land, grow crops, and grind the flour 😅
Then I realized that FP8 should be the main driving force here. So your raw inter-GPU bandwidth is cut in half (H800). But if you compress your data presentation from 16 bits to 8 bits, then the effective throughput of your workload stays unchanged!
The idea is simple but lots of work had to be done. Their v3 technical report will give you a wholistic view (better than reading the code). To summarize, data structure is the foundation to any software. Since FP8 was new and untried, the ecosystem wasn't there. So DeepSeek became the trailblazer. Before cooking your meals, you need to till the land, grow crops, and grind the flour 😅