Few Errors #86 - opened by gordicaleksa
Awesome work! And thanks for linking my Flash Attention blog post :)
Caught a few errors while reading (WIP - will add more as I go through the whole thing):
Typos:
- Cheatsheet glossary: ep -> "expert parallelism degree" not "context parallelism degree"
- "PROFILING THE MEMORY USAGE" -> "througho ut training" -> "throughout training"
- "extremely usefull" -> "extremely useful"
- "attention module will requires" -> "require"
- "the memory savings in activations when using TP with SP helps us fit far bigger batches than TP alone" mentioned twice (in succession) in the summarization section of the TP/SP chapter, i.e. bullet points 2 & 3 are the same
- "As you can see, ZeRO-3 and PP sove" -> "solve"
- "need to be balanced in Pipaline Parallelism," -> "Pipeline"
- "that are actually used to distribute and training larger" -> "train larger"
- "Efficiently accessing data from global memory can improve a lot the performance." -> "can improve performance by a lot"
- "Let's briefly mentionned" -> "Let's briefly go through"
- "For float16 it is ..." -> there is a weird tilda (~) over 10^-3 here
(note: maybe just pass it once through Grammarly free :) you can just Ctrl+F the strings on the left side to find matches for the errors I found)
Logic:
- "Throughput Scaling with TP/SP (3B Model)" -> for TP=32 you get 41.4% whereas for TP=16 you get 43.4% (so it gets better :) despite the chart & logic showing the opposite)
- In general I'm a bit suspicious of the TP vs TP/SP throughput scaling / maximum batch size plots: it seems like for TP=32 you can have 5x the batch size just due to SP? (I try a rough back-of-the-envelope check in the first sketch after this list.)
- "Looking at the figure above, we notice something interesting: while the model parameters are nicely split across GPUs, the activation memory remains the same on each GPU! This is because each GPU still needs to process the full batch of data, just with different layers" <- pipeline parallelism, this doesn't make sense? Activations for only a subset of layers now need to be kept on the GPU. Or if assuming act checkpointing it's the same conclusion, assuming we keep 4 layers per GPU now you need 4 @ X memory (assuming simplistically that you store activations at the beginning of each transformer layer) vs 4 @ X @ PP where PP is the number of stages in pipeline parallelism (note: using @ bc of rendering issues with asterisk).
- The final table in the "5D parallelism in a nutshell" section has errors in the "Disadvantage" and "Parallel/sharding dimension" columns for ZeRO-1, ZeRO-2, and ZeRO-3.
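
To sanity-check the TP vs TP/SP batch-size gap, here's a rough back-of-the-envelope sketch using the per-layer activation-memory formulas from Korthikanti et al. (2022), "Reducing Activation Recomputation in Large Transformer Models" (fp16 activations); the model config and the FlashAttention toggle below are just illustrative guesses on my side, not the playbook's actual 3B benchmark setup:

```python
# Per-layer activation memory with t-way TP, with and without sequence parallelism,
# following the fp16 formulas in Korthikanti et al. (2022).
# s = sequence length, h = hidden size, a = attention heads, b = micro-batch size.
# All config values are assumptions for illustration only.
s, h, a, b = 4096, 3072, 32, 1

def act_tp(t, flash=True):
    # TP only: LayerNorm/dropout inputs stay replicated across TP ranks -> the constant "10" term.
    attn_scores = 0 if flash else 5 * a * s / (h * t)  # attention matrices; not materialized with FlashAttention
    return s * b * h * (10 + 24 / t + attn_scores)

def act_tp_sp(t, flash=True):
    # TP + SP: the previously replicated activations are sharded along the sequence dimension.
    attn_scores = 0 if flash else 5 * a * s / (h * t)
    return s * b * h * (34 / t + attn_scores)

for t in (8, 16, 32):
    print(f"TP={t}: TP-only vs TP+SP activation memory per layer ≈ {act_tp(t) / act_tp_sp(t):.1f}x")
```

This only covers the activation side of the ledger, of course; how it translates into the maximum batch size also depends on how much memory already goes to parameter shards, gradients and optimizer states.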
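
And a minimal sketch of the pipeline-parallelism activation argument, under the same simplification as in the bullet above (full activation checkpointing, i.e. only each layer's input is kept); the layer and stage counts are made up for illustration:

```python
# Per-GPU activation checkpoints with and without pipeline parallelism,
# assuming full recomputation so only one checkpoint of size X is kept per layer.
# Numbers are illustrative, not the playbook's.
X = 1.0                                # memory of one layer-boundary activation (arbitrary unit)
total_layers = 32                      # assumed model depth
pp = 8                                 # assumed number of pipeline stages
layers_per_stage = total_layers // pp  # 4 layers per GPU in this example

without_pp = total_layers * X          # a single GPU holds checkpoints for every layer
with_pp = layers_per_stage * X         # each stage holds only its own layers' checkpoints

print(f"checkpoints per GPU without PP: {without_pp:.0f} X")
print(f"checkpoints per GPU with PP={pp}: {with_pp:.0f} X ({without_pp / with_pp:.0f}x less)")
```

Maybe the intended point was about in-flight micro-batches (e.g. the first stage in 1F1B holding activations for up to PP micro-batches at once), but if so the text should say that explicitly.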