Today we are introducing YaFSDP, Yandex's tool for efficient distributed LLM training. YaFSDP can be used in conjunction with Hugging Face workflows and is up to 25% faster than FSDP.
Recently, we open-sourced YaFSDP, Yandex’s tool for efficient distributed training of LLMs.
Here are some of the key ideas used in YaFSDP to provide speedup and memory savings over FSDP:

- Allocate and use just two buffers throughout the transformer for all gathered weights, circumventing the torch memory allocator (see the sketch after this list);
- Gather small normalization layers at the beginning of the iteration and average their gradients only at the end;
- Move gradient division to the very end of the backward pass.
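To make the first idea concrete, here is a minimal single-process sketch of the double-buffer scheme. It is not YaFSDP's actual implementation: the shapes, the `gather_into` helper, and the simulated gather are illustrative stand-ins, and in a real multi-GPU run the gather would be `torch.distributed.all_gather_into_tensor` issued on a separate CUDA stream so communication overlaps with compute.

```python
import torch

# Hypothetical sizes for illustration only.
NUM_LAYERS = 4
SHARD_NUMEL = 1024          # per-rank shard size of one layer's flattened weights
WORLD_SIZE = 8              # number of data-parallel ranks
FULL_NUMEL = SHARD_NUMEL * WORLD_SIZE

# Two persistent buffers reused for every transformer layer, so the torch
# caching allocator never has to serve per-layer allocations for gathered weights.
buffers = [torch.empty(FULL_NUMEL), torch.empty(FULL_NUMEL)]

# Per-rank shards of each layer's weights (stand-ins for the real parameters).
shards = [torch.randn(SHARD_NUMEL) for _ in range(NUM_LAYERS)]


def gather_into(buf: torch.Tensor, shard: torch.Tensor) -> None:
    """Stand-in for an all-gather: in a real setup this would be
    torch.distributed.all_gather_into_tensor(buf, shard)."""
    buf.copy_(shard.repeat(WORLD_SIZE))


# Prefetch layer 0 into buffer 0 before the loop starts.
gather_into(buffers[0], shards[0])

for i in range(NUM_LAYERS):
    current = buffers[i % 2]
    # While layer i computes from `current`, prefetch layer i + 1 into the
    # other buffer so the gather overlaps with computation.
    if i + 1 < NUM_LAYERS:
        gather_into(buffers[(i + 1) % 2], shards[i + 1])
    # The forward pass of layer i would view `current` as its full weights here.
    _ = current.sum()
```

Because the same two buffers are reused for every layer, peak memory for gathered weights stays constant in the number of layers, and the allocator never fragments from repeated per-layer allocations and frees.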