By sharding the model parameters, optimizer and gradient states, and even offloading them to the CPU when they're inactive, FSDP can reduce the high cost of large-scale training.
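As a rough sketch of what this looks like at the PyTorch level (the exact setup in this guide may differ, and the model here is only a placeholder), full sharding plus CPU offload can be requested when wrapping a model with PyTorch's native FSDP class:

```py
# Minimal sketch: wrap a model with PyTorch FSDP using full sharding and CPU offload.
# Assumes a distributed process group has already been initialized (e.g. via torchrun).
import torch.nn as nn
from torch.distributed.fsdp import (
    FullyShardedDataParallel as FSDP,
    ShardingStrategy,
    CPUOffload,
)

model = nn.Transformer()  # placeholder; in practice this is a large pretrained model

fsdp_model = FSDP(
    model,
    sharding_strategy=ShardingStrategy.FULL_SHARD,  # shard parameters, gradients, and optimizer states
    cpu_offload=CPUOffload(offload_params=True),    # move sharded parameters to CPU when not in use
)
```

With `FULL_SHARD`, each rank only holds a slice of the parameters, gradients, and optimizer states, and the optional CPU offload trades extra host-device transfers for a lower peak GPU memory footprint.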