NeuZip: Memory-Efficient Training and Inference with Dynamic Compression of Neural Networks
Abstract
The performance of neural networks improves as more parameters are used. However, model sizes are constrained by the available on-device memory during training and inference. Although techniques such as quantization can alleviate this constraint, they suffer from performance degradation. In this work, we introduce NeuZip, a new weight compression scheme based on the entropy of floating-point numbers in neural networks. With NeuZip, we achieve memory-efficient training and inference without sacrificing performance. Notably, we significantly reduce the memory footprint of training a Llama-3 8B model from 31GB to less than 16GB, while keeping the training dynamics fully unchanged. In inference, our method reduces memory usage by more than half while maintaining near-lossless performance. Our code is publicly available.
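To make the core idea concrete, here is a minimal, hypothetical sketch (not the authors' implementation): trained weights concentrate around zero, so the exponent bits of their floating-point representation carry little entropy and compress well losslessly. The snippet measures the Shannon entropy of the sign, exponent, and mantissa fields of bfloat16 weights and round-trips the low-entropy exponent bytes through a generic compressor; `zlib` stands in for whatever fast entropy coder a real system would use, and the `0.02`-scaled Gaussian matrix stands in for a trained weight matrix.

```python
import math
import zlib
from collections import Counter

import numpy as np
import torch

# Stand-in for a trained weight matrix: small values concentrated near zero.
weights = torch.randn(1024, 1024) * 0.02
bits = weights.to(torch.bfloat16).view(torch.uint16).numpy()

# bfloat16 layout: 1 sign bit, 8 exponent bits, 7 mantissa bits.
sign     = (bits >> 15) & 0x1
exponent = (bits >> 7)  & 0xFF
mantissa =  bits        & 0x7F

def entropy_bits(values: np.ndarray) -> float:
    """Shannon entropy (bits per symbol) of an integer array."""
    counts = Counter(values.tolist())
    n = values.size
    return -sum(c / n * math.log2(c / n) for c in counts.values())

print(f"sign     entropy: {entropy_bits(sign):.2f} / 1 bit")
print(f"exponent entropy: {entropy_bits(exponent):.2f} / 8 bits")
print(f"mantissa entropy: {entropy_bits(mantissa):.2f} / 7 bits")

# Lossless round trip of the low-entropy exponent bytes. Sign and mantissa
# bits are kept as-is, so the weights are reconstructed exactly.
compressed = zlib.compress(exponent.astype(np.uint8).tobytes(), level=9)
restored = np.frombuffer(zlib.decompress(compressed), dtype=np.uint8)
assert np.array_equal(restored, exponent.astype(np.uint8))
print(f"exponent bytes: {exponent.size} -> {len(compressed)} compressed")
```

Because the round trip is exact, compressing and decompressing layer weights on the fly leaves forward and backward computations bit-identical, which is consistent with the abstract's claim that training dynamics are fully unchanged.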
Community
The following papers, recommended by the Semantic Scholar API via Librarian Bot, are similar to this paper:
- CompAct: Compressed Activations for Memory-Efficient LLM Training (2024)
- COAT: Compressing Optimizer states and Activation for Memory-Efficient FP8 Training (2024)
- Natural GaLore: Accelerating GaLore for memory-efficient LLM Training and Fine-tuning (2024)
- QEFT: Quantization for Efficient Fine-Tuning of LLMs (2024)
- LLM Compression with Neural Architecture Search (2024)