AdAdaGrad: Adaptive Batch Size Schemes for Adaptive Gradient Methods
Abstract
The choice of batch sizes in stochastic gradient optimizers is critical for model training. However, the practice of varying batch sizes throughout the training process is less explored compared to other hyperparameters. We investigate adaptive batch size strategies derived from adaptive sampling methods, traditionally applied only in stochastic gradient descent. Given the significant interplay between learning rates and batch sizes, and considering the prevalence of adaptive gradient methods in deep learning, we emphasize the need for adaptive batch size strategies in these contexts. We introduce AdAdaGrad and its scalar variant AdAdaGradNorm, which incrementally increase batch sizes during training, while model updates are performed using AdaGrad and AdaGradNorm. We prove that AdaGradNorm converges with high probability at a rate of O(1/K) for finding a first-order stationary point of smooth nonconvex functions within K iterations. AdaGrad also demonstrates similar convergence properties when integrated with a novel coordinate-wise variant of our adaptive batch size strategies. Our theoretical claims are supported by numerical experiments on various image classification tasks, highlighting the enhanced adaptability of progressive batching protocols in deep learning and the potential of such adaptive batch size strategies with adaptive gradient optimizers in large-scale model training.
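The sketch below illustrates, at a high level, the kind of progressive-batching loop the abstract describes: model updates use AdaGrad, and the batch size is increased when a variance-based sampling criterion suggests the current batch is too small. This is a minimal illustration, not the authors' implementation; the helper `per_sample_grad_stats`, the threshold `theta`, and the batch-doubling rule are hypothetical stand-ins for the paper's exact test.

```python
# Minimal sketch (not the authors' code) of a progressive-batching loop in the
# spirit of AdAdaGradNorm. Assumptions: a PyTorch model/dataset, a "norm test"
# surrogate computed from two half-batch gradients, and a simple doubling rule.
import torch
from torch.utils.data import DataLoader


def per_sample_grad_stats(model, loss_fn, xb, yb):
    """Estimate the squared norm of the mean gradient and the gradient variance
    by splitting the batch in half (a cheap surrogate for per-sample gradients)."""
    halves = torch.chunk(torch.arange(len(xb)), 2)
    grads = []
    for idx in halves:
        model.zero_grad()
        loss_fn(model(xb[idx]), yb[idx]).backward()
        grads.append(torch.cat([p.grad.flatten() for p in model.parameters()]))
    g_mean = 0.5 * (grads[0] + grads[1])
    var_est = 0.5 * ((grads[0] - g_mean).norm() ** 2 + (grads[1] - g_mean).norm() ** 2)
    return g_mean.norm() ** 2, var_est


def train(model, dataset, loss_fn, epochs=10, batch_size=32,
          max_batch=4096, theta=1.0, lr=0.01):
    opt = torch.optim.Adagrad(model.parameters(), lr=lr)  # AdaGrad updates
    for _ in range(epochs):
        loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
        for xb, yb in loader:
            grad_sq, var_est = per_sample_grad_stats(model, loss_fn, xb, yb)
            # Approximate norm test: if the per-sample variance (scaled by the
            # batch size) dominates the squared gradient norm, grow the batch.
            if var_est / batch_size > theta * grad_sq and batch_size < max_batch:
                batch_size = min(2 * batch_size, max_batch)
                break  # end this pass early; the next epoch uses the larger batch
            # Otherwise take an AdaGrad step on the full mini-batch.
            model.zero_grad()
            loss_fn(model(xb), yb).backward()
            opt.step()
    return model
```

In this simplified loop the batch size only changes between data passes; the paper's schemes test the sampling criterion more carefully (including a coordinate-wise variant), but the overall structure of interleaving an adaptive-sampling test with AdaGrad-style updates is the same.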