arXiv:2401.08426

GD doesn't make the cut: Three ways that non-differentiability affects neural network training

Published on Jan 16, 2024
Abstract

This paper investigates the distinctions between gradient methods applied to non-differentiable functions (NGDMs) and classical gradient descent (GD) designed for differentiable functions. First, we demonstrate significant differences between the convergence properties of NGDMs and GD, challenging the applicability to non-smooth neural networks of the extensive convergence literature that rests on L-smoothness. Next, we establish the paradoxical nature of NGDM solutions to L_1-regularized problems: increasing the regularization penalty increases the L_1 norm of the solutions that NGDMs find. Consequently, widely adopted L_1 penalization-based techniques for network pruning do not yield the expected results. Finally, we examine the Edge of Stability phenomenon, showing that it fails to hold even for Lipschitz continuous convex differentiable functions, which leaves its relevance to non-convex non-differentiable neural networks inconclusive. Our analysis exposes misguided interpretations of NGDMs in widely referenced papers and texts, rooted in an overreliance on strong smoothness assumptions, and underscores the need for a nuanced understanding of the foundational assumptions behind these analyses.
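
To make the L_1 point concrete, below is a minimal sketch, not from the paper, contrasting plain subgradient descent (using sign(x) as the subgradient of |x|, with value 0 at the origin, as autodiff frameworks do) with proximal gradient descent (ISTA) on an L_1-penalized least-squares problem. The data, step size, and penalty weight lam are illustrative assumptions; the sketch only shows why plain (sub)gradient iterates do not land exactly on sparse solutions, which is one ingredient in why L_1-based pruning via NGDMs can behave unexpectedly.

```python
import numpy as np

# Illustrative sketch (not the paper's experiment): plain subgradient
# descent vs. proximal gradient (ISTA) on min_x 0.5*||Ax - b||^2 + lam*||x||_1.

rng = np.random.default_rng(0)
A = rng.normal(size=(50, 20))
x_true = np.zeros(20)
x_true[:3] = [2.0, -1.5, 1.0]               # sparse ground truth
b = A @ x_true + 0.01 * rng.normal(size=50)

lam = 1.0                                    # assumed L1 penalty weight
step = 1.0 / np.linalg.norm(A, 2) ** 2       # 1/L for the smooth part

def subgrad_descent(iters=5000):
    # sign(x) is a subgradient of |x|; np.sign(0) = 0, matching autodiff.
    x = np.zeros(20)
    for _ in range(iters):
        g = A.T @ (A @ x - b) + lam * np.sign(x)
        x = x - step * g
    return x

def ista(iters=5000):
    # Proximal gradient: soft-thresholding handles the L1 term exactly.
    x = np.zeros(20)
    for _ in range(iters):
        z = x - step * (A.T @ (A @ x - b))
        x = np.sign(z) * np.maximum(np.abs(z) - step * lam, 0.0)
    return x

x_sg, x_prox = subgrad_descent(), ista()
print("subgradient: nonzeros =", np.sum(x_sg != 0), " L1 =", np.abs(x_sg).sum())
print("proximal:    nonzeros =", np.sum(x_prox != 0), " L1 =", np.abs(x_prox).sum())
```

Typically the subgradient iterate stays fully dense (all 20 coordinates nonzero), because a constant-step subgradient update oscillates around zero rather than reaching it, while ISTA's soft-thresholding step sets small coordinates exactly to zero and recovers the sparse support.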
