Conference Paper: Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates
Field | Value |
---|---|
Title | Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates |
Authors | Lu, Miao; Wu, Beining; Yang, Xiaodong; Zou, Difan |
Issue Date | 7-May-2024 |
Abstract | In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by stochastic gradient descent (SGD) with large learning rates. Under such a training regime, our finding is that, the oscillation of the NN weights caused by SGD with large learning rates turns out to be beneficial to the generalization of the NN, potentially improving over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon “benign oscillation”. Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small ℓ2-norm and appear in each data point; (ii) strong features which have a large ℓ2-norm but appear only in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but make little progress in learning the weak features. Consequently, when it comes to the new testing data points that consist of only weak features, the NN trained by oscillating SGD with a large learning rate can still make correct predictions, while the NN trained by SGD with a small learning rate could not. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our findings on the phenomenon of “benign oscillation”. |
Persistent Identifier | http://hdl.handle.net/10722/348202 |
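
The abstract above lays out a concrete setup: a feature-noise data model (weak features of small ℓ2-norm present in every data point, strong features of large ℓ2-norm present in only a fraction of points, plus noise) and a comparison of SGD with large versus small learning rates, evaluated on test points that contain only weak features. Below is a minimal NumPy sketch of that setup, not the authors' code: the dimension, feature norms, noise level, network width, batch size, and the two learning rates are illustrative assumptions, and whether the benign-oscillation effect shows up in this toy run depends on those choices.

```python
# Illustrative sketch (assumed hyperparameters throughout) of the feature-noise
# data model and the large- vs. small-learning-rate SGD comparison described
# in the abstract.
import numpy as np

rng = np.random.default_rng(0)
d, n, p_strong = 50, 512, 0.5        # input dim, train size, strong-feature fraction (assumptions)
v_weak = np.zeros(d)
v_weak[0] = 1.0                      # weak feature: small l2-norm, appears in every point
v_strong = np.zeros(d)
v_strong[1] = 5.0                    # strong feature: large l2-norm, appears in a fraction of points

def make_data(n, weak_only=False):
    """Sample (x, y) from the toy feature-noise model."""
    y = rng.choice([-1.0, 1.0], size=n)
    x = y[:, None] * v_weak[None, :]                    # weak feature in every point
    if not weak_only:
        mask = rng.random(n) < p_strong                 # strong feature only in a fraction
        x += (mask * y)[:, None] * v_strong[None, :]
    x += 0.1 * rng.standard_normal((n, d))              # noise
    return x, y

def train_sgd(lr, width=32, steps=2000, batch=16):
    """Train a two-layer ReLU net (second layer fixed to +/-1) with mini-batch SGD."""
    X, Y = make_data(n)
    W = 0.01 * rng.standard_normal((width, d))
    a = rng.choice([-1.0, 1.0], size=width) / width
    for _ in range(steps):
        idx = rng.integers(0, n, size=batch)
        xb, yb = X[idx], Y[idx]
        pre = xb @ W.T                                  # (batch, width) pre-activations
        f = np.maximum(pre, 0.0) @ a                    # network output
        margin = np.clip(yb * f, -50.0, 50.0)
        gf = -yb / (1.0 + np.exp(margin))               # d(logistic loss)/d f
        grad_W = ((gf[:, None] * (pre > 0)) * a).T @ xb / batch
        W -= lr * grad_W
    return W, a

def weak_test_accuracy(W, a, n_test=2000):
    """Accuracy on test points that contain only the weak feature plus noise."""
    Xt, Yt = make_data(n_test, weak_only=True)
    f = np.maximum(Xt @ W.T, 0.0) @ a
    return np.mean(np.sign(f) == Yt)

for lr in (0.05, 2.0):                                  # small vs. large learning rate (assumed values)
    W, a = train_sgd(lr)
    print(f"lr={lr:4.2f}  weak-feature test accuracy: {weak_test_accuracy(W, a):.3f}")
```

Fixing the second-layer weights to ±1 and training only the first layer keeps the gradient computation simple, a common simplification in feature-learning analyses; it is an assumption of this sketch rather than a statement about the paper's exact architecture.
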
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Lu, Miao | - |
dc.contributor.author | Wu, Beining | - |
dc.contributor.author | Yang, Xiaodong | - |
dc.contributor.author | Zou, Difan | - |
dc.date.accessioned | 2024-10-08T00:30:57Z | - |
dc.date.available | 2024-10-08T00:30:57Z | - |
dc.date.issued | 2024-05-07 | - |
dc.identifier.uri | http://hdl.handle.net/10722/348202 | - |
dc.description.abstract | In this work, we theoretically investigate the generalization properties of neural networks (NN) trained by stochastic gradient descent (SGD) with large learning rates. Under such a training regime, our finding is that, the oscillation of the NN weights caused by SGD with large learning rates turns out to be beneficial to the generalization of the NN, potentially improving over the same NN trained by SGD with small learning rates that converges more smoothly. In view of this finding, we call such a phenomenon “benign oscillation”. Our theory towards demystifying such a phenomenon builds upon the feature learning perspective of deep learning. Specifically, we consider a feature-noise data generation model that consists of (i) weak features which have a small ℓ2-norm and appear in each data point; (ii) strong features which have a large ℓ2-norm but appear only in a certain fraction of all data points; and (iii) noise. We prove that NNs trained by oscillating SGD with a large learning rate can effectively learn the weak features in the presence of those strong features. In contrast, NNs trained by SGD with a small learning rate can only learn the strong features but make little progress in learning the weak features. Consequently, when it comes to the new testing data points that consist of only weak features, the NN trained by oscillating SGD with a large learning rate can still make correct predictions, while the NN trained by SGD with a small learning rate could not. Our theory sheds light on how large learning rate training benefits the generalization of NNs. Experimental results demonstrate our findings on the phenomenon of “benign oscillation”. | - |
dc.language | eng | - |
dc.relation.ispartof | The Twelfth International Conference on Learning Representations (ICLR) (07/05/2024-11/05/2024, Vienna) | - |
dc.title | Benign Oscillation of Stochastic Gradient Descent with Large Learning Rates | - |
dc.type | Conference_Paper | - |