File Download: There are no files associated with this item.
Links for fulltext (may require subscription):
- Publisher website (DOI): 10.1162/neco_a_01725
- Scopus: eid_2-s2.0-85216908507
- PMID: 39556516
- Web of Science: WOS:001406038400004
Article: Generalization Guarantees of Gradient Descent for Shallow Neural Networks
Field | Value |
---|---|
Title | Generalization Guarantees of Gradient Descent for Shallow Neural Networks |
Authors | Wang, Puyu; Lei, Yunwen; Wang, Di; Ying, Yiming; Zhou, Ding Xuan |
Issue Date | 21-Jan-2025 |
Publisher | Massachusetts Institute of Technology Press |
Citation | Neural Computation, 2025, v. 37, n. 2, p. 344-402 |
Abstract | Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs. |
Persistent Identifier | http://hdl.handle.net/10722/355113 |
ISSN | 0899-7667 (2023 Impact Factor: 2.7; 2023 SCImago Journal Rankings: 0.948) |
ISI Accession Number ID | WOS:001406038400004 |
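
As a quick orientation for the abstract above, the following is a minimal LaTeX sketch of what "network scaling" (normalization of a layer by a scaling factor) and the stated risk rates typically look like. The width m, scaling exponent c, and GD iterate W_T are illustrative assumptions for this sketch, not the paper's own notation.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Illustrative sketch only (assumed notation, not taken from the paper):
% a two-layer network of width m whose hidden layer is normalized by a
% scaling factor m^{-c}; larger c corresponds to stronger scaling.
\[
  f_{W}(x) \;=\; \frac{1}{m^{c}} \sum_{k=1}^{m} a_{k}\,
  \sigma\!\left(\langle w_{k}, x \rangle\right)
\]
% The abstract's guarantees, phrased as an excess risk bound for the GD
% output W_T trained on n samples: O(1/sqrt(n)) in general, and O(1/n)
% under the low-noise condition.
\[
  \mathbb{E}\!\left[R(W_{T})\right] - \inf_{W} R(W)
  \;=\; O\!\left(\tfrac{1}{\sqrt{n}}\right),
  \qquad\text{and } O\!\left(\tfrac{1}{n}\right)\ \text{under low noise.}
\]
\end{document}
```

Here O(1/√n) is the general excess risk rate reported in the abstract, while O(1/n) is the fast rate obtained under the low-noise condition.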
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wang, Puyu | - |
dc.contributor.author | Lei, Yunwen | - |
dc.contributor.author | Wang, Di | - |
dc.contributor.author | Ying, Yiming | - |
dc.contributor.author | Zhou, Ding Xuan | - |
dc.date.accessioned | 2025-03-27T00:35:31Z | - |
dc.date.available | 2025-03-27T00:35:31Z | - |
dc.date.issued | 2025-01-21 | - |
dc.identifier.citation | Neural Computation, 2025, v. 37, n. 2, p. 344-402 | - |
dc.identifier.issn | 0899-7667 | - |
dc.identifier.uri | http://hdl.handle.net/10722/355113 | - |
dc.description.abstract | Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs. | - |
dc.language | eng | - |
dc.publisher | Massachusetts Institute of Technology Press | - |
dc.relation.ispartof | Neural Computation | - |
dc.title | Generalization Guarantees of Gradient Descent for Shallow Neural Networks | - |
dc.type | Article | - |
dc.identifier.doi | 10.1162/neco_a_01725 | - |
dc.identifier.pmid | 39556516 | - |
dc.identifier.scopus | eid_2-s2.0-85216908507 | - |
dc.identifier.volume | 37 | - |
dc.identifier.issue | 2 | - |
dc.identifier.spage | 344 | - |
dc.identifier.epage | 402 | - |
dc.identifier.eissn | 1530-888X | - |
dc.identifier.isi | WOS:001406038400004 | - |
dc.identifier.issnl | 0899-7667 | - |