File Download: There are no files associated with this item.
Links for fulltext (may require subscription):
- Publisher website (DOI): 10.1162/neco_a_01725
- Scopus: eid_2-s2.0-85216908507
- PMID: 39556516
- Web of Science: WOS:001406038400004
Article: Generalization Guarantees of Gradient Descent for Shallow Neural Networks
Field | Value |
---|---|
Title | Generalization Guarantees of Gradient Descent for Shallow Neural Networks |
Authors | Wang, Puyu; Lei, Yunwen; Wang, Di; Ying, Yiming; Zhou, Ding Xuan |
Issue Date | 21-Jan-2025 |
Publisher | Massachusetts Institute of Technology Press |
Citation | Neural Computation, 2025, v. 37, n. 2, p. 344-402 |
Abstract | Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs. |
Persistent Identifier | http://hdl.handle.net/10722/355113 |
ISSN | 0899-7667 (2023 Impact Factor: 2.7; 2023 SCImago Journal Rankings: 0.948) |
ISI Accession Number ID | WOS:001406038400004 |
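
As a quick orientation for the abstract above, the following is a minimal LaTeX sketch of what "network scaling" (normalization of a layer by a scaling factor) and the stated risk rates typically look like. The width m, scaling exponent c, and GD iterate W_T are illustrative assumptions for this sketch, not the paper's own notation.

```latex
\documentclass{article}
\usepackage{amsmath,amssymb}
\begin{document}
% Illustrative sketch only (assumed notation, not taken from the paper):
% a two-layer network of width m whose hidden layer is normalized by a
% scaling factor m^{-c}; larger c corresponds to stronger scaling.
\[
  f_{W}(x) \;=\; \frac{1}{m^{c}} \sum_{k=1}^{m} a_{k}\,
  \sigma\!\left(\langle w_{k}, x \rangle\right)
\]
% The abstract's guarantees, phrased as an excess risk bound for the GD
% output W_T trained on n samples: O(1/sqrt(n)) in general, and O(1/n)
% under the low-noise condition.
\[
  \mathbb{E}\!\left[R(W_{T})\right] - \inf_{W} R(W)
  \;=\; O\!\left(\tfrac{1}{\sqrt{n}}\right),
  \qquad\text{and } O\!\left(\tfrac{1}{n}\right)\ \text{under low noise.}
\]
\end{document}
```

Here O(1/√n) is the general excess risk rate reported in the abstract, while O(1/n) is the fast rate obtained under the low-noise condition.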
DC Field | Value | Language |
---|---|---|
dc.contributor.author | Wang, Puyu | - |
dc.contributor.author | Lei, Yunwen | - |
dc.contributor.author | Wang, Di | - |
dc.contributor.author | Ying, Yiming | - |
dc.contributor.author | Zhou, Ding Xuan | - |
dc.date.accessioned | 2025-03-27T00:35:31Z | - |
dc.date.available | 2025-03-27T00:35:31Z | - |
dc.date.issued | 2025-01-21 | - |
dc.identifier.citation | Neural Computation, 2025, v. 37, n. 2, p. 344-402 | - |
dc.identifier.issn | 0899-7667 | - |
dc.identifier.uri | http://hdl.handle.net/10722/355113 | - |
dc.description.abstract | Significant progress has been made recently in understanding the generalization of neural networks (NNs) trained by gradient descent (GD) using the algorithmic stability approach. However, most of the existing research has focused on one-hidden-layer NNs and has not addressed the impact of different network scaling. Here, network scaling corresponds to the normalization of the layers. In this article, we greatly extend the previous work (Lei et al., 2022; Richards & Kuzborskij, 2021) by conducting a comprehensive stability and generalization analysis of GD for two-layer and three-layer NNs. For two-layer NNs, our results are established under general network scaling, relaxing previous conditions. In the case of three-layer NNs, our technical contribution lies in demonstrating its nearly co-coercive property by utilizing a novel induction strategy that thoroughly explores the effects of overparameterization. As a direct application of our general findings, we derive the excess risk rate of O(1/√n) for GD in both two-layer and three-layer NNs. This sheds light on sufficient or necessary conditions for underparameterized and overparameterized NNs trained by GD to attain the desired risk rate of O(1/√n). Moreover, we demonstrate that as the scaling factor increases or the network complexity decreases, less overparameterization is required for GD to achieve the desired error rates. Additionally, under a low-noise condition, we obtain a fast risk rate of O(1/n) for GD in both two-layer and three-layer NNs. | - |
dc.language | eng | - |
dc.publisher | Massachusetts Institute of Technology Press | - |
dc.relation.ispartof | Neural Computation | - |
dc.title | Generalization Guarantees of Gradient Descent for Shallow Neural Networks | - |
dc.type | Article | - |
dc.identifier.doi | 10.1162/neco_a_01725 | - |
dc.identifier.pmid | 39556516 | - |
dc.identifier.scopus | eid_2-s2.0-85216908507 | - |
dc.identifier.volume | 37 | - |
dc.identifier.issue | 2 | - |
dc.identifier.spage | 344 | - |
dc.identifier.epage | 402 | - |
dc.identifier.eissn | 1530-888X | - |
dc.identifier.isi | WOS:001406038400004 | - |
dc.identifier.issnl | 0899-7667 | - |