Article: Gradient descent optimizes over-parameterized deep ReLU networks

Title: Gradient descent optimizes over-parameterized deep ReLU networks
Authors: Zou, Difan; Cao, Yuan; Zhou, Dongruo; Gu, Quanquan
Keywords: Over-parameterization; Deep neural networks; Global convergence; Gradient descent; Random initialization
Issue Date: 2020
Citation: Machine Learning, 2020, v. 109, n. 3, p. 467-492
Abstract: We study the problem of training deep fully connected neural networks with Rectified Linear Unit (ReLU) activation function and cross entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a certain assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width, and enjoys a faster global convergence rate of gradient descent for training deep neural networks.
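The training setup described in the abstract (Gaussian random initialization, full-batch gradient descent on a deep fully connected ReLU network with cross-entropy loss for binary classification, and iterates that stay close to initialization) can be illustrated with a minimal sketch. This is not the authors' code: the synthetic data, network width, depth, step size, and He-style initialization scale below are assumptions chosen for illustration, not the constants from the paper.

```python
# Minimal sketch (assumed setup, not the paper's code): full-batch gradient
# descent on a deep fully connected ReLU network with Gaussian initialization
# and cross-entropy (logistic) loss for binary classification, tracking how
# far the weights move from their initial values.
import torch
import torch.nn as nn

torch.manual_seed(0)
n, d, width, depth, steps, lr = 200, 10, 1024, 3, 500, 0.05

# Synthetic binary classification data with labels in {0, 1} (illustrative).
X = torch.randn(n, d)
y = (X[:, 0] > 0).float()

# Deep fully connected ReLU network ending in a single logit.
layers, in_dim = [], d
for _ in range(depth):
    layers += [nn.Linear(in_dim, width), nn.ReLU()]
    in_dim = width
layers += [nn.Linear(in_dim, 1)]
net = nn.Sequential(*layers)

# Gaussian random initialization (He-style scale, an illustrative choice).
for layer in net.modules():
    if isinstance(layer, nn.Linear):
        nn.init.normal_(layer.weight, std=(2.0 / layer.in_features) ** 0.5)
        nn.init.zeros_(layer.bias)

init_params = [p.detach().clone() for p in net.parameters()]
loss_fn = nn.BCEWithLogitsLoss()
opt = torch.optim.SGD(net.parameters(), lr=lr)

# Full-batch gradient descent on the cross-entropy loss.
for _ in range(steps):
    opt.zero_grad()
    loss = loss_fn(net(X).squeeze(1), y)
    loss.backward()
    opt.step()

# Distance of the final iterate from initialization: the analysis argues that
# for sufficiently wide networks this stays small, so the iterates remain in a
# benign perturbation region around the Gaussian initialization.
dist = sum((p - p0).norm() ** 2 for p, p0 in zip(net.parameters(), init_params)) ** 0.5
print(f"final loss = {loss.item():.4f}, ||W_T - W_0||_F = {dist.item():.4f}")
```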
Persistent Identifier: http://hdl.handle.net/10722/303629
ISSN: 0885-6125 (print); 1573-0565 (electronic)
2021 Impact Factor: 5.414
2020 SCImago Journal Rankings: 0.667
ISI Accession Number ID: WOS:000494074500001


DC Field: Value
dc.contributor.author: Zou, Difan
dc.contributor.author: Cao, Yuan
dc.contributor.author: Zhou, Dongruo
dc.contributor.author: Gu, Quanquan
dc.date.accessioned: 2021-09-15T08:25:42Z
dc.date.available: 2021-09-15T08:25:42Z
dc.date.issued: 2020
dc.identifier.citation: Machine Learning, 2020, v. 109, n. 3, p. 467-492
dc.identifier.issn: 0885-6125
dc.identifier.uri: http://hdl.handle.net/10722/303629
dc.description.abstract: We study the problem of training deep fully connected neural networks with Rectified Linear Unit (ReLU) activation function and cross entropy loss function for binary classification using gradient descent. We show that with proper random weight initialization, gradient descent can find the global minima of the training loss for an over-parameterized deep ReLU network, under a certain assumption on the training data. The key idea of our proof is that Gaussian random initialization followed by gradient descent produces a sequence of iterates that stay inside a small perturbation region centered at the initial weights, in which the training loss function of the deep ReLU networks enjoys nice local curvature properties that ensure the global convergence of gradient descent. At the core of our proof technique are (1) a milder assumption on the training data; (2) a sharp analysis of the trajectory length for gradient descent; and (3) a finer characterization of the size of the perturbation region. Compared with the concurrent work (Allen-Zhu et al. in A convergence theory for deep learning via over-parameterization, 2018a; Du et al. in Gradient descent finds global minima of deep neural networks, 2018a) along this line, our result relies on a milder over-parameterization condition on the neural network width, and enjoys a faster global convergence rate of gradient descent for training deep neural networks.
dc.language: eng
dc.relation.ispartof: Machine Learning
dc.subject: Over-parameterization
dc.subject: Deep neural networks
dc.subject: Global convergence
dc.subject: Gradient descent
dc.subject: Random initialization
dc.title: Gradient descent optimizes over-parameterized deep ReLU networks
dc.type: Article
dc.description.nature: link_to_subscribed_fulltext
dc.identifier.doi: 10.1007/s10994-019-05839-6
dc.identifier.scopus: eid_2-s2.0-85074601535
dc.identifier.volume: 109
dc.identifier.issue: 3
dc.identifier.spage: 467
dc.identifier.epage: 492
dc.identifier.eissn: 1573-0565
dc.identifier.isi: WOS:000494074500001