{"title":"随机对角近似最大下降优化中的消失梯度分析","authors":"H. Tan, K. Lim","doi":"10.6688/JISE.202009_36(5).0005","DOIUrl":null,"url":null,"abstract":"Deep learning neural network is often associated with high complexity classification problems by stacking multiple hidden layers between input and output. The measured error is backpropagated layer-by-layer in a network with gradual vanishing gradient value due to the differentiation of activation function. In this paper, Stochastic Diagonal Approximate Greatest Descent (SDAGD) is proposed to tackle the issue of vanishing gradient in the deep learning neural network using the adaptive step length derived based on the second-order derivatives information. The proposed SDAGD optimizer trajectory is demonstrated using three-dimensional error surfaces, i:e: (a) a hilly error surface with two local minima and one global minimum; (b) a deep Gaussian trench to simulate drastic gradient changes experienced with ravine topography and (c) small initial gradient to simulate a plateau terrain. As a result, SDAGD is able to converge at the fastest rate to the global minimum without the interference of vanishing gradient issue as compared to other benchmark optimizers such as Gradient Descent (GD), AdaGrad and AdaDelta. Experiments are tested on saturated and unsaturated activation functions using sequential added hidden layers to evaluate the vanishing gradient mitigation with the proposed optimizer. The experimental results show that SDAGD is able to obtain good performance in the tested deep feedforward networks while stochastic GD obtain worse misclassification error when the network has more than three hidden layers due to the vanishing gradient issue. SDAGD can mitigate the vanishing gradient by adaptively control the step length element in layers using the second-order information. At the constant training iteration setup, SDAGD with ReLU can achieve the lowest misclassification rate of 1.77% as compared to other optimization methods.","PeriodicalId":50177,"journal":{"name":"Journal of Information Science and Engineering","volume":"40 1","pages":"1007-1019"},"PeriodicalIF":0.5000,"publicationDate":"2020-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Vanishing Gradient Analysis in Stochastic Diagonal Approximate Greatest Descent Optimization\",\"authors\":\"H. Tan, K. Lim\",\"doi\":\"10.6688/JISE.202009_36(5).0005\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Deep learning neural network is often associated with high complexity classification problems by stacking multiple hidden layers between input and output. The measured error is backpropagated layer-by-layer in a network with gradual vanishing gradient value due to the differentiation of activation function. In this paper, Stochastic Diagonal Approximate Greatest Descent (SDAGD) is proposed to tackle the issue of vanishing gradient in the deep learning neural network using the adaptive step length derived based on the second-order derivatives information. The proposed SDAGD optimizer trajectory is demonstrated using three-dimensional error surfaces, i:e: (a) a hilly error surface with two local minima and one global minimum; (b) a deep Gaussian trench to simulate drastic gradient changes experienced with ravine topography and (c) small initial gradient to simulate a plateau terrain. 
As a result, SDAGD is able to converge at the fastest rate to the global minimum without the interference of vanishing gradient issue as compared to other benchmark optimizers such as Gradient Descent (GD), AdaGrad and AdaDelta. Experiments are tested on saturated and unsaturated activation functions using sequential added hidden layers to evaluate the vanishing gradient mitigation with the proposed optimizer. The experimental results show that SDAGD is able to obtain good performance in the tested deep feedforward networks while stochastic GD obtain worse misclassification error when the network has more than three hidden layers due to the vanishing gradient issue. SDAGD can mitigate the vanishing gradient by adaptively control the step length element in layers using the second-order information. At the constant training iteration setup, SDAGD with ReLU can achieve the lowest misclassification rate of 1.77% as compared to other optimization methods.\",\"PeriodicalId\":50177,\"journal\":{\"name\":\"Journal of Information Science and Engineering\",\"volume\":\"40 1\",\"pages\":\"1007-1019\"},\"PeriodicalIF\":0.5000,\"publicationDate\":\"2020-09-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Journal of Information Science and Engineering\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://doi.org/10.6688/JISE.202009_36(5).0005\",\"RegionNum\":4,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q4\",\"JCRName\":\"COMPUTER SCIENCE, INFORMATION SYSTEMS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Information Science and Engineering","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.6688/JISE.202009_36(5).0005","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, INFORMATION SYSTEMS","Score":null,"Total":0}
Vanishing Gradient Analysis in Stochastic Diagonal Approximate Greatest Descent Optimization
Deep learning neural networks are often applied to high-complexity classification problems by stacking multiple hidden layers between the input and the output. The measured error is backpropagated layer by layer through the network, and the gradient value gradually vanishes because of the repeated differentiation of the activation function. In this paper, Stochastic Diagonal Approximate Greatest Descent (SDAGD) is proposed to tackle the vanishing gradient issue in deep learning neural networks using an adaptive step length derived from second-order derivative information. The trajectory of the proposed SDAGD optimizer is demonstrated on three-dimensional error surfaces, i.e. (a) a hilly error surface with two local minima and one global minimum; (b) a deep Gaussian trench simulating the drastic gradient changes of ravine topography; and (c) a small initial gradient simulating a plateau terrain. SDAGD converges to the global minimum at the fastest rate, without interference from the vanishing gradient issue, compared with benchmark optimizers such as Gradient Descent (GD), AdaGrad, and AdaDelta. Experiments are conducted on saturated and unsaturated activation functions, with hidden layers added sequentially, to evaluate how well the proposed optimizer mitigates the vanishing gradient. The experimental results show that SDAGD obtains good performance on the tested deep feedforward networks, whereas stochastic GD yields a worse misclassification error when the network has more than three hidden layers because of the vanishing gradient issue. SDAGD mitigates the vanishing gradient by adaptively controlling the step length element in each layer using second-order information. Under a fixed number of training iterations, SDAGD with ReLU achieves the lowest misclassification rate of 1.77% compared with the other optimization methods.
Journal Introduction:
The Journal of Information Science and Engineering is dedicated to the dissemination of information on computer science, computer engineering, and computer systems. The journal encourages articles on original research in the areas of computer hardware, software, man-machine interface, theory, and applications; tutorial papers in the above-mentioned areas; and state-of-the-art papers on various aspects of computer systems and applications.