Pub Date: 2019-03-01 DOI: 10.1109/AICAS.2019.8771625
Kang-Lin Wang, Chi-Bang Kuan, Jiann-Fuh Liaw, Wei-Liang Kuo
Program autotuning has been shown to deliver substantial performance improvements in many compiler usage scenarios. Existing autotuning frameworks support fully customizable configuration representations, a wide variety of representations for domain-specific tuning, and a user-friendly interface between the program and the autotuner. However, tuning a program takes time, whether it is autotuned or tuned manually. Programmers often cannot wait for an autotuner to finish and want reasonably good options they can use immediately. This paper introduces Autopiler, a framework for building non-domain-specific program autotuners that use machine-learning-based recommender systems to predict options. The framework not only supports non-domain-specific tuning techniques but also learns from previous tuning results, so it can recommend adequately good options before any tuning happens. We describe the architecture of Autopiler and show how to leverage a recommender system for compiler options recommendation, so that Autopiler learns from programs and becomes an AI-boosted smart compiler. Experimental results show that Autopiler delivers up to 19.46% performance improvement on in-house 4G LTE modem workloads.
{"title":"Autopiler: An AI Based Framework for Program Autotuning and Options Recommendation","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771625","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
Pub Date: 2019-03-01 DOI: 10.1109/AICAS.2019.8771596
L. Chiou, Tsung-Han Yang, Jian-Tang Syu, Che-Pin Chang, Yeong-Jar Chang
The graphics processing unit (GPU) is widely used in applications that require massive computing resources, such as big data, machine learning, and computer vision. As applications grow more diverse, it becomes difficult for the GPU’s warp scheduler to maintain performance. Most prior studies of the warp scheduler are based on static analysis of GPU hardware behavior for certain types of benchmarks. We propose, for the first time to the best of our knowledge, a machine learning approach that intelligently selects suitable policies for various applications at runtime. Simulation results indicate that the proposed approach maintains performance comparable to the best policy across different applications.
{"title":"Intelligent Policy Selection for GPU Warp Scheduler","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771596","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
Pub Date: 2019-03-01 DOI: 10.1109/AICAS.2019.8771540
Hao Zhang, Jiongrui He, S. Ko
In recent years, many deep neural network accelerator architectures have been proposed to improve the performance of processing deep neural network models. However, memory bandwidth remains the major performance bottleneck of these accelerators. Emerging 3D memory, such as the hybrid memory cube (HMC), and processing-in-memory techniques provide new solutions for deep neural network implementation. In this paper, a novel HMC architecture is proposed for weight-sharing deep convolutional neural networks to address the memory bandwidth bottleneck. The proposed HMC is based on the conventional HMC architecture with only minor changes. In the logic layer, the vault controller is modified to enable parallel vault access. The weight parameters of a pre-trained convolutional neural network are quantized to 16 distinct values. During processing, activations that share the same weight are accumulated, and only the accumulated results are transferred to the processing elements to be multiplied by the weights. With the proposed architecture, data transfer between main memory and the processing elements is reduced, and the throughput of convolution operations is improved by 30% compared to an HMC-based multiply-accumulate design.
{"title":"Improved Hybrid Memory Cube for Weight-Sharing Deep Convolutional Neural Networks","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771540","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
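The accumulate-then-multiply idea in the abstract above can be illustrated with a small sketch. It is not the authors' hardware design: the function names, the 16-entry codebook values, and the sample data are all hypothetical, chosen only to show that binning activations by shared weight index gives the same result as a conventional multiply-accumulate while needing at most one multiplication per codebook entry.

```python
def dot_weight_shared(activations, weight_indices, codebook):
    """Dot product where each weight is an index into a small shared codebook."""
    # Step 1: accumulate activations that share the same weight value.
    bins = [0.0] * len(codebook)
    for a, idx in zip(activations, weight_indices):
        bins[idx] += a
    # Step 2: one multiplication per codebook entry, not one per activation.
    return sum(b * w for b, w in zip(bins, codebook))


def dot_mac(activations, weight_indices, codebook):
    """Reference multiply-accumulate: one multiplication per activation."""
    return sum(a * codebook[idx] for a, idx in zip(activations, weight_indices))


# Hypothetical codebook of 16 shared weight values (illustrative only).
codebook = [i / 8.0 - 1.0 for i in range(16)]
activations = [0.5, 1.0, 2.0, 0.25, 4.0, 1.5]
weight_indices = [3, 15, 3, 0, 15, 8]

shared = dot_weight_shared(activations, weight_indices, codebook)
mac = dot_mac(activations, weight_indices, codebook)
```

In this sketch only the accumulated bins (at most 16 numbers) would need to leave the memory side, which is the intuition behind the reduced data transfer claimed in the abstract.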
Pub Date: 2019-03-01 DOI: 10.1109/AICAS.2019.8771544
T. Hirtzlin, M. Bocquet, Jacques-Olivier Klein, E. Nowak, E. Vianello, J. Portal, D. Querlioz
Resistive random access memories (RRAM) are novel nonvolatile memory technologies that can be embedded at the core of CMOS and could be ideal for the in-memory implementation of deep neural networks. A particularly exciting vision is using them to implement Binarized Neural Networks (BNNs), a class of deep neural networks with a highly reduced memory footprint. The challenge with resistive memories, however, is that they are prone to device variation, which can lead to bit errors. In this work we show, through simulations of networks on the MNIST and CIFAR10 tasks, that BNNs can tolerate these bit errors to an outstanding level. With a standard BNN, a bit error rate of up to 10⁻⁴ can be tolerated with little impact on recognition performance on both MNIST and CIFAR10. We then show that by adapting the training procedure to the fact that the BNN will operate on error-prone hardware, this tolerance can be extended to a bit error rate of 4 × 10⁻². The requirements on RRAM are therefore much less stringent for BNNs than for more traditional applications. We show, based on experimental measurements on an HfO₂ RRAM technology, that this result allows RRAM programming energy to be reduced by a factor of 30.
{"title":"Outstanding Bit Error Tolerance of Resistive RAM-Based Binarized Neural Networks","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771544","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
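To make the notion of bit errors in stored binary weights concrete, the following is a minimal sketch of injecting independent bit flips at a given bit error rate and evaluating a binarized dot product. It is an illustration under assumptions, not the authors' simulation code: the helper names, the vector length, and the fixed seed are hypothetical, and the error-aware training procedure mentioned in the abstract is not shown.

```python
import random


def flip_bits(bits, bit_error_rate, rng):
    """Flip each stored bit independently with probability bit_error_rate."""
    return [b ^ 1 if rng.random() < bit_error_rate else b for b in bits]


def binary_dot(x_bits, w_bits):
    """XNOR-popcount dot product of two vectors encoded as 0 -> -1, 1 -> +1."""
    n = len(x_bits)
    matches = sum(1 for x, w in zip(x_bits, w_bits) if x == w)
    return 2 * matches - n  # equals the sum of the +1/-1 products


rng = random.Random(0)  # fixed seed so the sketch is repeatable
n = 10000
weights = [rng.randint(0, 1) for _ in range(n)]
inputs = [rng.randint(0, 1) for _ in range(n)]

noisy_weights = flip_bits(weights, 4e-2, rng)  # error rate taken from the abstract
flipped = sum(1 for a, b in zip(weights, noisy_weights) if a != b)
clean_out = binary_dot(inputs, weights)
noisy_out = binary_dot(inputs, noisy_weights)
```

Each flipped weight changes the dot product by exactly 2 in magnitude, so the deviation of `noisy_out` from `clean_out` is bounded by `2 * flipped`.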
Pub Date: 2019-02-22 DOI: 10.1109/AICAS.2019.8771527
Andrea Borghesi, Antonio Libri, L. Benini, Andrea Bartolini
Reliability is a persistent problem in the evolution of high-performance computing (HPC) systems and data centers. During operation, several types of fault conditions or anomalies can arise, ranging from malfunctioning hardware to improper configurations or imperfect software. Currently, system administrators and end users have to discover them manually. Clearly this approach does not scale to large supercomputers and facilities: automated methods to detect faults and unhealthy conditions are needed. Our method uses a type of neural network called an autoencoder, trained to learn the normal behavior of a real, in-production HPC system and deployed on the edge of each computing node. We obtain very good accuracy (between 90% and 95%) and also demonstrate that the approach can be deployed on the supercomputer nodes without negatively affecting the performance of the computing units.
{"title":"Online Anomaly Detection in HPC Systems","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771527","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
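Autoencoder-based anomaly detection generally works by flagging samples whose reconstruction error is unusually high compared with healthy data. The sketch below shows only that decision step. It rests on several assumptions that are not in the abstract: the `reconstruct` callable is a stand-in for a trained autoencoder (here it simply returns the mean healthy profile), the threshold rule (a margin times the largest error on healthy samples) is one common choice rather than the authors' rule, and the metric values are invented.

```python
def reconstruction_error(sample, reconstruct):
    """Mean squared error between a sample and its reconstruction."""
    recon = reconstruct(sample)
    return sum((s - r) ** 2 for s, r in zip(sample, recon)) / len(sample)


def fit_threshold(normal_samples, reconstruct, margin=1.5):
    """Assumed rule: margin times the largest error seen on healthy data."""
    return margin * max(reconstruction_error(s, reconstruct) for s in normal_samples)


def is_anomalous(sample, reconstruct, threshold):
    """A sample is flagged when the model cannot reconstruct it well."""
    return reconstruction_error(sample, reconstruct) > threshold


# Hypothetical healthy node metrics (e.g. normalized load, memory, temperature).
normal = [[0.50, 0.30, 0.20], [0.52, 0.28, 0.21], [0.49, 0.31, 0.19]]
mean_profile = [sum(col) / len(normal) for col in zip(*normal)]


def reconstruct(sample):
    """Stand-in for a trained autoencoder: always returns the healthy mean."""
    return mean_profile


threshold = fit_threshold(normal, reconstruct)
healthy_flag = is_anomalous([0.51, 0.29, 0.20], reconstruct, threshold)
faulty_flag = is_anomalous([0.95, 0.90, 0.05], reconstruct, threshold)
```

Because the model is trained only on normal behavior, no labeled fault data is needed to set up the detector, which is what makes the approach attractive for per-node deployment.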
Pub Date: 2018-10-01 DOI: 10.1109/AICAS.2019.8771562
L. Cavigelli, L. Benini
After the tremendous success of convolutional neural networks in image classification, object detection, speech recognition, and other tasks, there is now rising demand for deploying these compute-intensive ML models on tightly power-constrained embedded and mobile systems at low cost, as well as for pushing the throughput in data centers. This has triggered a wave of research towards specialized hardware accelerators. Their performance is often constrained by I/O bandwidth, and their energy consumption is dominated by I/O transfers to off-chip memory. We introduce and evaluate a novel, hardware-friendly compression scheme for the feature maps present within convolutional neural networks. We show that an average compression ratio of 4.4× relative to uncompressed data and a gain of 60% over existing methods can be achieved for ResNet-34 with a compression block requiring <300 bits of sequential cells and minimal combinational logic.
{"title":"Extended Bit-Plane Compression for Convolutional Neural Network Accelerators","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771562","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
Pub Date: 2018-10-01 DOI: 10.1109/AICAS.2019.8771475
Yi Zhou, Yue Bai, S. Bhattacharyya, H. Huttunen
In this work we propose a framework for improving the performance of any deep neural network that may suffer from vanishing gradients. To address the vanishing gradient issue, we study a framework in which we insert an intermediate output branch after each layer in the computational graph and use the corresponding prediction loss to feed the gradient to the early layers. The framework, which we name Elastic network, is tested with several well-known networks on the CIFAR10 and CIFAR100 datasets, and the experimental results show that it improves the accuracy of both shallow networks (e.g., MobileNet) and deep convolutional neural networks (e.g., DenseNet). We also identify the types of networks for which the framework does not improve performance and discuss the reasons. Finally, as a side product, the computational complexity of the resulting networks can be adjusted in an elastic manner by selecting the output branch according to the current computational budget.
{"title":"Elastic Neural Networks for Classification","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771475","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
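The two ideas in the abstract above, training with a loss at every intermediate output and choosing an output branch by budget at inference time, can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: the equal default branch weights, the cost units, and the function names are assumptions.

```python
def elastic_loss(branch_losses, weights=None):
    """Total training loss as a weighted sum over all output branches.

    Equal weights are assumed by default; the abstract does not state
    how the individual branch losses are combined.
    """
    if weights is None:
        weights = [1.0] * len(branch_losses)
    return sum(w * loss for w, loss in zip(weights, branch_losses))


def pick_branch(cumulative_costs, budget):
    """Index of the deepest branch whose cumulative cost fits the budget.

    cumulative_costs must be non-decreasing (cost to reach each branch).
    Returns None when even the first branch is too expensive.
    """
    best = None
    for i, cost in enumerate(cumulative_costs):
        if cost <= budget:
            best = i
    return best


# Hypothetical per-branch losses and cumulative costs (arbitrary units).
total = elastic_loss([0.9, 0.6, 0.4])
branch = pick_branch([10, 25, 60], budget=30)
```

Here `pick_branch` selects the second branch (index 1), since reaching the third would exceed the budget of 30.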
Pub Date: 1900-01-01 DOI: 10.1109/AICAS.2019.8771581
Kai-Yen Wang, Yun-Lung Ho, Yu-De Huang, W. Fang
Emotions play a significant role in affective computing and human-computer interfaces (HCI). In this paper, we propose an intelligent human emotion detection system based on EEG features with multi-channel fused processing. We also propose an advanced convolutional neural network implemented as a VLSI hardware design. The hardware design accelerates both training and classification and meets real-time system requirements for fast emotion detection. The design was validated on the DEAP [1] database with datasets from 32 subjects, achieving a mean classification accuracy of 83.88%.
{"title":"Design of Intelligent EEG System for Human Emotion Recognition with Convolutional Neural Network","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771581","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
Pub Date: 1900-01-01 DOI: 10.1109/AICAS.2019.8771600
I-Chen Wu, Po-Tsang Huang, Chin-Yang Lo, W. Hwang
Deep convolutional neural networks (CNNs) are widely used in image recognition and feature classification. However, deep CNNs are hard to deploy fully on edge devices because their workloads are both computation-intensive and memory-intensive. The energy efficiency of CNNs is dominated by off-chip memory accesses and convolution computation. In this paper, an energy-efficient accelerator is proposed for sparse compressed CNNs that reduces DRAM accesses and eliminates zero-operand computation. Weight compression is used in sparse compressed CNNs to reduce the required memory capacity and bandwidth and to remove a large portion of connections. The ReLU function also produces zero-valued activations. Additionally, the workloads are distributed by channel to increase the degree of task parallelism, and all-row-to-all-row non-zero element multiplication is adopted to skip redundant computation. Compared with a dense accelerator, simulation results show that the proposed accelerator achieves a 1.79x speedup and reduces on-chip memory size, energy, and DRAM accesses by 23.51%, 69.53%, and 88.67%, respectively, on VGG-16.
{"title":"An Energy-Efficient Accelerator with Relative-Indexing Memory for Sparse Compressed Convolutional Neural Network","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771600","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
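Skipping zero operands with a relative index can be illustrated in software. The sketch below stores each non-zero weight together with the number of zeros skipped since the previous non-zero, then multiplies only where both the weight and the activation are non-zero. The exact memory format used by the accelerator is not described in the abstract, so this `(gap, value)` encoding and all names here are assumptions made for illustration.

```python
def encode_relative(dense):
    """Encode non-zeros as (gap, value); gap counts zeros since the last non-zero."""
    encoded, gap = [], 0
    for v in dense:
        if v == 0:
            gap += 1
        else:
            encoded.append((gap, v))
            gap = 0
    return encoded


def sparse_dot(encoded_weights, activations):
    """Dot product that visits only non-zero weights and skips zero activations.

    Returns the result and the number of multiplications actually performed.
    """
    total, pos, multiplies = 0, -1, 0
    for gap, w in encoded_weights:
        pos += gap + 1  # recover the absolute position from the relative index
        a = activations[pos]
        if a != 0:  # zero activation (e.g. after ReLU): nothing to compute
            total += w * a
            multiplies += 1
    return total, multiplies


weights = [0, 0, 3, 0, -2, 0, 0, 1]
activations = [5, 1, 0, 2, 4, 0, 7, 6]

encoded = encode_relative(weights)
result, multiplies = sparse_dot(encoded, activations)
dense_result = sum(w * a for w, a in zip(weights, activations))
```

In this example the dense dot product needs 8 multiplications while the sparse version performs 2, and both give the same result.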
Pub Date: 1900-01-01 DOI: 10.1109/AICAS.2019.8771616
Kai-Yen Wang, Yu-De Huang, Yun-Lung Ho, W. Fang
This paper proposes an improved softmax layer algorithm and its hardware implementation, applicable to an effective convolutional neural network for EEG-based real-time human emotion recognition. Compared with the general softmax layer, this hardware design adds threshold layers to accelerate training and replaces Euler’s number as the base with a dynamic base value to improve network accuracy. This work also shows a hardware-friendly way to implement a batch normalization layer on chip. On the DEAP [7] EEG emotion database, the maximum and mean classification accuracies achieved are 96.03% and 83.88%, respectively. The improved softmax layer reduces training convergence time by up to 15% and increases average accuracy by 3 to 5%.
{"title":"A Customized Convolutional Neural Network Design Using Improved Softmax Layer for Real-time Human Emotion Recognition","DOIUrl":"https://doi.org/10.1109/AICAS.2019.8771616","journal":{"name":"2019 IEEE International Conference on Artificial Intelligence Circuits and Systems (AICAS)"}}
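Replacing Euler's number with another base changes how sharply softmax separates the classes. The sketch below shows a softmax with a configurable base. It does not reproduce the paper's design: how the dynamic base is chosen and how the threshold layers work are not stated in the abstract, so the `base` parameter here is simply a caller-supplied assumption.

```python
import math


def softmax_base(logits, base=math.e):
    """Softmax computed with an arbitrary base > 1 instead of Euler's number."""
    m = max(logits)  # subtract the maximum for numerical stability
    powers = [base ** (z - m) for z in logits]
    total = sum(powers)
    return [p / total for p in powers]


logits = [1.0, 2.0, 3.0]
probs_e = softmax_base(logits)        # standard softmax
probs_2 = softmax_base(logits, 2.0)   # flatter distribution
probs_10 = softmax_base(logits, 10.0) # sharper distribution
```

With base 2 the three probabilities are 1/7, 2/7, and 4/7. A larger base concentrates more probability on the largest logit, which is equivalent to scaling the logits by the natural logarithm of the base before a standard softmax.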