Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00048
Takumi Kudo, Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Ryota Uematsu, Yuka Oba, M. Ikebe, T. Asai, M. Motomura, Shinya Takamaeda-Yamazaki
With the remarkable evolution of deep neural networks (DNNs), highly optimized DNN accelerators for edge computing that combine low hardware resource usage with high computing performance are strongly required. As is well known, DNN processing involves a large number of multiply-accumulate operations. Low-precision quantization, such as binary or logarithmic quantization, is therefore an essential technique for edge devices with strict circuit-resource and energy budgets. The required bit width depends on the characteristics of the application. Variable-bit-width architectures based on bit-serial processing have been proposed as a scalable alternative that serves different performance-accuracy trade-offs with a unified hardware structure. In this paper, we propose a well-optimized DNN hardware architecture that supports both binary and variable-bit-width logarithmic quantization. The key idea is a distributed-and-shared accumulator that processes multiple bit-serial inputs with a single accumulator, plus a low-overhead auxiliary circuit for the binary mode. The evaluation results show that this idea reduces hardware resources by 29.8% compared to the prior architecture without loss of functionality, computing speed, or recognition accuracy. Moreover, it achieves a 19.6% energy reduction on a practical DNN model, VGG-16.
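The accumulator sharing itself is an RTL-level technique; as a rough software illustration of why logarithmic quantization suits such hardware, the Python sketch below (our own, with made-up function names and a 4-bit exponent budget, not the authors' design) shows a multiply-accumulate collapsing into a shift-and-add once the weight is stored as a signed power of two.

```python
import math

def log_quantize(w, bits=4):
    """Quantize a weight to a sign and a power-of-two exponent (illustrative)."""
    if w == 0:
        return 0, 0  # real designs carry a separate zero flag
    sign = 1 if w > 0 else -1
    exp = int(round(math.log2(abs(w))))
    limit = 2 ** (bits - 1)
    return sign, max(-limit, min(exp, limit - 1))  # clamp to the bit budget

def mac_log(acc, activation, sign, exp):
    """Multiply-accumulate with a log-quantized weight: the multiplier
    disappears and only an arithmetic shifter and an adder remain."""
    shifted = activation << exp if exp >= 0 else activation >> -exp
    return acc + sign * shifted

sign, exp = log_quantize(7.3)    # rounds to 2**3
print(mac_log(0, 5, sign, exp))  # 5 << 3 = 40
```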
{"title":"Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators","authors":"Takumi Kudo, Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Ryota Uematsu, Yuka Oba, M. Ikebe, T. Asai, M. Motomura, Shinya Takamaeda-Yamazaki","doi":"10.1109/MCSoC2018.2018.00048","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00048","url":null,"abstract":"In the remarkable evolution of deep neural network (DNN), development of a highly optimized DNN accelerator for edge computing with both less hardware resource and high computing performance is strongly required. As a well-known characteristic, DNN processing involves a large number multiplication and accumulation operations. Thus, low-precision quantization, such as binary and logarithm, is an essential technique in edge computing devices with strict restriction of circuit resource and energy. Bit-width requirement in quantization depends on application characteristics. Variable bit-width architecture based on the bit-serial processing has been proposed as a scalable alternative that allows different requirements of performance and accuracy balance by a unified hardware structure. In this paper, we propose a well-optimized DNN hardware architecture with supports of binary and variable bit-width logarithmic quantization. The key idea is the distributed-and-shared accumulator that processes multiple bit-serial inputs by a single accumulator with an additional low-overhead circuit for the binary mode. The evaluation results show that the idea reduces hardware resources by 29.8% compared to the prior architecture without losing any functionality, computing speed, and recognition accuracy. Moreover, it achieves 19.6% energy reduction using a practical DNN model of VGG 16.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133094881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multikernel Design and Implementation for Improving Responsiveness of Aperiodic Tasks
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00029
Hidehito Yabuuchi, Shinichi Awamoto, Hiroyuki Chishiro, S. Kato
Modern real-time systems need to handle aperiodic tasks as efficiently as periodic ones. This paper presents a system design that applies the hybrid operating system approach to multi-core architectures. A core is dynamically and exclusively allocated to a newly booted kernel and the aperiodic task running on it, so that the task avoids overhead caused by the rest of the system, reducing its response time. We implemented and evaluated the presented design on a real multi-core architecture. The evaluation results indicate that the design improves the responsiveness of aperiodic tasks that frequently access shared resources.
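The paper's mechanism boots a separate kernel on the reserved core; as a loose, Linux-only analogy (not the authors' multikernel), the sketch below pins an aperiodic task to a reserved core with Python's standard scheduler-affinity calls, the core number being an arbitrary choice.

```python
import os

RESERVED_CORE = 3  # hypothetical core set aside for aperiodic work

def run_isolated(task, *args):
    """Run `task` pinned to the reserved core so it does not compete with
    the periodic workload scheduled on the remaining cores (Linux only)."""
    previous = os.sched_getaffinity(0)       # remember the original core set
    os.sched_setaffinity(0, {RESERVED_CORE})
    try:
        return task(*args)
    finally:
        os.sched_setaffinity(0, previous)    # hand the core back afterwards
```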
{"title":"Multikernel Design and Implementation for Improving Responsiveness of Aperiodic Tasks","authors":"Hidehito Yabuuchi, Shinichi Awamoto, Hiroyuki Chishiro, S. Kato","doi":"10.1109/MCSoC2018.2018.00029","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00029","url":null,"abstract":"Modern real-time systems need to efficiently handle aperiodic tasks as well as periodic ones. This paper presents a system design applying the hybrid operating system approach to multi-core architectures. A core is allocated exclusively and dynamically to a newly booted kernel and an aperiodic task on it so that the task can avoid overhead caused by the rest of the system, leading to reduced response time. We implemented and evaluated the presented design on a real multi-core architecture. The evaluation results indicate that the design improves responsiveness of aperiodic tasks that access shared resources frequently.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130477947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Search Space Reduction for Parameter Tuning of a Tsunami Simulation on the Intel Knights Landing Processor
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00030
K. Komatsu, Takumi Kishitani, Masayuki Sato, A. Musa, Hiroaki Kobayashi
The structures of recent computing systems have become complicated, featuring heterogeneous memory systems with deep hierarchies and many-core processors. Performance tuning is mandatory to achieve high performance of HPC applications on such systems. However, the number of tuning parameters has become large due to the complexity of both the systems and the applications. In addition, as computing systems improve, HPC applications are growing larger and more complicated, resulting in long execution times. With many tuning parameters and long individual runs, searching for an appropriate parameter combination takes a huge amount of time. This paper proposes a method to reduce that search time. By considering the characteristics of a many-core processor and of a simulation code, the search space of tuning parameters is reduced. Moreover, the time of each run during the search is shortened by limiting the simulated period of the application, which is valid as long as the application's characteristics do not change over that period. In an evaluation with a tsunami simulation code on the Intel Xeon Phi Knights Landing processor, parameter tuning achieves a 3.67x performance improvement, and the tuning time is drastically reduced by shrinking the number of parameters searched and limiting the simulated period of each run.
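As a minimal sketch of the two ideas — pruning the candidate grid with application knowledge and timing only a short simulation window — the Python below uses invented names (`run_simulation`, the grid width of 96) purely for illustration.

```python
from itertools import product

def tune(run_simulation, candidates, window_steps=100):
    """Pick the fastest parameter combination by timing only a short,
    representative window of the simulation instead of a full run."""
    best, best_time = None, float("inf")
    for params in candidates:
        elapsed = run_simulation(params, steps=window_steps)  # truncated run
        if elapsed < best_time:
            best, best_time = params, elapsed
    return best

# Characteristics-driven pruning: keep only thread counts that divide the
# (hypothetical) grid width, instead of sweeping every value.
GRID_WIDTH = 96
threads = [t for t in (8, 12, 16, 24, 32, 64) if GRID_WIDTH % t == 0]
tiles = (8, 16, 32)
candidates = list(product(threads, tiles))  # 5 x 3 = 15 runs instead of 18
```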
{"title":"Search Space Reduction for Parameter Tuning of a Tsunami Simulation on the Intel Knights Landing Processor","authors":"K. Komatsu, Takumi Kishitani, Masayuki Sato, A. Musa, Hiroaki Kobayashi","doi":"10.1109/MCSoC2018.2018.00030","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00030","url":null,"abstract":"The structures of recent computing systems have become complicated such as heterogeneous memory systems with a deep hierarchy and many core systems. To achieve high performance of HPC applications on such computing systems, performance tuning is mandatory. However, the number of tuning parameters has become large due to the complexities of the systems and applications. In addition, along with the improvement of computing systems, HPC applications are getting larger and complicated, resulting in long execution time of each application execution. Due to a large number of tuning parameters and a long time of each execution, a time to search for an appropriate tuning parameter combination becomes huge. This paper proposes a method to reduce the time to search for an appropriate tuning parameter combination. By considering the characteristics of a many-core processor and a simulation code, a search space of tuning parameters is reduced. Moreover, a time of each application execution for parameter search is reduced by limiting a simulation period of an application unless characteristics of the application are changed. Through the evaluation of performance tuning using the tsunami simulation code on the Intel Xeon Phi Knight Landing processor, it is clarified that a 3.67x performance improvement can be achieved by the parameter tuning. It is also clarified that the time for parameter tuning can drastically be saved by reducing the number of tuning parameters to be searched and limiting the simulation period of each application execution.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114368746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Code Generation of Graph-Based Vision Processing for Multiple CUDA Cores SoC Jetson TX
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00013
Elishai Ezra Tsur, Elyassaf Madar, Natan Danan
Embedded vision processing is now ingrained in many aspects of modern life, from computer-aided surgery to the navigation of unmanned aerial vehicles. Vision processing can be described with coarse-grained data-flow graphs, a representation standardized by OpenVX to enable both system-level and kernel-level optimization through separation of concerns. Notably, a graph-based specification provides a gateway to a code-generation engine that can produce optimized, hardware-specific code for deployment. Here we provide an algorithm, and a Java MVC-based implementation, of an automated code-generation engine for OpenVX-based vision applications, tailored to the NVIDIA Jetson TX, an SoC with multiple CUDA cores. Our algorithm pre-processes the graph, translates it into an ordered, layer-oriented data model, and produces C code that is optimized for the Jetson TX1 and includes error checking and iterative execution for real-time vision processing.
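The layer-ordered data model is essentially a topological layering of the data-flow graph; a minimal Python sketch of that step and of emitting straight-line C calls follows, with the kernel names and the `CHECK` macro invented for illustration.

```python
from collections import defaultdict

def layerize(nodes, edges):
    """Order a dataflow graph into layers (Kahn-style topological sort);
    every node in a layer depends only on nodes in earlier layers."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    layers, ready = [], [n for n in nodes if indeg[n] == 0]
    while ready:
        layers.append(ready)
        nxt = []
        for n in ready:
            for m in succ[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    nxt.append(m)
        ready = nxt
    return layers

# Emit C calls from the layered model (illustrative pipeline of three kernels):
for layer in layerize(["gauss", "sobel", "thresh"],
                      [("gauss", "sobel"), ("sobel", "thresh")]):
    for kernel in layer:
        print(f"CHECK(run_{kernel}(frame));  // generated call with error check")
```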
{"title":"Code Generation of Graph-Based Vision Processing for Multiple CUDA Cores SoC Jetson TX","authors":"Elishai Ezra Tsur, Elyassaf Madar, Natan Danan","doi":"10.1109/MCSoC2018.2018.00013","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00013","url":null,"abstract":"Embedded vision processing is currently ingrained into many aspects of modern life, from computer-aided surgeries to navigation of unmanned aerial vehicles. Vision processing can be described using coarse-grained data flow graphs, which were standardized by OpenVX to enable both system and kernel level optimization via separation of concerns. Notably, graph-based specification provides a gateway to a code generation engine, which can produce an optimized, hardware-specific code for deployment. Here we provide an algorithm and JAVA-MVC-based implementation of automated code generation engine for OpenVX-based vision applications, tailored to NVIDIA multiple CUDA Cores SoC Jetson TX. Our algorithm pre-processes the graph, translates it into an ordered layer-oriented data model, and produces C code, which is optimized for the Jetson TX1 and comprised of error checking and iterative execution for real time vision processing.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123770250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Parallel Hardware Scheme for Solving the N-Queens Problem
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00015
Yuuma Azuma, H. Sakagami, Kenji Kise
The N-Queens problem is a generalization of the 8-Queens puzzle, and its computational complexity grows drastically with N. Solving currently unsolved instances in realistic time therefore requires a high-speed solver and system. Efficient search methods based on backtracking, bit operations, and similar techniques have been introduced, as have parallelization schemes that place several queens in advance to generate a large number of subproblems. In the state-of-the-art system, many solver modules are implemented on several FPGAs to solve these subproblems. In this paper, we propose two methods that enable further large-scale parallelization with realistic hardware resources. One reduces the hardware usage of a solver module by using an encoder and a decoder for the crucial data structure. The other is an efficient method for distributing the subproblems to the solver modules and collecting the resulting counts from them. Together, these methods make it possible to implement more solver modules on a single FPGA. The evaluation results show that the proposed system, implementing 700 solver modules, achieves 2.58x the performance of the previous work.
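The bit-operation backtracking the abstract refers to is, in software form, the classic bitboard search below (a well-known algorithm, not the authors' RTL); each set bit marks a column or diagonal attacked on the current row, and the subproblems mentioned above correspond to fixing the first few rows before recursing.

```python
def count_solutions(n, cols=0, diag1=0, diag2=0):
    """Count N-Queens solutions by bit-operation backtracking, one row
    per recursion level."""
    if cols == (1 << n) - 1:                 # every column filled: a solution
        return 1
    total = 0
    free = ~(cols | diag1 | diag2) & ((1 << n) - 1)  # safe squares in this row
    while free:
        bit = free & -free                   # lowest safe square
        free ^= bit
        total += count_solutions(n, cols | bit,
                                 ((diag1 | bit) << 1) & ((1 << n) - 1),
                                 (diag2 | bit) >> 1)
    return total

assert count_solutions(8) == 92  # the 8-Queens puzzle
```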
{"title":"An Efficient Parallel Hardware Scheme for Solving the N-Queens Problem","authors":"Yuuma Azuma, H. Sakagami, Kenji Kise","doi":"10.1109/MCSoC2018.2018.00015","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00015","url":null,"abstract":"The N-Queens problem is a generalized problem with the 8-Queens puzzle. The computational complexity of this problem is increased drastically when increasing N. To calculate the unsolved N-Queens problem in realistic time, implementing the high-speed solver and system is important. Therefore, efficient search methods of solutions by backtracking, bit operation, etc. have been introduced. Also, parallelization schemes of searching for solutions by arranging several queens in advance and gen-erating a large number of subproblems have been introduced. In the state-of-the-art system, to solve such subproblems a lot of solver modules are implemented on several FPGAs. In this paper, we propose two methods to enable further large-scale parallelization with realistic hardware resources. One is a method to reduce the hardware usage of a solver module using an encoder and a decoder for the crucial data structure. The other is an efficient method for distributing the subproblems to each solver module and collecting the resulting counts from each solver module. Through these methods, it is possible to increase the number of solver modules to be implemented on an FPGA. The evaluation results show that the performance of the proposed system implementing 700 solver modules achieves 2.58x of the previous work.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Evaluation of a Configurable Hardware Merge Sorter for Various Output Records
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00041
E. Elsayed, Kenji Kise
Sorting is a fundamental operation in many applications such as image processing and databases, and much research has been devoted to improving its performance. One of the most promising techniques is the FPGA-based hardware merge sorter (HMS). While previous studies on HMS achieved very high throughput, most of them can output only a power-of-two number of records per clock cycle. Moreover, configurations that output more than 32 records per clock cycle could not be evaluated due to hardware resource limitations. In this paper, we propose an HMS architecture that can be configured to output not only power-of-two record counts but arbitrary ones, e.g., 3, 7, and 12. In addition, our proposed HMS can be configured to output more than 32 records per clock cycle, such as 40, 48, and 56. Finally, we evaluate the performance of different key and data width configurations required by different sorting applications.
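As a behavioral model only (the paper's contribution is the FPGA architecture, not this code), the Python sketch below shows what "e records per clock cycle" means for a non-power-of-two e, using a lazy two-way merge and slicing off e records per step.

```python
import heapq
from itertools import islice

def merge_in_steps(left, right, e=3):
    """Model of a merge sorter emitting e records per step (per 'clock
    cycle'); e = 3 demonstrates a non-power-of-two output width."""
    stream = heapq.merge(left, right)   # lazy two-way merge of sorted inputs
    while True:
        step = list(islice(stream, e))
        if not step:
            return
        yield step                       # one cycle's worth of output records

for cycle, records in enumerate(merge_in_steps([1, 4, 5, 9], [2, 3, 7, 8])):
    print(cycle, records)  # 0 [1, 2, 3] / 1 [4, 5, 7] / 2 [8, 9]
```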
{"title":"Design and Evaluation of a Configurable Hardware Merge Sorter for Various Output Records","authors":"E. Elsayed, Kenji Kise","doi":"10.1109/MCSoC2018.2018.00041","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00041","url":null,"abstract":"Sorting is one of the fundamental operations that are important in many applications such as image processing and database. Many researches have been developed to improve the performance of sorting. One of the most promising techniques is FPGA-based hardware merge sorters (HMS). While previous studies on HMS achieved a very high throughput, most of them could output only power of two records per clock cycle. Moreover, they couldn't evaluate the performance of HMS configuration that outputs more than 32 records per clock cycle due to hardware resources limitation. In this paper, we propose an HMS architecture that can be configured to output not only power of two records but various outputs e.g., 3, 7, and 12. In addition, our proposed HMS can be configured to output more than 32 records such as 40, 48, and 56 records per clock cycle. Finally, we study the performance evaluation for different configurations of key and data widths that can be required by different sorting applications.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"29 12","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113955272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}