Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00048
Takumi Kudo, Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Ryota Uematsu, Yuka Oba, M. Ikebe, T. Asai, M. Motomura, Shinya Takamaeda-Yamazaki
With the remarkable evolution of deep neural networks (DNNs), highly optimized DNN accelerators for edge computing that combine low hardware resource usage with high computing performance are strongly required. As is well known, DNN processing involves a large number of multiply-accumulate operations. Low-precision quantization, such as binary or logarithmic quantization, is therefore an essential technique for edge devices with strict circuit-resource and energy budgets. The required bit width depends on the characteristics of the application. Variable-bit-width architectures based on bit-serial processing have been proposed as a scalable alternative that serves different performance-accuracy trade-offs with a unified hardware structure. In this paper, we propose a well-optimized DNN hardware architecture that supports both binary and variable-bit-width logarithmic quantization. The key idea is a distributed-and-shared accumulator that processes multiple bit-serial inputs with a single accumulator, plus a low-overhead auxiliary circuit for the binary mode. The evaluation results show that this idea reduces hardware resources by 29.8% compared to the prior architecture without loss of functionality, computing speed, or recognition accuracy. Moreover, it achieves a 19.6% energy reduction on a practical DNN model, VGG-16.
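The accumulator sharing itself is an RTL-level technique; as a rough software illustration of why logarithmic quantization suits such hardware, the Python sketch below (our own, with made-up function names and a 4-bit exponent budget, not the authors' design) shows a multiply-accumulate collapsing into a shift-and-add once the weight is stored as a signed power of two.

```python
import math

def log_quantize(w, bits=4):
    """Quantize a weight to a sign and a power-of-two exponent (illustrative)."""
    if w == 0:
        return 0, 0  # real designs carry a separate zero flag
    sign = 1 if w > 0 else -1
    exp = int(round(math.log2(abs(w))))
    limit = 2 ** (bits - 1)
    return sign, max(-limit, min(exp, limit - 1))  # clamp to the bit budget

def mac_log(acc, activation, sign, exp):
    """Multiply-accumulate with a log-quantized weight: the multiplier
    disappears and only an arithmetic shifter and an adder remain."""
    shifted = activation << exp if exp >= 0 else activation >> -exp
    return acc + sign * shifted

sign, exp = log_quantize(7.3)    # rounds to 2**3
print(mac_log(0, 5, sign, exp))  # 5 << 3 = 40
```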
{"title":"Area and Energy Optimization for Bit-Serial Log-Quantized DNN Accelerator with Shared Accumulators","authors":"Takumi Kudo, Kodai Ueyoshi, Kota Ando, Kazutoshi Hirose, Ryota Uematsu, Yuka Oba, M. Ikebe, T. Asai, M. Motomura, Shinya Takamaeda-Yamazaki","doi":"10.1109/MCSoC2018.2018.00048","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00048","url":null,"abstract":"In the remarkable evolution of deep neural network (DNN), development of a highly optimized DNN accelerator for edge computing with both less hardware resource and high computing performance is strongly required. As a well-known characteristic, DNN processing involves a large number multiplication and accumulation operations. Thus, low-precision quantization, such as binary and logarithm, is an essential technique in edge computing devices with strict restriction of circuit resource and energy. Bit-width requirement in quantization depends on application characteristics. Variable bit-width architecture based on the bit-serial processing has been proposed as a scalable alternative that allows different requirements of performance and accuracy balance by a unified hardware structure. In this paper, we propose a well-optimized DNN hardware architecture with supports of binary and variable bit-width logarithmic quantization. The key idea is the distributed-and-shared accumulator that processes multiple bit-serial inputs by a single accumulator with an additional low-overhead circuit for the binary mode. The evaluation results show that the idea reduces hardware resources by 29.8% compared to the prior architecture without losing any functionality, computing speed, and recognition accuracy. Moreover, it achieves 19.6% energy reduction using a practical DNN model of VGG 16.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133094881","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multikernel Design and Implementation for Improving Responsiveness of Aperiodic Tasks
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00029
Hidehito Yabuuchi, Shinichi Awamoto, Hiroyuki Chishiro, S. Kato
Modern real-time systems need to handle aperiodic tasks as efficiently as periodic ones. This paper presents a system design that applies the hybrid operating system approach to multi-core architectures. A core is dynamically and exclusively allocated to a newly booted kernel and the aperiodic task running on it, so that the task avoids overhead caused by the rest of the system, reducing its response time. We implemented and evaluated the presented design on a real multi-core architecture. The evaluation results indicate that the design improves the responsiveness of aperiodic tasks that frequently access shared resources.
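The paper's mechanism boots a separate kernel on the reserved core; as a loose, Linux-only analogy (not the authors' multikernel), the sketch below pins an aperiodic task to a reserved core with Python's standard scheduler-affinity calls, the core number being an arbitrary choice.

```python
import os

RESERVED_CORE = 3  # hypothetical core set aside for aperiodic work

def run_isolated(task, *args):
    """Run `task` pinned to the reserved core so it does not compete with
    the periodic workload scheduled on the remaining cores (Linux only)."""
    previous = os.sched_getaffinity(0)       # remember the original core set
    os.sched_setaffinity(0, {RESERVED_CORE})
    try:
        return task(*args)
    finally:
        os.sched_setaffinity(0, previous)    # hand the core back afterwards
```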
{"title":"Multikernel Design and Implementation for Improving Responsiveness of Aperiodic Tasks","authors":"Hidehito Yabuuchi, Shinichi Awamoto, Hiroyuki Chishiro, S. Kato","doi":"10.1109/MCSoC2018.2018.00029","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00029","url":null,"abstract":"Modern real-time systems need to efficiently handle aperiodic tasks as well as periodic ones. This paper presents a system design applying the hybrid operating system approach to multi-core architectures. A core is allocated exclusively and dynamically to a newly booted kernel and an aperiodic task on it so that the task can avoid overhead caused by the rest of the system, leading to reduced response time. We implemented and evaluated the presented design on a real multi-core architecture. The evaluation results indicate that the design improves responsiveness of aperiodic tasks that access shared resources frequently.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130477947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Search Space Reduction for Parameter Tuning of a Tsunami Simulation on the Intel Knights Landing Processor
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00030
K. Komatsu, Takumi Kishitani, Masayuki Sato, A. Musa, Hiroaki Kobayashi
The structures of recent computing systems have become complicated, featuring heterogeneous memory systems with deep hierarchies and many-core processors. Performance tuning is mandatory to achieve high performance of HPC applications on such systems. However, the number of tuning parameters has become large due to the complexity of both the systems and the applications. In addition, as computing systems improve, HPC applications are growing larger and more complicated, resulting in long execution times. With many tuning parameters and long individual runs, searching for an appropriate parameter combination takes a huge amount of time. This paper proposes a method to reduce that search time. By considering the characteristics of a many-core processor and of a simulation code, the search space of tuning parameters is reduced. Moreover, the time of each run during the search is shortened by limiting the simulated period of the application, which is valid as long as the application's characteristics do not change over that period. In an evaluation with a tsunami simulation code on the Intel Xeon Phi Knights Landing processor, parameter tuning achieves a 3.67x performance improvement, and the tuning time is drastically reduced by shrinking the number of parameters searched and limiting the simulated period of each run.
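As a minimal sketch of the two ideas — pruning the candidate grid with application knowledge and timing only a short simulation window — the Python below uses invented names (`run_simulation`, the grid width of 96) purely for illustration.

```python
from itertools import product

def tune(run_simulation, candidates, window_steps=100):
    """Pick the fastest parameter combination by timing only a short,
    representative window of the simulation instead of a full run."""
    best, best_time = None, float("inf")
    for params in candidates:
        elapsed = run_simulation(params, steps=window_steps)  # truncated run
        if elapsed < best_time:
            best, best_time = params, elapsed
    return best

# Characteristics-driven pruning: keep only thread counts that divide the
# (hypothetical) grid width, instead of sweeping every value.
GRID_WIDTH = 96
threads = [t for t in (8, 12, 16, 24, 32, 64) if GRID_WIDTH % t == 0]
tiles = (8, 16, 32)
candidates = list(product(threads, tiles))  # 5 x 3 = 15 runs instead of 18
```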
{"title":"Search Space Reduction for Parameter Tuning of a Tsunami Simulation on the Intel Knights Landing Processor","authors":"K. Komatsu, Takumi Kishitani, Masayuki Sato, A. Musa, Hiroaki Kobayashi","doi":"10.1109/MCSoC2018.2018.00030","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00030","url":null,"abstract":"The structures of recent computing systems have become complicated such as heterogeneous memory systems with a deep hierarchy and many core systems. To achieve high performance of HPC applications on such computing systems, performance tuning is mandatory. However, the number of tuning parameters has become large due to the complexities of the systems and applications. In addition, along with the improvement of computing systems, HPC applications are getting larger and complicated, resulting in long execution time of each application execution. Due to a large number of tuning parameters and a long time of each execution, a time to search for an appropriate tuning parameter combination becomes huge. This paper proposes a method to reduce the time to search for an appropriate tuning parameter combination. By considering the characteristics of a many-core processor and a simulation code, a search space of tuning parameters is reduced. Moreover, a time of each application execution for parameter search is reduced by limiting a simulation period of an application unless characteristics of the application are changed. Through the evaluation of performance tuning using the tsunami simulation code on the Intel Xeon Phi Knight Landing processor, it is clarified that a 3.67x performance improvement can be achieved by the parameter tuning. It is also clarified that the time for parameter tuning can drastically be saved by reducing the number of tuning parameters to be searched and limiting the simulation period of each application execution.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"52 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114368746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Code Generation of Graph-Based Vision Processing for Multiple CUDA Cores SoC Jetson TX
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00013
Elishai Ezra Tsur, Elyassaf Madar, Natan Danan
Embedded vision processing is now ingrained in many aspects of modern life, from computer-aided surgery to the navigation of unmanned aerial vehicles. Vision processing can be described with coarse-grained data-flow graphs, a representation standardized by OpenVX to enable both system-level and kernel-level optimization through separation of concerns. Notably, a graph-based specification provides a gateway to a code-generation engine that can produce optimized, hardware-specific code for deployment. Here we provide an algorithm, and a Java MVC-based implementation, of an automated code-generation engine for OpenVX-based vision applications, tailored to the NVIDIA Jetson TX, an SoC with multiple CUDA cores. Our algorithm pre-processes the graph, translates it into an ordered, layer-oriented data model, and produces C code that is optimized for the Jetson TX1 and includes error checking and iterative execution for real-time vision processing.
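The layer-ordered data model is essentially a topological layering of the data-flow graph; a minimal Python sketch of that step and of emitting straight-line C calls follows, with the kernel names and the `CHECK` macro invented for illustration.

```python
from collections import defaultdict

def layerize(nodes, edges):
    """Order a dataflow graph into layers (Kahn-style topological sort);
    every node in a layer depends only on nodes in earlier layers."""
    indeg = {n: 0 for n in nodes}
    succ = defaultdict(list)
    for src, dst in edges:
        succ[src].append(dst)
        indeg[dst] += 1
    layers, ready = [], [n for n in nodes if indeg[n] == 0]
    while ready:
        layers.append(ready)
        nxt = []
        for n in ready:
            for m in succ[n]:
                indeg[m] -= 1
                if indeg[m] == 0:
                    nxt.append(m)
        ready = nxt
    return layers

# Emit C calls from the layered model (illustrative pipeline of three kernels):
for layer in layerize(["gauss", "sobel", "thresh"],
                      [("gauss", "sobel"), ("sobel", "thresh")]):
    for kernel in layer:
        print(f"CHECK(run_{kernel}(frame));  // generated call with error check")
```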
{"title":"Code Generation of Graph-Based Vision Processing for Multiple CUDA Cores SoC Jetson TX","authors":"Elishai Ezra Tsur, Elyassaf Madar, Natan Danan","doi":"10.1109/MCSoC2018.2018.00013","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00013","url":null,"abstract":"Embedded vision processing is currently ingrained into many aspects of modern life, from computer-aided surgeries to navigation of unmanned aerial vehicles. Vision processing can be described using coarse-grained data flow graphs, which were standardized by OpenVX to enable both system and kernel level optimization via separation of concerns. Notably, graph-based specification provides a gateway to a code generation engine, which can produce an optimized, hardware-specific code for deployment. Here we provide an algorithm and JAVA-MVC-based implementation of automated code generation engine for OpenVX-based vision applications, tailored to NVIDIA multiple CUDA Cores SoC Jetson TX. Our algorithm pre-processes the graph, translates it into an ordered layer-oriented data model, and produces C code, which is optimized for the Jetson TX1 and comprised of error checking and iterative execution for real time vision processing.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123770250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Efficient Parallel Hardware Scheme for Solving the N-Queens Problem
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00015
Yuuma Azuma, H. Sakagami, Kenji Kise
The N-Queens problem is a generalization of the 8-Queens puzzle, and its computational complexity grows drastically with N. Solving currently unsolved instances in realistic time therefore requires a high-speed solver and system. Efficient search methods based on backtracking, bit operations, and similar techniques have been introduced, as have parallelization schemes that place several queens in advance to generate a large number of subproblems. In the state-of-the-art system, many solver modules are implemented on several FPGAs to solve these subproblems. In this paper, we propose two methods that enable further large-scale parallelization with realistic hardware resources. One reduces the hardware usage of a solver module by using an encoder and a decoder for the crucial data structure. The other is an efficient method for distributing the subproblems to the solver modules and collecting the resulting counts from them. Together, these methods make it possible to implement more solver modules on a single FPGA. The evaluation results show that the proposed system, implementing 700 solver modules, achieves 2.58x the performance of the previous work.
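The bit-operation backtracking the abstract refers to is, in software form, the classic bitboard search below (a well-known algorithm, not the authors' RTL); each set bit marks a column or diagonal attacked on the current row, and the subproblems mentioned above correspond to fixing the first few rows before recursing.

```python
def count_solutions(n, cols=0, diag1=0, diag2=0):
    """Count N-Queens solutions by bit-operation backtracking, one row
    per recursion level."""
    if cols == (1 << n) - 1:                 # every column filled: a solution
        return 1
    total = 0
    free = ~(cols | diag1 | diag2) & ((1 << n) - 1)  # safe squares in this row
    while free:
        bit = free & -free                   # lowest safe square
        free ^= bit
        total += count_solutions(n, cols | bit,
                                 ((diag1 | bit) << 1) & ((1 << n) - 1),
                                 (diag2 | bit) >> 1)
    return total

assert count_solutions(8) == 92  # the 8-Queens puzzle
```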
{"title":"An Efficient Parallel Hardware Scheme for Solving the N-Queens Problem","authors":"Yuuma Azuma, H. Sakagami, Kenji Kise","doi":"10.1109/MCSoC2018.2018.00015","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00015","url":null,"abstract":"The N-Queens problem is a generalized problem with the 8-Queens puzzle. The computational complexity of this problem is increased drastically when increasing N. To calculate the unsolved N-Queens problem in realistic time, implementing the high-speed solver and system is important. Therefore, efficient search methods of solutions by backtracking, bit operation, etc. have been introduced. Also, parallelization schemes of searching for solutions by arranging several queens in advance and gen-erating a large number of subproblems have been introduced. In the state-of-the-art system, to solve such subproblems a lot of solver modules are implemented on several FPGAs. In this paper, we propose two methods to enable further large-scale parallelization with realistic hardware resources. One is a method to reduce the hardware usage of a solver module using an encoder and a decoder for the crucial data structure. The other is an efficient method for distributing the subproblems to each solver module and collecting the resulting counts from each solver module. Through these methods, it is possible to increase the number of solver modules to be implemented on an FPGA. The evaluation results show that the performance of the proposed system implementing 700 solver modules achieves 2.58x of the previous work.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130839955","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Design and Evaluation of a Configurable Hardware Merge Sorter for Various Output Records
2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC) | Pub Date: 2018-09-01 | DOI: 10.1109/MCSoC2018.2018.00041
E. Elsayed, Kenji Kise
Sorting is a fundamental operation in many applications such as image processing and databases, and much research has been devoted to improving its performance. One of the most promising techniques is the FPGA-based hardware merge sorter (HMS). While previous studies on HMS achieved very high throughput, most of them can output only a power-of-two number of records per clock cycle. Moreover, configurations that output more than 32 records per clock cycle could not be evaluated due to hardware resource limitations. In this paper, we propose an HMS architecture that can be configured to output not only power-of-two record counts but arbitrary ones, e.g., 3, 7, and 12. In addition, our proposed HMS can be configured to output more than 32 records per clock cycle, such as 40, 48, and 56. Finally, we evaluate the performance of different key and data width configurations required by different sorting applications.
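As a behavioral model only (the paper's contribution is the FPGA architecture, not this code), the Python sketch below shows what "e records per clock cycle" means for a non-power-of-two e, using a lazy two-way merge and slicing off e records per step.

```python
import heapq
from itertools import islice

def merge_in_steps(left, right, e=3):
    """Model of a merge sorter emitting e records per step (per 'clock
    cycle'); e = 3 demonstrates a non-power-of-two output width."""
    stream = heapq.merge(left, right)   # lazy two-way merge of sorted inputs
    while True:
        step = list(islice(stream, e))
        if not step:
            return
        yield step                       # one cycle's worth of output records

for cycle, records in enumerate(merge_in_steps([1, 4, 5, 9], [2, 3, 7, 8])):
    print(cycle, records)  # 0 [1, 2, 3] / 1 [4, 5, 7] / 2 [8, 9]
```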
{"title":"Design and Evaluation of a Configurable Hardware Merge Sorter for Various Output Records","authors":"E. Elsayed, Kenji Kise","doi":"10.1109/MCSoC2018.2018.00041","DOIUrl":"https://doi.org/10.1109/MCSoC2018.2018.00041","url":null,"abstract":"Sorting is one of the fundamental operations that are important in many applications such as image processing and database. Many researches have been developed to improve the performance of sorting. One of the most promising techniques is FPGA-based hardware merge sorters (HMS). While previous studies on HMS achieved a very high throughput, most of them could output only power of two records per clock cycle. Moreover, they couldn't evaluate the performance of HMS configuration that outputs more than 32 records per clock cycle due to hardware resources limitation. In this paper, we propose an HMS architecture that can be configured to output not only power of two records but various outputs e.g., 3, 7, and 12. In addition, our proposed HMS can be configured to output more than 32 records such as 40, 48, and 56 records per clock cycle. Finally, we study the performance evaluation for different configurations of key and data widths that can be required by different sorting applications.","PeriodicalId":413836,"journal":{"name":"2018 IEEE 12th International Symposium on Embedded Multicore/Many-core Systems-on-Chip (MCSoC)","volume":"29 12","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"113955272","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}