Latest publications: 2017 IEEE 28th International Conference on Application-specific Systems, Architectures and Processors (ASAP)

MicRun: A framework for scale-free graph algorithms on SIMD architecture of the Xeon Phi
Jie Lin, Q. Wu, Yusong Tan, Jie Yu, Qi Zhang, Xiaoling Li, Lei Luo
Graph algorithms play increasingly important roles, especially in social-network and language-modeling scenarios. Recently, accelerating graph algorithms on heterogeneous high-performance computers with many integrated cores and wide SIMD lanes has become mainstream. However, existing methods, limited by inefficient grouping strategies and non-optimized selection of the graph tile size, fall well short of expectations in many ways. Moreover, few convenient integrated tools exist for deploying graph algorithms on the MIC architecture. In this paper, we propose MicRun, a high-efficiency framework that can flexibly run graph algorithms on the SIMD architecture of the Xeon Phi. MicRun has two key components: the Bucket Grouping module and the Auto-tuning module. In the Grouping module, an optimization algorithm splits graph tiles into conflict-free groups that can be processed directly with SIMD parallelism. In the Auto-tuning module, a novel strategy optimizes the tile size to boost the execution efficiency of graph computation. MicRun currently supports the Bellman-Ford and PageRank algorithms, and we conduct extensive validation experiments on it. Experimental results show that MicRun outperforms existing mechanisms in both storage and time overhead; both graph algorithms achieve an average speedup of 1.1× with MicRun compared with the state of the art.
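The conflict-free grouping idea behind a Bucket Grouping module can be sketched in pure Python (a hypothetical illustration, not the paper's implementation): two edges whose updates target the same destination vertex cannot share a SIMD group, because their lane-parallel writes would race.

```python
# Illustrative sketch of conflict-free edge grouping for SIMD processing.
# Names (`group_edges`, `lanes`) are assumptions, not the paper's API.
def group_edges(edges, lanes=8):
    """Greedily pack (src, dst) edges into groups of at most `lanes`
    edges with pairwise-distinct destinations, so one SIMD lane per
    edge can update its destination without write conflicts."""
    groups = []
    for src, dst in edges:
        for g in groups:
            if len(g) < lanes and all(d != dst for _, d in g):
                g.append((src, dst))
                break
        else:
            groups.append([(src, dst)])
    return groups

edges = [(0, 1), (2, 1), (3, 4), (5, 1), (6, 4)]
groups = group_edges(edges, lanes=4)
# Within each group all destinations are distinct, so a Bellman-Ford
# relaxation or PageRank accumulation over a group is race-free.
```

A real implementation would operate on graph tiles and emit vectorized kernels, but the invariant is the same: no two edges in a group share a destination.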
Citations: 5
High-Level Synthesis for side-channel defense
S. T. C. Konigsmark, Deming Chen, Martin D. F. Wong
The Internet of Things (IoT) and cloud computing rely on strong confidence in the security of confidential or highly privacy-sensitive data. Side-channel leakage is therefore an important threat, but countermeasures require expert-level security knowledge to apply efficiently, which limits adoption. This work addresses that need by presenting the first High-Level Synthesis (HLS) flow whose primary focus is side-channel leakage reduction. Minimal security annotation of the high-level C code is sufficient to perform automatic analysis of security-critical operations, with corresponding insertion of countermeasures. Additionally, imbalanced branches are detected and corrected. For practicality, the flow can meet both resource and information-leakage constraints. The presented flow is extensively evaluated on established HLS benchmarks and a general IoT benchmark. Under identical resource constraints, leakage is reduced by 32% to 72% compared to the reference. Under a leakage target, the constraints are achieved with 31% to 81% less resource overhead.
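The branch-balancing countermeasure can be illustrated with a constant-time select (an illustrative sketch in Python; the paper's flow transforms HLS-generated hardware, not software): both paths are computed unconditionally and the result is chosen by masking, so the work performed no longer depends on the secret bit.

```python
# Illustrative contrast between a secret-dependent branch and a
# balanced, constant-time select. Function names are assumptions.
def select_branchy(bit, a, b):
    # Leaky: the amount of work differs between the two paths.
    if bit:
        return a * 3 + 1
    return b

def select_balanced(bit, a, b):
    # Balanced: compute both candidate results, then pick via masking.
    taken = a * 3 + 1
    not_taken = b
    # For bit in {0, 1}: -1 acts as an all-ones mask on Python's
    # arbitrary-precision ints (valid for non-negative values).
    mask = -bit
    return (taken & mask) | (not_taken & ~mask)
```

Both functions return the same values, but `select_balanced` performs identical operations regardless of `bit`, which is the property a side-channel-aware flow tries to establish automatically.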
Citations: 10
CATERPILLAR: Coarse Grain Reconfigurable Architecture for accelerating the training of Deep Neural Networks
Yuanfang Li, A. Pedram
Accelerating the inference of a trained DNN is a well-studied subject. In this paper we shift the focus to the training of DNNs. The training phase is compute-intensive, demands complicated data communication, and contains multiple levels of data dependency and parallelism. This paper presents an algorithm/architecture space exploration of efficient accelerators to achieve better network convergence rates and higher energy efficiency for training DNNs. We further demonstrate that an architecture with hierarchical support for collective-communication semantics provides flexibility in training various networks with both stochastic and batched gradient-descent techniques. Our results suggest that smaller networks favor non-batched techniques, while performance for larger networks is higher using batched operations. At 45 nm technology, CATERPILLAR achieves performance efficiencies of 177 GFLOPS/W at over 80% utilization for SGD training on small networks and 211 GFLOPS/W at over 90% utilization for pipelined SGD/CP training on larger networks, using total areas of 103.2 mm² and 178.9 mm², respectively.
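The contrast between non-batched (stochastic) and batched gradient descent that the evaluation explores can be sketched on a toy linear model (data and names here are illustrative, not the paper's workloads): SGD applies one weight update per example, while the batched variant averages the gradients and applies one update per pass.

```python
# Toy one-parameter model y = w*x, trained both ways for illustration.
def grad(w, x, y):
    # Gradient of the squared error 0.5 * (w*x - y)**2 w.r.t. w.
    return (w * x - y) * x

def sgd_epoch(w, data, lr=0.1):
    # Non-batched: one update per training example.
    for x, y in data:
        w -= lr * grad(w, x, y)
    return w

def batched_step(w, data, lr=0.1):
    # Batched: average the gradients, then apply a single update.
    g = sum(grad(w, x, y) for x, y in data) / len(data)
    return w - lr * g

data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]   # fits y = 2x exactly
w_sgd, w_batch = 0.0, 0.0
for _ in range(200):
    w_sgd = sgd_epoch(w_sgd, data)
    w_batch = batched_step(w_batch, data)
# Both converge toward w = 2.0; SGD performs 3 updates per pass,
# the batched variant only 1 (but each update aggregates more data).
```

The hardware-relevant difference is exactly this update pattern: frequent small updates stress communication, while batched updates expose more parallelism per step.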
Citations: 19
Parallel Multi Channel convolution using General Matrix Multiplication
Aravind Vasudevan, Andrew Anderson, David Gregg
Convolutional neural networks (CNNs) have emerged as one of the most successful machine learning technologies for image and video processing. The most computationally intensive parts of CNNs are the convolutional layers, which convolve multi-channel images with multiple kernels. A common approach to implementing convolutional layers is to expand the image into a column matrix (im2col) and perform Multiple Channel Multiple Kernel (MCMK) convolution using an existing parallel General Matrix Multiplication (GEMM) library. This im2col conversion greatly increases the memory footprint of the input matrix and reduces data locality. In this paper we propose a new approach to MCMK convolution that is based on GEMM, but not on im2col. Our algorithm eliminates the need for data replication on the input, thereby enabling us to apply the convolution kernels to the input images directly. We have implemented several variants of our algorithm on a CPU and on an embedded ARM processor. On the CPU, our algorithm is faster than im2col in most cases.
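The im2col + GEMM baseline that the paper improves on can be sketched in a few lines of pure Python (an illustrative single-channel, single-kernel case; real implementations handle multi-channel tensors and call a tuned GEMM library): each k×k patch of the image becomes one row of a matrix, so convolution reduces to a matrix product — at the cost of replicating every input pixel up to k×k times.

```python
# Illustrative im2col-based convolution; names are assumptions.
def im2col(img, k):
    """Flatten every k*k patch of a 2-D image into one row."""
    h, w = len(img), len(img[0])
    return [[img[i + di][j + dj] for di in range(k) for dj in range(k)]
            for i in range(h - k + 1)
            for j in range(w - k + 1)]

def conv_gemm(img, kernel):
    """Convolve via the GEMM formulation: each output pixel is the
    dot product of one im2col row with the flattened kernel."""
    k = len(kernel)
    flat = [kernel[di][dj] for di in range(k) for dj in range(k)]
    return [sum(a * b for a, b in zip(row, flat)) for row in im2col(img, k)]

img = [[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]]
out = conv_gemm(img, [[1, 0], [0, 1]])   # 2x2 diagonal kernel
# out == [1+5, 2+6, 4+8, 5+9] == [6, 8, 12, 14]
```

Note how `im2col(img, 2)` stores 16 values for a 9-pixel image — precisely the memory-footprint and locality cost the paper's direct-GEMM approach avoids.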
Citations: 112
Hardwiring the OS kernel into a Java application processor
Chun-Jen Tsai, Cheng J. Lin, Cheng-Yang Chen, Yan-Hung Lin, Wei-Jhong Ji, Sheng-Di Hong
This paper presents the design and implementation of hardwired OS-kernel circuitry inside a Java application processor to provide the system services that are traditionally implemented in software. The hardwired system functions in the proposed SoC include the thread manager, the memory manager, and the I/O subsystem interface. Making the OS kernel a hardware component has many advantages, such as a fast system boot time, highly efficient single-core multi-thread context-switching performance, and better potential for supporting a complex multi-level memory subsystem. In addition, since the target application processor in this paper is based on a Java processor, the system is not susceptible to the stack- and pointer-based security attacks that are common on register-based processors. Full-system performance evaluations on an FPGA show that the proposed system is very promising for deeply embedded multi-thread applications.
Citations: 0