Efficient FeFET Crossbar Accelerator for Binary Neural Networks
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00027
T. Soliman, R. Olivo, T. Kirchner, Cecilia De la Parra, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn
This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerating Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization, and dense-layer processing through a grid of small crossbars with reduced unit size, which enables multi-bit operation and value accumulation. Additionally, we explore possible parallelization of operations to maximize computational performance. Simulation results show that the new architecture reaches a computing performance of up to 2.46 TOPS and a power efficiency of 111.8 TOPS/W within an area of 0.026 mm² in 22 nm FDSOI technology.
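For background, the arithmetic such crossbars accumulate is the standard BNN XNOR-popcount dot product: with values in {-1, +1} encoded as bits, agreements contribute +1 and disagreements -1. A minimal sketch of that arithmetic (the general BNN convention, not the paper's circuit):

```python
def bnn_dot(activations, weights):
    """Binary dot product with values in {-1, +1} encoded as {0, 1}."""
    assert len(activations) == len(weights)
    n = len(activations)
    # XNOR counts the positions where the two bit vectors agree.
    matches = sum(1 for a, w in zip(activations, weights) if a == w)
    # Each match contributes +1, each mismatch -1: result = matches - (n - matches).
    return 2 * matches - n

# Example: 2 matches and 2 mismatches cancel out to 0.
print(bnn_dot([1, 0, 1, 1], [1, 1, 1, 0]))  # -> 0
```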
{"title":"Efficient FeFET Crossbar Accelerator for Binary Neural Networks","authors":"T. Soliman, R. Olivo, T. Kirchner, Cecilia De la Parra, M. Lederer, T. Kämpfe, A. Guntoro, N. Wehn","doi":"10.1109/ASAP49362.2020.00027","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00027","url":null,"abstract":"This paper presents a novel ferroelectric field-effect transistor (FeFET) in-memory computing architecture dedicated to accelerate Binary Neural Networks (BNNs). We present in-memory convolution, batch normalization and dense layer processing through a grid of small crossbars with reduced unit size, which enables multiple bit operation and value accumulation. Additionally, we explore the possible operations parallelization for maximized computational performance. Simulation results show that our new architecture achieves a computing performance up to 2.46 TOPS while achieving a high power efficiency reaching 111.8 TOPS/Watt and an area of 0.026 mm2 in 22nm FDSOI technology.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126348766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fast and Accurate Training of Ensemble Models with FPGA-based Switch
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00023
Jiuxi Meng, Ce Guo, Nadeen Gebara, W. Luk
Random projection is gaining attention in large-scale machine learning. Multiplying the original dataset by a suitably chosen matrix provably reduces its dimensionality while approximately preserving the pairwise distances between points. However, projecting data onto a lower-dimensional subspace typically reduces training accuracy. In this paper, we propose a novel architecture that combines an FPGA-based switch with the ensemble learning method. This architecture reduces training time while maintaining high accuracy. Our initial results show a speedup of 2.12-6.77 times on four different high-dimensionality datasets.
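The distance-preservation property that motivates random projection (the Johnson-Lindenstrauss lemma) is easy to check numerically. A minimal sketch with a scaled Gaussian projection matrix, as background rather than the paper's FPGA design:

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 100, 1000, 64                        # samples, original dim, projected dim
X = rng.standard_normal((n, d))
R = rng.standard_normal((d, k)) / np.sqrt(k)   # scaled Gaussian projection matrix
Y = X @ R                                      # project all points at once

# Pairwise distances survive the projection up to a small distortion.
i, j = 3, 42
orig = np.linalg.norm(X[i] - X[j])
proj = np.linalg.norm(Y[i] - Y[j])
print(f"original distance {orig:.2f}, projected {proj:.2f}")
```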
{"title":"Fast and Accurate Training of Ensemble Models with FPGA-based Switch","authors":"Jiuxi Meng, Ce Guo, Nadeen Gebara, W. Luk","doi":"10.1109/ASAP49362.2020.00023","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00023","url":null,"abstract":"Random projection is gaining more attention in large scale machine learning. It has been proved to reduce the dimensionality of a set of data whilst approximately preserving the pairwise distance between points by multiplying the original dataset with a chosen matrix. However, projecting data to a lower dimension subspace typically reduces the training accuracy. In this paper, we propose a novel architecture that combines an FPGA-based switch with the ensemble learning method. This architecture enables reducing training time while maintaining high accuracy. Our initial result shows a speedup of 2.12-6.77 times using four different high dimensionality datasets.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124436953","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating Radiative Transfer Simulation with GPU-FPGA Cooperative Computation
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00011
Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura
Field-programmable gate arrays (FPGAs) have garnered significant interest in high-performance computing research. This is ascribed to the drastic improvement in their computational and communication capabilities in recent years, owing to advances in semiconductor integration technologies that rely on Moore's Law. In addition to these performance improvements, FPGA vendors now offer OpenCL toolchains that reduce the programming effort required. Together, these improvements make it feasible to offload, on the fly, computations at which CPUs/GPUs perform poorly relative to FPGAs, while sustaining low-latency data transfers. We consider this concept key to improving the performance of heterogeneous supercomputers that employ accelerators such as GPUs. In this study, we propose GPU-FPGA-accelerated simulation based on this concept and demonstrate its implementation with mixed CUDA and OpenCL programming. The experimental results show that our proposed method achieves up to 17.4× higher performance than a GPU-only implementation, and remains 1.32× faster even at the largest problem size, where the GPU-only implementation performs best. We consider the realization of GPU-FPGA-accelerated simulation to be the most significant difference between our work and previous studies.
{"title":"Accelerating Radiative Transfer Simulation with GPU-FPGA Cooperative Computation","authors":"Ryohei Kobayashi, N. Fujita, Y. Yamaguchi, T. Boku, K. Yoshikawa, Makito Abe, M. Umemura","doi":"10.1109/ASAP49362.2020.00011","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00011","url":null,"abstract":"Field-programmable gate arrays (FPGAs) have garnered significant interest in research on high-performance computing. This is ascribed to the drastic improvement in their computational and communication capabilities in recent years owing to advances in semiconductor integration technologies that rely on Moore’s Law. In addition to these performance improvements, toolchains for the development of FPGAs in OpenCL have been offered by FPGA vendors to reduce the programming effort required. These improvements suggest the possibility of implementing the concept of enabling on-the-fly offloading computation at which CPUs/GPUs perform poorly relative to FPGAs while performing low-latency data transfers. We consider this concept to be of key importance to improve the performance of heterogeneous supercomputers that employ accelerators such as a GPU. In this study, we propose GPU–FPGA-accelerated simulation based on this concept and demonstrate the implementation of the proposed method with CUDA and OpenCL mixed programming. The experimental results showed that our proposed method can increase the performance by up to $17.4 times$ compared with GPU-based implementation. This performance is still $1.32 times$ higher even when solving problems with the largest size, which is the fastest problem size for GPU-based implementation. We consider the realization of GPU–FPGA-accelerated simulation to be the most significant difference between our work and previous studies.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132217653","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Improved Side-Channel Resistance by Dynamic Fault-Injection Countermeasures
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00029
Jan Richter-Brockmann, T. Güneysu
Side-channel analysis and fault-injection attacks are known as serious threats to cryptographic hardware implementations, and combined protection against both is currently an open line of research. A promising countermeasure, albeit with considerable implementation overhead, appears to be a mix of first-order secure Threshold Implementations and linear Error-Correcting Codes. In this paper, we employ for the first time the inherent structure of non-systematic codes as a fault countermeasure, dynamically mutating the applied generator matrices to achieve a design protected against higher-order side-channel and fault attacks. As a case study, we apply our scheme to the PRESENT block cipher, which shows no higher-order side-channel leakage after measuring 150 million power traces.
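To illustrate the underlying algebra (a toy sketch, not the paper's construction): a message m is encoded as m·G over GF(2), and left-multiplying G by a random invertible matrix S "mutates" the message-to-codeword mapping while preserving the code itself, since the row space of G is unchanged. The generator matrix below is hypothetical:

```python
import numpy as np

rng = np.random.default_rng(1)

def encode(msg, G):
    return (msg @ G) % 2          # codeword = message times generator, over GF(2)

G = np.array([[1, 0, 1, 1, 0, 1],  # hypothetical 2x6 non-systematic generator
              [0, 1, 1, 0, 1, 1]])

def random_invertible(k):
    while True:
        S = rng.integers(0, 2, (k, k))
        # An integer 0/1 matrix is invertible over GF(2) iff det(S) is odd.
        if round(np.linalg.det(S)) % 2 == 1:
            return S

S = random_invertible(2)
G_mut = (S @ G) % 2               # mutated generator: same code, new mapping
msg = np.array([1, 0])
print(encode(msg, G), encode(msg, G_mut))
```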
{"title":"Improved Side-Channel Resistance by Dynamic Fault-Injection Countermeasures","authors":"Jan Richter-Brockmann, T. Güneysu","doi":"10.1109/ASAP49362.2020.00029","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00029","url":null,"abstract":"Side-channel analysis and fault-injection attacks are known as serious threats to cryptographic hardware implementations and the combined protection against both is currently an open line of research. A promising countermeasure with considerable implementation overhead appears to be a mix of first-order secure Threshold Implementations and linear Error-Correcting Codes.In this paper we employ for the first time the inherent structure of non-systematic codes as fault countermeasure which dynamically mutates the applied generator matrices to achieve a higher-order side-channel and fault-protected design. As a case study, we apply our scheme to the PRESENT block cipher that do not show any higher-order side-channel leakage after measuring 150 million power traces.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121220213","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Parallel-friendly Majority Gate to Accelerate In-memory Computation
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00025
J. Reuben, Stefan Pechmann
Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation compared to NAND/NOR/IMPLY logic due to its expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for the RRAM) and an industry-standard model (for the CMOS sense amplifier), and was found to tolerate reasonable variations in the RRAMs’ resistive states. Together with a NOT gate, which is also implemented in memory, the proposed gate forms a functionally complete Boolean logic set, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in the memory array. The proposed in-memory adder achieves latency reductions of 70% and 50% compared to IMPLY- and NAND/NOR-based adders, respectively.
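The expressive power of majority shows up clearly in addition: the carry-out of a full adder is exactly MAJ(a, b, cin), and the sum can be built from three majorities and two inversions. A small sketch using standard majority-logic identities (not the paper's exact in-array circuit):

```python
def maj(a, b, c):
    # Majority of three bits.
    return (a & b) | (b & c) | (a & c)

def full_adder(a, b, cin):
    carry = maj(a, b, cin)                       # carry-out is a plain majority
    s = maj(1 - carry, maj(a, b, 1 - cin), cin)  # sum from 3 MAJ gates + 2 NOTs
    return s, carry

# Exhaustive check against ordinary binary addition.
for a in (0, 1):
    for b in (0, 1):
        for cin in (0, 1):
            assert full_adder(a, b, cin) == ((a + b + cin) % 2, (a + b + cin) // 2)
```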
{"title":"A Parallel-friendly Majority Gate to Accelerate In-memory Computation","authors":"J. Reuben, Stefan Pechmann","doi":"10.1109/ASAP49362.2020.00025","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00025","url":null,"abstract":"Efforts to combat the ‘von Neumann bottleneck’ have been strengthened by Resistive RAMs (RRAMs), which enable computation in the memory array. Majority logic can accelerate computation when compared to NAND/NOR/IMPLY logic due to it’s expressive power. In this work, we propose a method to compute majority while reading from a transistor-accessed RRAM array. The proposed gate was verified by simulations using a physics-based model (for RRAM) and industry standard model (for CMOS sense amplifier) and, found to tolerate reasonable variations in the RRAMs’ resistive states. Together with NOT gate, which is also implemented in-memory, the proposed gate forms a functionally complete Boolean logic, capable of implementing any digital logic. Computing is simplified to a sequence of READ and WRITE operations and does not require any major modifications to the peripheral circuitry of the array. The parallel-friendly nature of the proposed gate is exploited to implement an eight-bit parallel-prefix adder in memory array. The proposed in-memory adder could achieve a latency reduction of 70% and 50% when compared to IMPLY and NAND/NOR logic-based adders, respectively.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"33 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124341728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ParaHist: FPGA Implementation of Parallel Event-Based Histogram for Optical Flow Calculation
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00038
Mohammad Pivezhandi, Phillip H. Jones, Joseph Zambreno
In this paper, we present an FPGA-based histogram-generation architecture to support optical flow calculation for event-based cameras. Our histogram generation mechanism reduces memory and logic resources by storing the time difference between consecutive events instead of the absolute time of each event. Additionally, we explore the trade-off between system resource usage and histogram accuracy as a function of the precision at which time is encoded. Our results show that, across three event-based camera benchmarks, the time encoding can be reduced from 32 to 7 bits with a loss of only approximately 3% in histogram accuracy. Compared to a software implementation, our architecture shows a significant speedup.
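The storage saving comes from delta-encoding timestamps. A minimal sketch of the idea under an assumed event format (sorted absolute timestamps); the saturation behavior at the 7-bit maximum is our illustrative choice, not necessarily the paper's:

```python
BITS = 7
MAX_DELTA = (1 << BITS) - 1            # largest gap representable in 7 bits: 127

def encode_deltas(timestamps):
    """timestamps: sorted absolute event times (e.g., in microseconds)."""
    deltas, prev = [], timestamps[0]
    for t in timestamps[1:]:
        deltas.append(min(t - prev, MAX_DELTA))  # store the gap, saturating
        prev = t
    return deltas

print(encode_deltas([1000, 1003, 1010, 1400]))   # [3, 7, 127] - last gap saturates
```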
{"title":"ParaHist: FPGA Implementation of Parallel Event-Based Histogram for Optical Flow Calculation","authors":"Mohammad Pivezhandi, Phillip H. Jones, Joseph Zambreno","doi":"10.1109/ASAP49362.2020.00038","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00038","url":null,"abstract":"In this paper, we present an FPGA-based architecture for histogram generation to support event-based camera optical flow calculation. Our proposed histogram generation mechanism reduces memory and logic resources by storing the time difference between consecutive events, instead of the absolute time of each event. Additionally, we explore the trade-off between system resource usage and histogram accuracy as a function of the precision at which time is encoded. Our results show that across three event-based camera benchmarks we can reduce the encoding of time from 32 to 7 bits with a loss of only approximately 3% in histogram accuracy. In comparison to a software implementation, our architecture shows a significant speedup.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125103010","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A New Hardware Approach to Self-Organizing Maps
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00041
L. Dias, M. G. Coutinho, E. Gaura, Marcelo A. C. Fernandes
Self-Organizing Maps (SOMs) are widely used as a data-mining technique for applications that require dimensionality reduction and clustering. Given the complexity of the SOM learning phase and the massive dimensionality and sample size of many data sets in Big Data applications, high-speed processing is critical when implementing SOM approaches. This paper proposes a new hardware approach to SOM implementation that exploits parallelization to optimize the system’s processing time. Unlike most implementations in the literature, the proposed approach parallelizes over the data dimensions instead of over the map, ensuring high processing speed regardless of the data dimensionality. An implementation on field-programmable gate arrays (FPGAs) is presented and evaluated. Key evaluation metrics are processing time (throughput) and FPGA area occupancy (hardware resources).
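For reference, one SOM learning step is a best-matching-unit search (a reduction over dimensions) followed by a neighborhood-weighted update that is independent per dimension, which is what makes dimension-level parallelism natural. A minimal NumPy sketch of the algorithm itself, not the paper's hardware:

```python
import numpy as np

rng = np.random.default_rng(0)
rows, cols, dim = 8, 8, 16
W = rng.random((rows * cols, dim))        # one weight vector per map neuron

def som_step(W, x, lr=0.5, sigma=1.5):
    d2 = ((W - x) ** 2).sum(axis=1)       # distances: a reduction over dimensions
    bmu = d2.argmin()                     # best-matching unit
    by, bx = divmod(bmu, cols)
    ny, nx = np.divmod(np.arange(rows * cols), cols)
    h = np.exp(-((ny - by) ** 2 + (nx - bx) ** 2) / (2 * sigma ** 2))
    W += lr * h[:, None] * (x - W)        # update: independent in each dimension
    return W

W = som_step(W, rng.random(dim))
```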
{"title":"A New Hardware Approach to Self-Organizing Maps","authors":"L. Dias, M. G. Coutinho, E. Gaura, Marcelo A. C. Fernandes","doi":"10.1109/ASAP49362.2020.00041","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00041","url":null,"abstract":"Self-Organizing Maps (SOMs) are widely used as a data mining technique for applications that require data dimensionality reduction and clustering. Given the complexity of the SOM learning phase and the massive dimensionality of many data sets as well as their sample size in Big Data applications, high-speed processing is critical when implementing SOM approaches. This paper proposes a new hardware approach to SOM implementation, exploiting parallelization, to optimize the system’s processing time. Unlike most implementations in the literature, this proposed approach allows the parallelization of the data dimensions instead of the map, ensuring high processing speed regardless of data dimensions. An implementation with field-programmable gate arrays (FPGA) is presented and evaluated. Key evaluation metrics are processing time (or throughput) and FPGA area occupancy (or hardware resources).","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129415893","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Template-based Framework for Exploring Coarse-Grained Reconfigurable Architectures
Pub Date: 2020-07-01 · DOI: 10.1109/ASAP49362.2020.00010
Artur Podobas, K. Sano, S. Matsuoka
Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs are used in several embedded applications, including automobile, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field is full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike other related work, our framework supports co-integration with third-party memory simulators and evaluation of future memory architectures, which is crucial for reasoning about memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple kernels, showing the impact of different configurations and design-space options.
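To make "highly parametrizable" concrete, a hypothetical configuration record with the kinds of knobs the abstract lists; the field names are ours for illustration, not the framework's actual API:

```python
from dataclasses import dataclass

@dataclass
class CgraConfig:
    rows: int = 8                 # processing-element grid height
    cols: int = 8                 # processing-element grid width
    data_width: int = 32          # data-type width in bits
    simd_lanes: int = 4           # SIMD width per processing element
    topology: str = "mesh"        # connectivity, e.g. "mesh" or "torus"
    heterogeneous: bool = False   # mix of PE types within one fabric

# A design-space point to hand to the simulator in an exploration sweep.
config = CgraConfig(rows=16, cols=16, data_width=16, simd_lanes=8)
```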
{"title":"A Template-based Framework for Exploring Coarse-Grained Reconfigurable Architectures","authors":"Artur Podobas, K. Sano, S. Matsuoka","doi":"10.1109/ASAP49362.2020.00010","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00010","url":null,"abstract":"Coarse-Grained Reconfigurable Architectures (CGRAs) are being considered as a complementary addition to modern High-Performance Computing (HPC) systems. These reconfigurable devices overcome many of the limitations of the (more popular) FPGA, by providing higher operating frequency, denser compute capacity, and lower power consumption. Today, CGRAs have been used in several embedded applications, including automobile, telecommunication, and mobile systems, but the literature on CGRAs in HPC is sparse and the field full of research opportunities. In this work, we introduce our CGRA simulator infrastructure for use in evaluating future HPC CGRA systems. Our CGRA simulator is built on synthesizable VHDL and is highly parametrizable, including support for connectivity, SIMD, data-type width, and heterogeneity. Unlike other related work, our framework supports co-integration with third-party memory simulators or evaluation of future memory architecture, which is crucial to reason around memory-bound applications. We demonstrate how our framework can be used to explore the performance of multiple different kernels, showing the impact of different configuration and design-space options.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128212961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}