
Latest publications: 2020 57th ACM/IEEE Design Automation Conference (DAC)

Prediction Confidence based Low Complexity Gradient Computation for Accelerating DNN Training
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218650
Dongyeob Shin, Geonho Kim, Joongho Jo, Jongsun Park
In deep neural network (DNN) training, network weights are iteratively updated with the weight gradients obtained from stochastic gradient descent (SGD). Since SGD inherently tolerates noisy gradient calculations, approximating weight gradient computations has large potential for training energy/time savings without degrading accuracy. In this paper, we propose an input-dependent approximation of the weight gradient for improving the energy efficiency of the training process. Considering that the output prediction of the network (confidence) changes with training inputs, the relation between the confidence and the magnitude of the weight gradient can be efficiently exploited to skip gradient computations without an accuracy drop, especially for high-confidence inputs. Under a given squared-error constraint, the computation skip rate can also be controlled by changing the confidence threshold. Simulation results show that our approach can skip 72.6% of gradient computations for the CIFAR-100 dataset using ResNet-18 without accuracy degradation. A hardware implementation in a 65nm CMOS process shows that our design achieves maximum per-epoch training energy and time savings of 88.84% and 98.16%, respectively, for the CIFAR-100 dataset using ResNet-18 compared to a state-of-the-art training accelerator.
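The confidence-gated skipping the abstract describes can be sketched in a few lines. This is our own illustration, not the authors' implementation; the function names and the 0.95 threshold are assumptions: compute each sample's softmax confidence and run the expensive weight-gradient computation only for samples below the threshold.

```python
import numpy as np

def softmax(logits):
    # Numerically stable softmax over the class axis.
    e = np.exp(logits - logits.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def select_backward_samples(logits, threshold=0.95):
    """Indices of samples whose top-class confidence is below the threshold;
    only these need the full weight-gradient computation."""
    conf = softmax(logits).max(axis=1)
    return np.nonzero(conf < threshold)[0]

logits = np.array([[5.0, 0.1, 0.2],   # high confidence -> gradient skipped
                   [1.0, 0.9, 0.8]])  # low confidence  -> gradient computed
keep = select_backward_samples(logits)
```

Raising the threshold keeps more samples in the backward pass, which is how the paper trades skip rate against the squared-error constraint.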
Citations: 6
Taming Unstructured Sparsity on GPUs via Latency-Aware Optimization
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218644
Maohua Zhu, Yuan Xie
Neural Networks (NNs) exhibit high redundancy in their parameters, so pruning methods can achieve high compression ratios without accuracy loss. However, the very high sparsity produced by unstructured pruning methods is difficult to map efficiently onto Graphics Processing Units (GPUs) because of its decoding overhead and workload imbalance. With the introduction of Tensor Core, the latest GPUs achieve even higher throughput for dense neural networks. This makes unstructured sparse neural networks fail to outperform their dense counterparts, because they are not currently supported by Tensor Core. To tackle this problem, prior work suggests structured pruning to improve the performance of sparse NNs on GPUs. However, such structured pruning methods have to sacrifice a significant part of the sparsity to retain model accuracy, which limits the speedup on the hardware. In this paper, we observe that the Tensor Core is also able to compute unstructured sparse NNs efficiently. To achieve this goal, we first propose ExTensor, a set of sparse Tensor Core instructions with a variable input matrix tile size. The variable tile size allows a matrix multiplication to be implemented by mixing different types of ExTensor instructions. We build a performance model to estimate the latency of an ExTensor instruction given an operand sparse weight matrix. Based on this model, we propose a heuristic algorithm that finds the optimal instruction sequence for an ExTensor-based kernel to achieve the best performance on the GPU. Experimental results demonstrate that our approach achieves 36% better performance than the state-of-the-art sparse Tensor Core design.
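The workload imbalance that motivates this work is easy to see numerically. A toy sketch of ours (the pruning threshold and matrix shape are arbitrary): after unstructured magnitude pruning, nonzeros per row vary widely, so a scheme that assigns one row per thread leaves most threads waiting for the longest row.

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.standard_normal((8, 64))
W[np.abs(W) < 1.0] = 0.0                  # unstructured magnitude pruning

row_nnz = (W != 0).sum(axis=1)            # work per row is uneven
imbalance = row_nnz.max() / row_nnz.mean()  # >1 means idle threads
```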
Citations: 3
WET: Write Efficient Loop Tiling for Non-Volatile Main Memory
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218612
Mohammad A. Alshboul, James Tuck, Yan Solihin
Future systems are expected to increasingly include Non-Volatile Main Memory (NVMM). However, due to the limited write endurance of NVMM, the number of writes must be reduced. While new architectures and algorithms have been proposed to reduce writes to NVMM, few studies have looked at the effect of compiler optimizations on writes. In this paper, we investigate the impact of one popular compiler optimization (loop tiling) on a very important computation kernel (matrix multiplication). Our novel observation is that tiling matrix multiplication causes a 25× write amplification. Furthermore, we investigate techniques to make tiling more NVMM-friendly, by choosing the right tile size and employing hierarchical tiling. Our method, Write-Efficient Tiling (WET), adds a new outer tile designed to fit the write working set in the Last Level Cache (LLC), reducing the number of writes to NVMM. Our experiments reduce writes by 81% while simultaneously improving performance.
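The hierarchical tiling WET builds on can be sketched as below. This is our own minimal illustration, not the authors' code: the outer `ii/jj` tiles bound the block of C being written (the write working set, which WET sizes to the LLC), while the inner `kk` tile covers the reads of A and B; each element of C is accumulated in cache and written back once.

```python
import numpy as np

def tiled_matmul(A, B, tile=2):
    """C = A @ B with square tiling; tile size is illustrative only."""
    n = A.shape[0]
    C = np.zeros((n, n))
    for ii in range(0, n, tile):          # outer tiles: write working set of C
        for jj in range(0, n, tile):
            for kk in range(0, n, tile):  # inner tiles: reads of A and B
                C[ii:ii+tile, jj:jj+tile] += (
                    A[ii:ii+tile, kk:kk+tile] @ B[kk:kk+tile, jj:jj+tile])
    return C

A = np.arange(16.0).reshape(4, 4)
B = np.eye(4)
C = tiled_matmul(A, B)   # multiplying by the identity returns A
```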
Citations: 4
Scalable Multi-FPGA Acceleration for Large RNNs with Full Parallelism Levels
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218528
Dongup Kwon, Suyeon Hur, Hamin Jang, E. Nurvitadhi, Jangwoo Kim
The increasing size of recurrent neural networks (RNNs) makes it hard to meet the growing demand for real-time AI services. For low-latency RNN serving, FPGA-based accelerators can leverage specialized architectures with optimized dataflow. However, they also suffer from severe hardware under-utilization when partitioning RNNs, and thus fail to obtain scalable performance. In this paper, we identify the performance bottlenecks of existing RNN partitioning strategies. We then propose a novel RNN partitioning strategy to achieve scalable multi-FPGA acceleration for large RNNs. First, we introduce three parallelism levels and exploit them by partitioning weight matrices, matrix/vector operations, and layers. Second, we examine the performance impact of collective communications and software pipelining to derive more accurate and optimal distribution results. We prototyped an FPGA-based acceleration system using multiple Intel high-end FPGAs, and our partitioning scheme allows up to 2.4× faster inference of modern RNN workloads than conventional partitioning methods.
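The first parallelism level named above, weight-matrix partitioning, has a simple functional form. A sketch of ours (not the authors' scheme; device count and shapes are arbitrary): split the matrix row-wise so each device computes one slice of the matrix-vector product, and concatenate the slices afterwards.

```python
import numpy as np

def distributed_matvec(W, x, n_devices=4):
    """Row-partitioned y = W @ x; each part stands in for one device."""
    parts = np.array_split(W, n_devices, axis=0)
    return np.concatenate([p @ x for p in parts])  # one slice per device

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))
x = rng.standard_normal(8)
y = distributed_matvec(W, x)   # matches the single-device product
```

In a real multi-FPGA system the concatenation is the collective communication step whose cost the paper models.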
Citations: 7
CAP’NN: Class-Aware Personalized Neural Network Inference
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218741
Maedeh Hemmat, Joshua San Miguel, A. Davoodi
We propose CAP’NN, a framework for Class-Aware Personalized Neural Network Inference. CAP’NN prunes an already-trained neural network model based on the preferences of individual users. Specifically, by adapting to the subset of output classes that each user is expected to encounter, CAP’NN is able to prune not only ineffectual neurons but also miseffectual neurons that confuse classification, without the need to retrain the network. CAP’NN achieves up to 50% model size reduction while actually improving top-1 (top-5) classification accuracy by up to 2.3% (3.2%) when the user only encounters a subset of VGG-16 classes.
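The simplest instance of class-aware pruning is at the output layer. A sketch of ours (CAP’NN's actual criterion reaches deeper into the network; this only shows the idea): when a user encounters only a subset of classes, the corresponding rows of the final weight matrix suffice.

```python
import numpy as np

def prune_output_layer(W, b, user_classes):
    """Keep only the output units for the classes this user encounters."""
    idx = np.asarray(sorted(user_classes))
    return W[idx], b[idx]

rng = np.random.default_rng(2)
W = rng.standard_normal((10, 64))   # 10-class output layer
b = rng.standard_normal(10)
Wp, bp = prune_output_layer(W, b, {2, 5, 7})   # user sees 3 of 10 classes
```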
Citations: 5
Romeo: Conversion and Evaluation of HDL Designs in the Encrypted Domain
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218579
Charles Gouert, N. G. Tsoutsos
As cloud computing becomes increasingly ubiquitous, protecting the confidentiality of data outsourced to third parties becomes a priority. While encryption is a natural solution to this problem, traditional algorithms may only protect data at rest and in transit, but do not support encrypted processing. In this work we introduce ROMEO, which enables easy-to-use privacy-preserving processing of data in the cloud using homomorphic encryption. ROMEO automatically converts arbitrary programs expressed in Verilog HDL into equivalent homomorphic circuits that are evaluated using encrypted inputs. For our experiments, we employ cryptographic circuits, such as AES, and benchmarks from the ISCAS’85 and ISCAS’89 suites.
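Evaluating a gate directly on ciphertexts can be illustrated with a toy far simpler than the homomorphic encryption ROMEO uses (this is our own illustration, not the paper's scheme): under a one-time pad, XOR-ing two ciphertexts yields a ciphertext of the XOR of the plaintexts, so an XOR gate can be computed without ever decrypting.

```python
import secrets

def encrypt(bit, key):
    return bit ^ key

def decrypt(ct, key):
    return ct ^ key

k1, k2 = secrets.randbelow(2), secrets.randbelow(2)  # one-time-pad keys
p1, p2 = 1, 0                                        # plaintext wire values
c_out = encrypt(p1, k1) ^ encrypt(p2, k2)   # XOR gate in the encrypted domain
result = decrypt(c_out, k1 ^ k2)            # equals p1 ^ p2
```

Schemes like TFHE generalize this to arbitrary Boolean gates, which is what makes compiling a Verilog netlist into a homomorphic circuit possible.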
Citations: 8
ATUNs: Modular and Scalable Support for Atomic Operations in a Shared Memory Multiprocessor
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218661
Andreas Kurth, Samuel Riedel, Florian Zaruba, T. Hoefler, L. Benini
Atomic operations are crucial for most modern parallel and concurrent algorithms, which necessitates their optimized implementation in highly-scalable manycore processors. We propose a modular and efficient, open-source ATomic UNit (ATUN) architecture that can be placed flexibly at different levels of the memory hierarchy. ATUN demonstrates near-optimal linear scaling for various synthetic and real-world workloads on an FPGA prototype with 32 RISC-V cores. We characterize the hardware complexity of our ATUN design in 22 nm FDSOI and find that it scales linearly in area (only 0.5 kGE per core) and logarithmically in the critical path.
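A software analogue of the operation such a unit performs (ours, for illustration only; ATUN implements this in hardware near memory) is a fetch-and-add whose read-modify-write is indivisible, so concurrent increments never lose updates.

```python
import threading

class AtomicCounter:
    """Lock-protected fetch-and-add, emulating a hardware atomic unit."""
    def __init__(self):
        self._value = 0
        self._lock = threading.Lock()

    def fetch_add(self, delta):
        with self._lock:            # read-modify-write happens indivisibly
            old = self._value
            self._value += delta
            return old

counter = AtomicCounter()
threads = [threading.Thread(
               target=lambda: [counter.fetch_add(1) for _ in range(1000)])
           for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# 4 threads x 1000 increments: no update is lost
```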
Citations: 3
CL(R)Early: An Early-stage DSE Methodology for Cross-Layer Reliability-aware Heterogeneous Embedded Systems
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218747
Siva Satyendra Sahoo, B. Veeravalli, Akash Kumar
Cross-layer reliability (CLR) presents a cost-effective alternative to traditional single-layer design in resource-constrained embedded systems. CLR provides scope for leveraging the inherent fault-masking of multiple layers and exploiting application-specific tolerance to degradation in some Quality of Service (QoS) metrics. However, it can also lead to an explosion in design complexity. State-of-the-art approaches to such joint optimization across multiple degrees of freedom can degrade system-level Design Space Exploration (DSE) results. To this end, we propose a DSE methodology for enabling CLR-aware task-mapping in heterogeneous embedded systems. Specifically, we present novel approaches to both task- and system-level analysis for performing an early-stage exploration of various design decisions. The proposed methodology yields considerable improvements over other state-of-the-art approaches and shows significant scaling with application size.
Citations: 5
A Cross-Layer Power and Timing Evaluation Method for Wide Voltage Scaling
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218682
Wenjie Fu, Leilei Jin, Ming Ling, Yu Zheng, Longxing Shi
Wide supply voltage scaling is critical to enable worthwhile dynamic adjustment of processor efficiency against varying workloads. In this paper, a cross-layer power and timing evaluation method is proposed to estimate processor energy efficiency using both circuit and architectural information over a wide voltage range. Process variations are considered through statistical static timing analysis, while the voltage effect is modeled through secondary iterated fittings. The error in estimating processor energy efficiency decreases to 8.29% when the supply voltage is scaled from 1.1V to 0.6V, while traditional architectural evaluations exhibit errors of more than 40%.
Citations: 3
ALSRAC: Approximate Logic Synthesis by Resubstitution with Approximate Care Set
Pub Date : 2020-07-01 DOI: 10.1109/DAC18072.2020.9218627
Chang Meng, Weikang Qian, A. Mishchenko
Approximate computing is an emerging design technique for error-resilient applications. It improves circuit area, power, and delay at the cost of introducing some errors. Approximate logic synthesis (ALS) is an automatic process for producing approximate circuits. This paper proposes resubstitution with an approximate care set and uses it to build a simulation-based ALS flow. Experimental results demonstrate that the proposed method saves 7%–18% area compared to state-of-the-art methods. The ALSRAC code is open-source.
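The simulation step in such a flow is worth making concrete. A toy sketch of ours (the circuits and trial count are arbitrary, not from the paper): compare an exact circuit against an approximate one on random input patterns to estimate the error rate the approximation introduces.

```python
import random

def exact(a, b, c):
    # Exact circuit: a 2-input AND feeding an OR.
    return (a and b) or c

def approx(a, b, c):
    # Approximation: drop the AND gate's b input entirely.
    return a or c

def error_rate(trials=10000, seed=0):
    """Monte-Carlo estimate of how often the two circuits disagree."""
    rng = random.Random(seed)
    errors = 0
    for _ in range(trials):
        a, b, c = (rng.randint(0, 1) for _ in range(3))
        if exact(a, b, c) != approx(a, b, c):
            errors += 1
    return errors / trials

rate = error_rate()   # only the pattern a=1, b=0, c=0 disagrees: ~1/8
```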
Citations: 15