
2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Publications

ASAP 2020 TOC
{"title":"ASAP 2020 TOC","authors":"","doi":"10.1109/asap49362.2020.00004","DOIUrl":"https://doi.org/10.1109/asap49362.2020.00004","url":null,"abstract":"","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125400548","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort
Mikhail Asiatici, Damian Maiorano, P. Ienne
String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging, and it is no surprise that no string sorters on FPGAs have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared-memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28-thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.
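The following is a rough, hypothetical Python sketch of the heterogeneous job-scheduling idea described above: CPU worker threads and FPGA processing elements pull instances of the accelerable classification kernel from one shared queue, so whichever resource is free takes the next job. The function names (cpu_classify, fpga_pe_classify) and the use of plain sorting as a stand-in for the pS5 kernel are assumptions for illustration, not the authors' implementation.

```python
import queue
import threading

# Hypothetical stand-ins: a software kernel on one CPU core and an offload to
# one free FPGA processing element. Plain sorting replaces the real pS5 kernel.
def cpu_classify(strings):
    return sorted(strings)

def fpga_pe_classify(strings):
    return sorted(strings)

job_queue = queue.Queue()

def worker(kernel, results, lock):
    # Each worker (CPU thread or FPGA-PE driver) competes for the next job.
    while True:
        try:
            job_id, strings = job_queue.get_nowait()
        except queue.Empty:
            return
        out = kernel(strings)
        with lock:
            results[job_id] = out

def run(jobs, num_cpu_threads=4, num_fpga_pes=2):
    results, lock = {}, threading.Lock()
    for item in enumerate(jobs):
        job_queue.put(item)
    threads = [threading.Thread(target=worker, args=(cpu_classify, results, lock))
               for _ in range(num_cpu_threads)]
    threads += [threading.Thread(target=worker, args=(fpga_pe_classify, results, lock))
                for _ in range(num_fpga_pes)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return [results[i] for i in range(len(jobs))]

if __name__ == "__main__":
    batches = [["banana", "apple", "cherry"], ["delta", "alpha"], ["b", "a", "c"]]
    print(run(batches))
```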
{"title":"FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort","authors":"Mikhail Asiatici, Damian Maiorano, P. Ienne","doi":"10.1109/ASAP49362.2020.00031","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00031","url":null,"abstract":"String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging and it is no surprise that no string sorters on FPGA have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28 thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115188810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Persistent Fault Analysis of Neural Networks on FPGA-based Acceleration System
Dawen Xu, Ziyang Zhu, Cheng Liu, Ying Wang, Huawei Li, Lei Zhang, K. Cheng
The increasing hardware failures caused by shrinking semiconductor technologies have a substantial impact on neural accelerators, and improving the resilience of neural network execution becomes a major design challenge, especially for mission-critical applications such as self-driving and medical diagnosis. Reliability analysis of neural network execution is a key step in understanding the influence of hardware failures and is therefore highly desirable. Prior works typically conducted fault analysis of neural network accelerators in simulation and concentrated on the prediction accuracy loss of the models. There is still a lack of systematic fault analysis of neural network acceleration systems that considers both accuracy degradation and system exceptions such as system stalls and early termination. In this work, we implemented a representative neural network accelerator and fault injection modules on a Xilinx ARM-FPGA platform and conducted fault analysis of the system using four typical neural network models. We have open-sourced the system on GitHub. With comprehensive experiments, we identify the system exceptions based on the various abnormal behaviours of the FPGA-based neural network acceleration system and analyze the underlying reasons. In particular, we find that the probability of system exceptions dominates the reliability of the system, and that these exceptions are mainly caused by faults in the DMA, control unit and instruction memory of the accelerators. In addition, faults in these components also incur moderate accuracy degradation of the neural network models beyond the system exceptions. Thus, these components are the most fragile parts of the accelerators and need to be hardened for reliable neural network execution.
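As a rough illustration of the fault-injection methodology (not the authors' ARM-FPGA setup), the following numpy sketch flips single bits in the weights of a toy linear classifier and reports how many predictions change; non-finite values serve as a crude stand-in for system-level anomalies. The accelerator-level exceptions discussed in the paper (DMA, control-unit and instruction-memory faults) are not modeled here, and the model and sizes are made up.

```python
import numpy as np

rng = np.random.default_rng(0)

def flip_bit(value, bit):
    """Flip one bit in the IEEE-754 encoding of a float32 weight (persistent fault)."""
    buf = np.array([value], dtype=np.float32).view(np.uint32)
    buf ^= np.uint32(1 << int(bit))
    return buf.view(np.float32)[0]

def forward(weights, x):
    # Toy stand-in for the accelerated network: one linear layer + argmax.
    return np.argmax(x @ weights, axis=1)

weights = rng.standard_normal((8, 4)).astype(np.float32)   # hypothetical layer
x = rng.standard_normal((256, 8)).astype(np.float32)       # hypothetical inputs
golden = forward(weights, x)

for trial in range(5):
    faulty = weights.copy()
    i, j, bit = rng.integers(8), rng.integers(4), rng.integers(32)
    faulty[i, j] = flip_bit(faulty[i, j], bit)
    outputs = x @ faulty
    if not np.isfinite(outputs).all():
        # Crude stand-in for a system-level anomaly (overflow to inf/NaN).
        print(f"trial {trial}: non-finite outputs after flipping bit {bit}")
        continue
    unchanged = np.mean(np.argmax(outputs, axis=1) == golden)
    print(f"trial {trial}: bit {bit} of w[{i},{j}] flipped, "
          f"{unchanged:.1%} of predictions unchanged")
```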
{"title":"Persistent Fault Analysis of Neural Networks on FPGA-based Acceleration System","authors":"Dawen Xu, Ziyang Zhu, Cheng Liu, Ying Wang, Huawei Li, Lei Zhang, K. Cheng","doi":"10.1109/ASAP49362.2020.00024","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00024","url":null,"abstract":"The increasing hardware failures caused by the shrinking semiconductor technologies pose substantial influence on the neural accelerators and improving the resilience of the neural network execution becomes a great design challenge especially to mission-critical applications such as self-driving and medical diagnose. The reliability analysis of the neural network execution is a key step to understand the influence of the hardware failures, and thus is highly demanded. Prior works typically conducted the fault analysis of neural network accelerators with simulation and concentrated on the prediction accuracy loss of the models. There is still a lack of systematic fault analysis of the neural network acceleration system that considers both the accuracy degradation and system exceptions such as system stall and early termination.In this work, we implemented a representative neural network accelerator and fault injection modules on a Xilinx ARM-FPGA platform and conducted fault analysis of the system using four typical neural network models. We had the system open-sourced on github. With comprehensive experiments, we identify the system exceptions based on the various abnormal behaviours of the FPGA-based neural network acceleration system and analyze the underlying reasons. Particularly, we find that the probability of the system exceptions dominates the reliability of the system and they are mainly caused by faults in the DMA, control unit and instruction memory of the accelerators. In addition, faults in these components also incur moderate accuracy degradation of the neural network models other than the system exceptions. Thus, these components are the most fragile part of the accelerators and need to be hardened for reliable neural network execution.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114629896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 10
Anytime Floating-Point Addition and Multiplication-Concepts and Implementations
Marcel Brand, Michael Witterauf, A. Bosio, J. Teich
In this paper, we present anytime instructions for floating-point additions and multiplications. Specific to such instructions is their ability to compute an arithmetic operation at a programmable accuracy of the a most significant bits, where a is encoded in the instruction itself. Contrary to reduced-precision architectures, the word length is maintained throughout the execution. Two approaches are presented for the efficient implementation of anytime additions and multiplications, one based on on-line arithmetic and the other on bitmasking. We propose implementations of anytime functional units for both approaches and evaluate them in terms of error, latency, area, and energy savings. As a result, 15% of energy can be saved on average while computing a floating-point addition with an error of less than 0.1%. Moreover, large latency and energy savings are reported for iterative algorithms such as the Jacobi algorithm, with savings of up to 39% in energy.
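A minimal software sketch of the bitmasking approach, assuming we emulate an anytime operation by zeroing all but the a most significant fraction bits of the IEEE-754 operands before the addition; the hardware units described in the paper produce the a most significant result bits directly, so this is only an approximation of the concept, not the proposed functional unit.

```python
import struct

def anytime_mask(x, a):
    """Keep only the a most significant fraction bits of a float64 (bitmasking idea).

    The low-order 52 - a mantissa bits are zeroed, so an addition or
    multiplication on the masked operands becomes cheaper in hardware while
    the word length is unchanged.
    """
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    keep = max(0, min(52, a))
    mask = ~((1 << (52 - keep)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

def anytime_add(x, y, a):
    return anytime_mask(x, a) + anytime_mask(y, a)

exact = 3.14159265358979 + 2.71828182845905
for a in (4, 8, 16, 52):
    approx = anytime_add(3.14159265358979, 2.71828182845905, a)
    print(f"a={a:2d}: result={approx:.12f}, relative error={abs(approx - exact) / exact:.2e}")
```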
{"title":"Anytime Floating-Point Addition and Multiplication-Concepts and Implementations","authors":"Marcel Brand, Michael Witterauf, A. Bosio, J. Teich","doi":"10.1109/ASAP49362.2020.00034","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00034","url":null,"abstract":"In this paper, we present anytime instructions for floating-point additions and multiplications. Specific to such instructions is their ability to compute an arithmetic operation at a programmable accuracy of a most significant bits where a is encoded in the instruction itself. Contrary to reduced-precision architectures, the word length is maintained throughout the execution. Two approaches are presented for the efficient implementation of anytime additions and multiplications, one based on on-line arithmetic and the other on bitmasking. We propose implementations of anytime functional units for both approaches and evaluate them in terms of error, latency, area, as well as energy savings. As a result, 15% of energy can be saved on average while computing a floating-point addition with an error of less than 0.1%. Moreover, large latency and energy savings are reported for iterative algorithms such as a Jacobi algorithm with savings of up to 39% in energy.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114691672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
Reconfigurable Stream-based Tensor Unit with Variable-Precision Posit Arithmetic
Nuno Neves, P. Tomás, N. Roma
The increased adoption of DNN applications drove the emergence of dedicated tensor computing units to accelerate multi-dimensional matrix multiplication operations. Although they deploy highly efficient computing architectures, they often lack support for more general-purpose application domains. Such a limitation occurs both due to their consolidated computation scheme (restricted to matrix multiplication) and due to their frequent adoption of low-precision/custom floating-point formats (unsuited for general application domains). In contrast, this paper proposes a new Reconfigurable Tensor Unit (RTU) which deploys an array of variable-precision Vector Multiply-Accumulate (VMA) units. Furthermore, each VMA unit leverages the new Posit floating-point format and supports the full range of standardized posit precisions in a single SIMD unit, with variable vector-element width. Moreover, the proposed RTU explores the Posit format features for fused operations, together with spatial and time-multiplexing reconfiguration mechanisms to fuse and combine multiple VMAs to map high-level and complex operations. The RTU is also supported by an automatic data streaming infrastructure and a pipelined data movement scheme, allowing it to accelerate the computation of most data-parallel patterns commonly present in vectorizable applications. The proposed RTU was shown to outperform state-of-the-art tensor and SIMD units present in off-the-shelf platforms, in turn resulting in significant energy-efficiency improvements.
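To illustrate the number format the VMA lanes operate on, here is a hedged Python sketch that decodes a standard posit bit pattern (sign, regime, exponent, fraction with a hidden one) into a floating-point value and uses it in a small software multiply-accumulate. The parameters n and es are configurable for illustration; nothing here reflects the actual RTU hardware or its fused-operation datapath.

```python
def decode_posit(bits, n=16, es=1):
    """Decode an n-bit posit with es exponent bits to a Python float (sketch only)."""
    mask = (1 << n) - 1
    bits &= mask
    if bits == 0:
        return 0.0
    if bits == 1 << (n - 1):
        return float("nan")            # NaR (not a real)
    sign = -1.0 if bits >> (n - 1) else 1.0
    if sign < 0:
        bits = (-bits) & mask          # two's complement for negative posits
    body = bits & ((1 << (n - 1)) - 1)     # drop the sign bit
    # Regime: run length of identical bits starting at the MSB of the body.
    first = (body >> (n - 2)) & 1
    run, pos = 0, n - 2
    while pos >= 0 and ((body >> pos) & 1) == first:
        run += 1
        pos -= 1
    k = run - 1 if first == 1 else -run
    pos -= 1                            # skip the regime terminating bit
    # Exponent: next es bits (zero-padded if the word ran out).
    exp = 0
    for _ in range(es):
        exp = (exp << 1) | ((body >> pos) & 1 if pos >= 0 else 0)
        pos -= 1
    # Fraction: whatever bits remain, with a hidden leading 1.
    frac_bits = pos + 1
    frac = body & ((1 << frac_bits) - 1) if frac_bits > 0 else 0
    mantissa = 1.0 + (frac / (1 << frac_bits) if frac_bits > 0 else 0.0)
    return sign * (2.0 ** (k * (1 << es) + exp)) * mantissa

# Example: a tiny posit16 (es=1) multiply-accumulate in software.
acc = 0.0
for a, b in [(0x4000, 0x4000), (0x5000, 0x3000)]:   # 1.0*1.0 + 2.0*0.5
    acc += decode_posit(a) * decode_posit(b)
print(acc)                                           # 2.0
```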
{"title":"Reconfigurable Stream-based Tensor Unit with Variable-Precision Posit Arithmetic","authors":"Nuno Neves, P. Tomás, N. Roma","doi":"10.1109/ASAP49362.2020.00033","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00033","url":null,"abstract":"The increased adoption of DNN applications drove the emergence of dedicated tensor computing units to accelerate multi-dimensional matrix multiplication operations. Although they deploy highly efficient computing architectures, they often lack support for more general-purpose application domains. Such a limitation occurs both due to their consolidated computation scheme (restricted to matrix multiplication) and due to their frequent adoption of low-precision/custom floating-point formats (unsuited for general application domains). In contrast, this paper proposes a new Reconfigurable Tensor Unit (RTU) which deploys an array of variable-precision Vector MultiplyAccumulate (VMA) units. Furthermore, each VMA unit leverages the new Posit floating-point format and supports the full range of standardized posit precisions in a single SIMD unit, with variable vector-element width. Moreover, the proposed RTU explores the Posit format features for fused operations, together with spatial and time-multiplexing reconfiguration mechanisms to fuse and combine multiple VMAs to map high-level and complex operations. The RTU is also supported by an automatic data streaming infrastructure and a pipelined data movement scheme, allowing it to accelerate the computation of most data-parallel patterns commonly present in vectorizable applications. The proposed RTU showed to outperform state-of-the-art tensor and SIMD units, present in off-the-shelf platforms, in turn resulting in significant energy-efficiency improvements.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126531987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 5
Array Aware Training/Pruning: Methods for Efficient Forward Propagation on Array-based Neural Network Accelerators
Krishna Teja Chitty-Venkata, Arun Kumar Somani
Due to the increase in the use of large Deep Neural Networks (DNNs) over the years, specialized hardware accelerators such as the Tensor Processing Unit and Eyeriss have been developed to accelerate the forward pass of the network. The essential component of these devices is an array processor composed of multiple individual compute units for efficiently executing Multiply-and-Accumulate (MAC) operations. As the size of this array limits how much of a single layer can be processed at once, the computation is performed serially in several batches, leading to extra compute cycles along both axes. In practice, due to the mismatch between matrix and array sizes, the computation does not map onto the array exactly. In this work, we address the issue of minimizing processing cycles on the array by adjusting the DNN model parameters using a structured, hardware-array-dependent optimization. We introduce two techniques in this paper: Array Aware Training (AAT) for efficient training and Array Aware Pruning (AAP) for efficient inference. Weight pruning is an approach to remove redundant parameters in the network to decrease its size. The key idea behind pruning in this paper is to adjust the model parameters (the weight matrix) so that the array is fully utilized in each computation batch. Our goal is to compress the model based on the size of the array so as to reduce the number of computation cycles. We observe that both proposed techniques result in accuracy similar to that of the original network while saving a significant number of processing cycles (75%).
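A small numpy sketch of the array-fitting idea, under the assumption that pruning removes whole output rows with the smallest L2 norm until the row count is a multiple of the array height, so no tile is left partially filled and the number of serial batches drops. This is an illustrative simplification, not the exact AAT/AAP procedure; the layer and array sizes are made up.

```python
import numpy as np

def array_aware_prune(weights, array_rows):
    """Prune whole output rows so the row count fits the array exactly.

    Rows with the smallest L2 norm are removed until the number of rows is a
    multiple of the systolic-array height.
    """
    rows = weights.shape[0]
    target = (rows // array_rows) * array_rows          # largest multiple <= rows
    norms = np.linalg.norm(weights, axis=1)
    keep = np.sort(np.argsort(norms)[rows - target:])   # drop the weakest rows
    return weights[keep], keep

def compute_batches(rows, cols, array_rows, array_cols):
    """Number of serial array passes needed for a rows x cols weight matrix."""
    return int(np.ceil(rows / array_rows) * np.ceil(cols / array_cols))

rng = np.random.default_rng(0)
W = rng.standard_normal((70, 128))                      # hypothetical layer: 70 outputs
before = compute_batches(*W.shape, 64, 64)
W_pruned, kept = array_aware_prune(W, 64)
after = compute_batches(*W_pruned.shape, 64, 64)
print(f"rows {W.shape[0]} -> {W_pruned.shape[0]}, batches {before} -> {after}")
```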
{"title":"Array Aware Training/Pruning: Methods for Efficient Forward Propagation on Array-based Neural Network Accelerators","authors":"Krishna Teja Chitty-Venkata, Arun Kumar Somani","doi":"10.1109/ASAP49362.2020.00016","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00016","url":null,"abstract":"Due to the increase in the use of large-sized Deep Neural Networks (DNNs) over the years, specialized hardware accelerators such as Tensor Processing Unit and Eyeriss have been developed to accelerate the forward pass of the network. The essential component of these devices is an array processor which is composed of multiple individual compute units for efficiently executing Multiplication and Accumulation (MAC) operation. As the size of this array limits the amount of DNN processing of a single layer, the computation is performed in several batches serially leading to extra compute cycles along both the axes. In practice, due to the mismatch between matrix and array sizes, the computation does not map on the array exactly. In this work, we address the issue of minimizing processing cycles on the array by adjusting the DNN model parameters by using a structured hardware array dependent optimization. We introduce two techniques in this paper: Array Aware Training (AAT) for efficient training and Array Aware Pruning (AAP) for efficient inference. Weight pruning is an approach to remove redundant parameters in the network to decrease the size of the network. The key idea behind pruning in this paper is to adjust the model parameters (the weight matrix) so that the array is fully utilized in each computation batch. Our goal is to compress the model based on the size of the array so as to reduce the number of computation cycles. We observe that both the proposed techniques results into similar accuracy as the original network while saving a significant number of processing cycles (75%).","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131713230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Training Neural Nets using only an Approximate Tableless LNS ALU
M. Arnold, E. Chester, Corey Johnson
The Logarithmic Number System (LNS) is useful in applications that tolerate approximate computation, such as classification using multi-layer neural networks that compute nonlinear functions of weighted sums of inputs from previous layers. Supervised learning has two phases: training (finding appropriate weights for the desired classification) and inference (using the weights with an approximate sum of products). Several researchers have observed that LNS ALUs used for inference may minimize area and power by being both low-precision and approximate (allowing low-cost, tableless implementations). However, the few works that have also trained with LNS report that at least part of the system needs accurate LNS. This paper describes a novel approximate LNS ALU implemented simply as logic (without tables) that enables the entire back-propagation training to occur in LNS, at one-third the cost of a fixed-point implementation.
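A brief Python sketch of LNS arithmetic for orientation: multiplication is exact (an addition of logarithms), while addition requires the correction term log2(1 + 2^(-d)); a tableless unit approximates that term, here with the Mitchell-style approximation log2(1 + x) ≈ x. This illustrates the general idea only and is not the specific approximate ALU proposed in the paper.

```python
import math

def to_lns(x):
    """Represent a positive value by its base-2 logarithm."""
    return math.log2(x)

def from_lns(l):
    return 2.0 ** l

def lns_mul(a, b):
    # Multiplication in LNS is exact: just add the logarithms.
    return a + b

def lns_add_exact(a, b):
    # log2(2^a + 2^b) = max + log2(1 + 2^(lo - hi)).
    hi, lo = max(a, b), min(a, b)
    return hi + math.log2(1.0 + 2.0 ** (lo - hi))

def lns_add_approx(a, b):
    # Tableless approximation in the spirit of Mitchell: log2(1 + x) ~= x for
    # 0 <= x < 1, so the correction term reduces to a shift-like operation.
    hi, lo = max(a, b), min(a, b)
    return hi + 2.0 ** (lo - hi)

x, y = 3.0, 5.0
a, b = to_lns(x), to_lns(y)
print("product:", from_lns(lns_mul(a, b)))                 # exact: 15.0
print("sum (exact LNS):", from_lns(lns_add_exact(a, b)))   # 8.0
print("sum (approx LNS):", from_lns(lns_add_approx(a, b))) # ~7.6, showing the error
```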
{"title":"Training Neural Nets using only an Approximate Tableless LNS ALU","authors":"M. Arnold, E. Chester, Corey Johnson","doi":"10.1109/ASAP49362.2020.00020","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00020","url":null,"abstract":"The Logarithmic Number System (LNS) is useful in applications that tolerate approximate computation, such as classification using multi-layer neural networks that compute nonlinear functions of weighted sums of inputs from previous layers. Supervised learning has two phases: training (find appropriate weights for the desired classification), and inference (use the weights with approximate sum of products). Several researchers have observed that LNS ALUs in inference may minimize area and power by being both low-precision and approximate (allowing low-cost, tableless implementations). However, the few works that have also trained with LNS report at least part of the system needs accurate LNS. This paper describes a novel approximate LNS ALU implemented simply as logic (without tables) that enables the entire back-propagation training to occur in LNS, at one-third the cost of fixed-point implementation.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131128834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 4
[ASAP 2020 Title page]
{"title":"[ASAP 2020 Title page]","authors":"","doi":"10.1109/asap49362.2020.00002","DOIUrl":"https://doi.org/10.1109/asap49362.2020.00002","url":null,"abstract":"","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124676316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
External Referees - ASAP 2020
{"title":"External Referees - ASAP 2020","authors":"","doi":"10.1109/asap49362.2020.00009","DOIUrl":"https://doi.org/10.1109/asap49362.2020.00009","url":null,"abstract":"","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"103 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123165919","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Dynamic Sharing in Multi-accelerators of Neural Networks on an FPGA Edge Device
Hsin-Yu Ting, Tootiya Giyahchi, A. A. Sani, E. Bozorgzadeh
Edge computing can potentially provide abundant processing resources for compute-intensive applications while bringing services close to end devices. With the increasing demand for computing acceleration at the edge, FPGAs have been deployed to provide custom deep neural network accelerators. This paper explores a DNN accelerator sharing system on an edge FPGA device that serves various DNN applications from multiple end devices simultaneously. The proposed SharedDNN/PlanAhead policy exploits the regularity among requests for various DNN accelerators and determines which accelerator to allocate for each request, and in what order to respond to the requests, to achieve maximum responsiveness for a queue of acceleration requests. Our results show an overall performance gain of up to 2.20x and improved utilization by reducing DNN library usage by up to 27%, while staying within the requests' requirements and resource constraints.
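The following is a hypothetical sketch of an accelerator-sharing policy in the spirit of the one described above: each accelerator slot hosts one DNN, and a request is dispatched to the slot that would finish it earliest, paying a reconfiguration penalty when the slot currently hosts a different model, so regularity among requests is rewarded. All cost values and names are made up; this is not the actual SharedDNN/PlanAhead algorithm.

```python
RECONFIG_COST = 5.0   # assumed time to load a different DNN into a slot

def schedule(requests, num_slots):
    """requests: list of (model_name, run_time); returns per-request finish times."""
    slots = [[0.0, None] for _ in range(num_slots)]   # [time_free, loaded_model]
    finish_times = []
    for model, run_time in requests:
        # Pick the slot that would finish this request earliest, accounting for
        # the reconfiguration penalty when it currently hosts a different DNN.
        slot = min(slots, key=lambda s: s[0] + (0.0 if s[1] == model else RECONFIG_COST))
        penalty = 0.0 if slot[1] == model else RECONFIG_COST
        slot[0] += penalty + run_time
        slot[1] = model
        finish_times.append(slot[0])
    return finish_times

# Requests that repeat the same model benefit from staying on a warm slot.
reqs = [("resnet", 2.0), ("resnet", 2.0), ("yolo", 3.0), ("resnet", 2.0)]
print(schedule(reqs, num_slots=2))
```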
{"title":"Dynamic Sharing in Multi-accelerators of Neural Networks on an FPGA Edge Device","authors":"Hsin-Yu Ting, Tootiya Giyahchi, A. A. Sani, E. Bozorgzadeh","doi":"10.1109/ASAP49362.2020.00040","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00040","url":null,"abstract":"Edge computing can potentially provide abundant processing resources for compute-intensive applications while bringing services close to end devices. With the increasing demands for computing acceleration at the edge, FPGAs have been deployed to provide custom deep neural network accelerators. This paper explores a DNN accelerator sharing system at the edge FPGA device, that serves various DNN applications from multiple end devices simultaneously. The proposed SharedDNN/PlanAhead policy exploits the regularity among requests for various DNN accelerators and determines which accelerator to allocate for each request and in what order to respond to the requests that achieve maximum responsiveness for a queue of acceleration requests. Our results show overall 2. 20x performance gain at best and utilization improvement by reducing up to 27% of DNN library usage while staying within the requests’ requirements and resource constraints.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124278105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13