FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00031
Mikhail Asiatici, Damian Maiorano, P. Ienne
String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging, and it is no surprise that no string sorters on FPGAs have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared-memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28-thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.
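The kernel that pHS5 offloads to the FPGA is the bucket-classification step of string sample sort: each input string is compared against a small set of splitters and assigned to a bucket, which is embarrassingly parallel across strings. The sketch below is a minimal software analogue of that step, assuming a plain sorted-splitter search rather than the splitter tree with distinguishing prefixes that pS5 actually uses; the function name and data are illustrative.

```python
import bisect

def classify_strings(strings, splitters):
    """Partition strings into buckets delimited by sorted splitters.

    This mirrors the classification step of string sample sort: each string
    is compared against a small, cache-resident set of splitters and assigned
    a bucket index. (pS5 itself walks a tree of splitters keyed on
    distinguishing prefixes; this bisect-based version only illustrates the
    data-parallel structure that a PE can stream through.)
    """
    buckets = [[] for _ in range(len(splitters) + 1)]
    for s in strings:
        # bisect_right returns the index of the first splitter greater than s,
        # i.e. the bucket that s falls into.
        buckets[bisect.bisect_right(splitters, s)].append(s)
    return buckets

if __name__ == "__main__":
    data = ["mapreduce", "fpga", "harp", "sort", "string", "sample", "xeon"]
    splitters = sorted(["harp", "sort"])   # in sample sort these are sampled from the input
    for i, bucket in enumerate(classify_strings(data, splitters)):
        print(f"bucket {i}: {bucket}")
```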
{"title":"FPGAs in the Datacenters: the Case of Parallel Hybrid Super Scalar String Sample Sort","authors":"Mikhail Asiatici, Damian Maiorano, P. Ienne","doi":"10.1109/ASAP49362.2020.00031","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00031","url":null,"abstract":"String sorting is an important part of database and MapReduce applications; however, it has not been studied as extensively as sorting of fixed-length keys. Handling variable-length keys in hardware is challenging and it is no surprise that no string sorters on FPGA have been proposed yet. In this paper, we present Parallel Hybrid Super Scalar String Sample Sort (pHS5) on Intel HARPv2, a heterogeneous CPU-FPGA system with a server-grade multi-core CPU. Our pHS5 is based on the state-of-the-art string sorting algorithm for multi-core shared memory CPUs, pS5, which we extended with multiple processing elements (PEs) on the FPGA. Each PE accelerates one instance of the most effectively parallelizable dominant kernel of pS5 by up to 33% compared to a single Intel Xeon Broadwell core running at 3.4 GHz. Furthermore, we extended the job scheduling mechanism of pS5 to enable our PEs to compete with the CPU cores for processing the accelerable kernel, while retaining the complex high-level control flow and the sorting of the smaller data sets on the CPU. We accelerate the whole algorithm by up to 10% compared to the 28 thread software baseline running on the 14-core Xeon processor and by up to 36% at lower thread counts.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115188810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Persistent Fault Analysis of Neural Networks on FPGA-based Acceleration System
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00024
Dawen Xu, Ziyang Zhu, Cheng Liu, Ying Wang, Huawei Li, Lei Zhang, K. Cheng
Increasing hardware failures caused by shrinking semiconductor technologies substantially affect neural network accelerators, and improving the resilience of neural network execution has become a major design challenge, especially for mission-critical applications such as self-driving and medical diagnosis. Reliability analysis of neural network execution is a key step toward understanding the influence of hardware failures and is therefore in high demand. Prior work typically conducted fault analysis of neural network accelerators through simulation and concentrated on the prediction accuracy loss of the models. A systematic fault analysis of the neural network acceleration system that considers both accuracy degradation and system exceptions, such as system stalls and early termination, is still lacking. In this work, we implemented a representative neural network accelerator and fault injection modules on a Xilinx ARM-FPGA platform and conducted a fault analysis of the system using four typical neural network models. We have open-sourced the system on GitHub. Through comprehensive experiments, we identify system exceptions based on the various abnormal behaviours of the FPGA-based neural network acceleration system and analyze their underlying causes. In particular, we find that the probability of system exceptions dominates the reliability of the system, and that these exceptions are mainly caused by faults in the DMA, control unit, and instruction memory of the accelerator. In addition to causing system exceptions, faults in these components also incur moderate accuracy degradation of the neural network models. These components are thus the most fragile parts of the accelerator and need to be hardened for reliable neural network execution.
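To make the fault model concrete, the sketch below shows the kind of persistent bit flip that fault-injection modules of this sort emulate: one bit of a stored word (here, a toy instruction memory) is flipped and the corrupted value remains in place, so every later access sees it, unlike a transient upset that affects only a single computation. This is only a software analogue under assumed parameters; the paper's modules inject faults into the RTL of the accelerator's DMA, control unit, and memories.

```python
import random

def inject_persistent_fault(word, bit_width=32, bit=None):
    """Flip one bit of a stored word and return the corrupted word.

    Software analogue of a persistent fault: the flipped value stays in the
    storage element until it is overwritten. `bit_width` and the random bit
    choice are illustrative assumptions, not the paper's exact fault model.
    """
    if bit is None:
        bit = random.randrange(bit_width)
    return word ^ (1 << bit), bit

# Corrupt one word of a toy instruction memory; every subsequent fetch of that
# address now sees the faulty encoding, which is how control-path faults can
# lead to system stalls or early termination rather than mere accuracy loss.
instruction_memory = [0x00012083, 0x00208113, 0x0000006F]
faulty, flipped_bit = inject_persistent_fault(instruction_memory[1])
instruction_memory[1] = faulty
print(f"flipped bit {flipped_bit}: instruction[1] = {instruction_memory[1]:#010x}")
```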
{"title":"Persistent Fault Analysis of Neural Networks on FPGA-based Acceleration System","authors":"Dawen Xu, Ziyang Zhu, Cheng Liu, Ying Wang, Huawei Li, Lei Zhang, K. Cheng","doi":"10.1109/ASAP49362.2020.00024","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00024","url":null,"abstract":"The increasing hardware failures caused by the shrinking semiconductor technologies pose substantial influence on the neural accelerators and improving the resilience of the neural network execution becomes a great design challenge especially to mission-critical applications such as self-driving and medical diagnose. The reliability analysis of the neural network execution is a key step to understand the influence of the hardware failures, and thus is highly demanded. Prior works typically conducted the fault analysis of neural network accelerators with simulation and concentrated on the prediction accuracy loss of the models. There is still a lack of systematic fault analysis of the neural network acceleration system that considers both the accuracy degradation and system exceptions such as system stall and early termination.In this work, we implemented a representative neural network accelerator and fault injection modules on a Xilinx ARM-FPGA platform and conducted fault analysis of the system using four typical neural network models. We had the system open-sourced on github. With comprehensive experiments, we identify the system exceptions based on the various abnormal behaviours of the FPGA-based neural network acceleration system and analyze the underlying reasons. Particularly, we find that the probability of the system exceptions dominates the reliability of the system and they are mainly caused by faults in the DMA, control unit and instruction memory of the accelerators. In addition, faults in these components also incur moderate accuracy degradation of the neural network models other than the system exceptions. Thus, these components are the most fragile part of the accelerators and need to be hardened for reliable neural network execution.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114629896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Anytime Floating-Point Addition and Multiplication-Concepts and Implementations
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00034
Marcel Brand, Michael Witterauf, A. Bosio, J. Teich
In this paper, we present anytime instructions for floating-point additions and multiplications. Specific to such instructions is their ability to compute an arithmetic operation at a programmable accuracy of the a most significant bits, where the parameter a is encoded in the instruction itself. Contrary to reduced-precision architectures, the word length is maintained throughout the execution. Two approaches are presented for the efficient implementation of anytime additions and multiplications, one based on on-line arithmetic and the other on bitmasking. We propose implementations of anytime functional units for both approaches and evaluate them in terms of error, latency, area, and energy savings. As a result, 15% of energy can be saved on average when computing a floating-point addition with an error of less than 0.1%. Moreover, large latency and energy savings are reported for iterative algorithms such as the Jacobi method, with energy savings of up to 39%.
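As a rough illustration of the bitmasking approach, the sketch below zeroes all but the a most significant fraction bits of the operands and the result, so accuracy is programmable while the word length stays fixed. It is a minimal software model only: it assumes IEEE-754 binary64, truncates rather than rounds, and does not model the on-line-arithmetic variant or the hardware's exact masking point; the function names are illustrative.

```python
import struct

def mask_fraction(x, kept_bits):
    """Zero all but the `kept_bits` most significant fraction bits of a
    binary64 value. The 64-bit word length is preserved; only the effective
    accuracy of the significand is reduced (truncation, no rounding)."""
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    mask = ~((1 << (52 - kept_bits)) - 1) & 0xFFFFFFFFFFFFFFFF
    return struct.unpack("<d", struct.pack("<Q", bits & mask))[0]

def anytime_add(x, y, accuracy_bits):
    """Approximate addition at a programmable accuracy (in fraction bits)."""
    return mask_fraction(mask_fraction(x, accuracy_bits) + mask_fraction(y, accuracy_bits),
                         accuracy_bits)

exact = 1.2345678 + 2.3456789
for a in (8, 16, 24):
    approx = anytime_add(1.2345678, 2.3456789, a)
    print(f"a={a:2d}: result={approx:.7f}  rel.err={abs(approx - exact) / exact:.2e}")
```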
{"title":"Anytime Floating-Point Addition and Multiplication-Concepts and Implementations","authors":"Marcel Brand, Michael Witterauf, A. Bosio, J. Teich","doi":"10.1109/ASAP49362.2020.00034","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00034","url":null,"abstract":"In this paper, we present anytime instructions for floating-point additions and multiplications. Specific to such instructions is their ability to compute an arithmetic operation at a programmable accuracy of a most significant bits where a is encoded in the instruction itself. Contrary to reduced-precision architectures, the word length is maintained throughout the execution. Two approaches are presented for the efficient implementation of anytime additions and multiplications, one based on on-line arithmetic and the other on bitmasking. We propose implementations of anytime functional units for both approaches and evaluate them in terms of error, latency, area, as well as energy savings. As a result, 15% of energy can be saved on average while computing a floating-point addition with an error of less than 0.1%. Moreover, large latency and energy savings are reported for iterative algorithms such as a Jacobi algorithm with savings of up to 39% in energy.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114691672","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reconfigurable Stream-based Tensor Unit with Variable-Precision Posit Arithmetic
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00033
Nuno Neves, P. Tomás, N. Roma
The increased adoption of DNN applications drove the emergence of dedicated tensor computing units to accelerate multi-dimensional matrix multiplication operations. Although they deploy highly efficient computing architectures, they often lack support for more general-purpose application domains. Such a limitation occurs both due to their consolidated computation scheme (restricted to matrix multiplication) and due to their frequent adoption of low-precision/custom floating-point formats (unsuited for general application domains). In contrast, this paper proposes a new Reconfigurable Tensor Unit (RTU) that deploys an array of variable-precision Vector Multiply-Accumulate (VMA) units. Furthermore, each VMA unit leverages the new Posit floating-point format and supports the full range of standardized posit precisions in a single SIMD unit, with variable vector-element width. Moreover, the proposed RTU exploits the Posit format's features for fused operations, together with spatial and time-multiplexing reconfiguration mechanisms, to fuse and combine multiple VMAs to map high-level and complex operations. The RTU is also supported by an automatic data streaming infrastructure and a pipelined data movement scheme, allowing it to accelerate the computation of most data-parallel patterns commonly present in vectorizable applications. The proposed RTU is shown to outperform state-of-the-art tensor and SIMD units present in off-the-shelf platforms, in turn resulting in significant energy-efficiency improvements.
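For readers unfamiliar with the number format, the sketch below decodes a posit word into its sign, regime, exponent, and fraction fields and converts it to a conventional float. The nbits/es parameters are example values; the RTU's fused operations, streaming infrastructure, and variable vector-element widths are not modelled here.

```python
def decode_posit(word, nbits=16, es=1):
    """Decode an `nbits`-wide posit with `es` exponent bits into a float.

    Illustrative sketch of the posit format (sign, regime, exponent, fraction);
    parameters are assumptions for the example, not the RTU's configuration.
    """
    mask = (1 << nbits) - 1
    word &= mask
    if word == 0:
        return 0.0
    if word == 1 << (nbits - 1):
        return float("nan")                       # NaR ("not a real")
    sign = word >> (nbits - 1)
    if sign:
        word = (-word) & mask                     # negative posits: two's complement
    body = word & ((1 << (nbits - 1)) - 1)        # bits after the sign
    blen = nbits - 1
    first = (body >> (blen - 1)) & 1
    run = 1                                       # length of the regime run
    while run < blen and ((body >> (blen - 1 - run)) & 1) == first:
        run += 1
    k = run - 1 if first == 1 else -run           # regime value
    consumed = run + (1 if run < blen else 0)     # regime bits + terminator (if present)
    remaining = blen - consumed
    tail = body & ((1 << remaining) - 1)
    e_bits = min(es, remaining)
    exponent = (tail >> (remaining - e_bits)) << (es - e_bits) if e_bits else 0
    f_bits = remaining - e_bits
    fraction = tail & ((1 << f_bits) - 1)
    significand = 1.0 + (fraction / (1 << f_bits) if f_bits else 0.0)
    value = significand * 2.0 ** (k * (1 << es) + exponent)
    return -value if sign else value

# Sanity checks against well-known posit8 (es=0) encodings.
for enc, expected in [(0x40, 1.0), (0x60, 2.0), (0x20, 0.5), (0x7F, 64.0)]:
    print(hex(enc), "->", decode_posit(enc, nbits=8, es=0), "(expected", expected, ")")
```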
{"title":"Reconfigurable Stream-based Tensor Unit with Variable-Precision Posit Arithmetic","authors":"Nuno Neves, P. Tomás, N. Roma","doi":"10.1109/ASAP49362.2020.00033","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00033","url":null,"abstract":"The increased adoption of DNN applications drove the emergence of dedicated tensor computing units to accelerate multi-dimensional matrix multiplication operations. Although they deploy highly efficient computing architectures, they often lack support for more general-purpose application domains. Such a limitation occurs both due to their consolidated computation scheme (restricted to matrix multiplication) and due to their frequent adoption of low-precision/custom floating-point formats (unsuited for general application domains). In contrast, this paper proposes a new Reconfigurable Tensor Unit (RTU) which deploys an array of variable-precision Vector MultiplyAccumulate (VMA) units. Furthermore, each VMA unit leverages the new Posit floating-point format and supports the full range of standardized posit precisions in a single SIMD unit, with variable vector-element width. Moreover, the proposed RTU explores the Posit format features for fused operations, together with spatial and time-multiplexing reconfiguration mechanisms to fuse and combine multiple VMAs to map high-level and complex operations. The RTU is also supported by an automatic data streaming infrastructure and a pipelined data movement scheme, allowing it to accelerate the computation of most data-parallel patterns commonly present in vectorizable applications. The proposed RTU showed to outperform state-of-the-art tensor and SIMD units, present in off-the-shelf platforms, in turn resulting in significant energy-efficiency improvements.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126531987","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Array Aware Training/Pruning: Methods for Efficient Forward Propagation on Array-based Neural Network Accelerators
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00016
Krishna Teja Chitty-Venkata, Arun Kumar Somani
Due to the increasing use of large Deep Neural Networks (DNNs) over the years, specialized hardware accelerators such as the Tensor Processing Unit and Eyeriss have been developed to accelerate the forward pass of the network. The essential component of these devices is an array processor composed of multiple individual compute units for efficiently executing Multiplication and Accumulation (MAC) operations. As the size of this array limits the amount of DNN processing of a single layer, the computation is performed serially in several batches, leading to extra compute cycles along both axes. In practice, due to the mismatch between matrix and array sizes, the computation does not map onto the array exactly. In this work, we address the issue of minimizing processing cycles on the array by adjusting the DNN model parameters using a structured, hardware-array-dependent optimization. We introduce two techniques in this paper: Array Aware Training (AAT) for efficient training and Array Aware Pruning (AAP) for efficient inference. Weight pruning is an approach that removes redundant parameters in the network to decrease its size. The key idea behind pruning in this paper is to adjust the model parameters (the weight matrix) so that the array is fully utilized in each computation batch. Our goal is to compress the model based on the size of the array so as to reduce the number of computation cycles. We observe that both proposed techniques achieve accuracy similar to the original network while saving a significant number of processing cycles (75%).
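The sketch below illustrates the tiling arithmetic behind the idea: a weight matrix mapped onto an R x C MAC array needs ceil(rows/R) * ceil(cols/C) serial passes, and partially filled tiles waste cycles. The toy array_aware_prune function drops the lowest-magnitude rows and columns until both dimensions are multiples of the array size; the saliency criterion, the training-time counterpart (AAT), and the array dimensions are illustrative assumptions, not the paper's exact method.

```python
import math
import numpy as np

def compute_batches(rows, cols, array_rows, array_cols):
    """Number of serial passes needed to map a rows x cols weight matrix
    onto an array_rows x array_cols MAC array (one tile per pass)."""
    return math.ceil(rows / array_rows) * math.ceil(cols / array_cols)

def array_aware_prune(weights, array_rows, array_cols):
    """Illustrative structured pruning: drop the lowest-magnitude rows and
    columns so both dimensions become multiples of the array size, removing
    the partially filled tiles. Uses a simple |w|-sum saliency heuristic."""
    rows, cols = weights.shape
    keep_r = (rows // array_rows) * array_rows
    keep_c = (cols // array_cols) * array_cols
    row_keep = np.sort(np.argsort(-np.abs(weights).sum(axis=1))[:keep_r])
    col_keep = np.sort(np.argsort(-np.abs(weights).sum(axis=0))[:keep_c])
    return weights[np.ix_(row_keep, col_keep)]

w = np.random.randn(300, 530)
print("before:", compute_batches(*w.shape, 256, 256), "tiles")   # 2 x 3 = 6 passes
pruned = array_aware_prune(w, 256, 256)
print("after: ", compute_batches(*pruned.shape, 256, 256), "tiles")  # 1 x 2 = 2 passes
```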
{"title":"Array Aware Training/Pruning: Methods for Efficient Forward Propagation on Array-based Neural Network Accelerators","authors":"Krishna Teja Chitty-Venkata, Arun Kumar Somani","doi":"10.1109/ASAP49362.2020.00016","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00016","url":null,"abstract":"Due to the increase in the use of large-sized Deep Neural Networks (DNNs) over the years, specialized hardware accelerators such as Tensor Processing Unit and Eyeriss have been developed to accelerate the forward pass of the network. The essential component of these devices is an array processor which is composed of multiple individual compute units for efficiently executing Multiplication and Accumulation (MAC) operation. As the size of this array limits the amount of DNN processing of a single layer, the computation is performed in several batches serially leading to extra compute cycles along both the axes. In practice, due to the mismatch between matrix and array sizes, the computation does not map on the array exactly. In this work, we address the issue of minimizing processing cycles on the array by adjusting the DNN model parameters by using a structured hardware array dependent optimization. We introduce two techniques in this paper: Array Aware Training (AAT) for efficient training and Array Aware Pruning (AAP) for efficient inference. Weight pruning is an approach to remove redundant parameters in the network to decrease the size of the network. The key idea behind pruning in this paper is to adjust the model parameters (the weight matrix) so that the array is fully utilized in each computation batch. Our goal is to compress the model based on the size of the array so as to reduce the number of computation cycles. We observe that both the proposed techniques results into similar accuracy as the original network while saving a significant number of processing cycles (75%).","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"45 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131713230","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Training Neural Nets using only an Approximate Tableless LNS ALU
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00020
M. Arnold, E. Chester, Corey Johnson
The Logarithmic Number System (LNS) is useful in applications that tolerate approximate computation, such as classification using multi-layer neural networks that compute nonlinear functions of weighted sums of inputs from previous layers. Supervised learning has two phases: training (find appropriate weights for the desired classification) and inference (use the weights with approximate sums of products). Several researchers have observed that LNS ALUs used in inference may minimize area and power by being both low-precision and approximate (allowing low-cost, tableless implementations). However, the few works that have also trained with LNS report that at least part of the system needs accurate LNS. This paper describes a novel approximate LNS ALU, implemented simply as logic (without tables), that enables the entire back-propagation training to occur in LNS at one-third the cost of a fixed-point implementation.
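The sketch below shows where the approximated function sits in LNS arithmetic: multiplication is a plain addition of log-domain values, while addition requires the Gaussian-log function sb(d) = log2(1 + 2^d). Here sb is computed exactly with math.log2 for clarity; the paper's contribution is an approximate, tableless logic implementation of this function (and of the corresponding subtraction case), whose details are not reproduced here. Sign handling for mixed-sign operands is also omitted.

```python
import math

def lns_mul(X, Y):
    """Multiplication in LNS is just addition of the log-domain values."""
    return X + Y

def lns_add(X, Y):
    """Addition of two LNS values X = log2(x), Y = log2(y):
    log2(x + y) = max(X, Y) + sb(-|X - Y|), with sb(d) = log2(1 + 2^d).
    sb() is exact here; a hardware ALU would approximate it with cheap logic."""
    d = -abs(X - Y)
    return max(X, Y) + math.log2(1.0 + 2.0 ** d)

# Weighted-sum step of a neuron, carried out entirely in the log domain.
w, x = [0.5, -1.25, 2.0], [1.5, 0.75, -0.5]   # log2 of weights and activations
acc = lns_mul(w[0], x[0])
for wi, xi in zip(w[1:], x[1:]):
    acc = lns_add(acc, lns_mul(wi, xi))
print("log2(sum of products) =", acc,
      "-> linear:", 2.0 ** acc,
      "check:", sum(2.0 ** (wi + xi) for wi, xi in zip(w, x)))
```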
{"title":"Training Neural Nets using only an Approximate Tableless LNS ALU","authors":"M. Arnold, E. Chester, Corey Johnson","doi":"10.1109/ASAP49362.2020.00020","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00020","url":null,"abstract":"The Logarithmic Number System (LNS) is useful in applications that tolerate approximate computation, such as classification using multi-layer neural networks that compute nonlinear functions of weighted sums of inputs from previous layers. Supervised learning has two phases: training (find appropriate weights for the desired classification), and inference (use the weights with approximate sum of products). Several researchers have observed that LNS ALUs in inference may minimize area and power by being both low-precision and approximate (allowing low-cost, tableless implementations). However, the few works that have also trained with LNS report at least part of the system needs accurate LNS. This paper describes a novel approximate LNS ALU implemented simply as logic (without tables) that enables the entire back-propagation training to occur in LNS, at one-third the cost of fixed-point implementation.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131128834","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Sharing in Multi-accelerators of Neural Networks on an FPGA Edge Device
Pub Date: 2020-07-01 | DOI: 10.1109/ASAP49362.2020.00040
Hsin-Yu Ting, Tootiya Giyahchi, A. A. Sani, E. Bozorgzadeh
Edge computing can potentially provide abundant processing resources for compute-intensive applications while bringing services close to end devices. With the increasing demand for computing acceleration at the edge, FPGAs have been deployed to provide custom deep neural network accelerators. This paper explores a DNN accelerator sharing system on an edge FPGA device that serves various DNN applications from multiple end devices simultaneously. The proposed SharedDNN/PlanAhead policy exploits the regularity among requests for various DNN accelerators and determines which accelerator to allocate to each request, and in what order to respond to the requests, so as to achieve maximum responsiveness for a queue of acceleration requests. Our results show an overall performance gain of up to 2.20x and improved utilization, reducing DNN library usage by up to 27%, while staying within the requests' requirements and resource constraints.
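As a loose illustration of the scheduling problem, the sketch below orders a queue of acceleration requests and maps each one to a compatible accelerator, skipping requests whose responsiveness requirement can no longer be met. It uses a simple earliest-deadline-first order and earliest-available-accelerator choice under an assumed per-request execution time; the actual SharedDNN/PlanAhead policy additionally exploits regularity among requests and accounts for DNN-library reuse and reconfiguration cost, so all names and parameters here are hypothetical.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Request:
    app: str            # which DNN the end device wants to run
    deadline: float     # responsiveness requirement (ms)

@dataclass
class Accelerator:
    supports: List[str]           # DNN variants this accelerator can serve
    busy_until: float = 0.0       # time at which it becomes free again

def plan_ahead(queue, accels, exec_ms=5.0):
    """Toy admission/ordering policy: earliest deadline first, mapped to the
    compatible accelerator that can finish earliest; requests that would miss
    their deadline are skipped."""
    schedule = []
    for req in sorted(queue, key=lambda r: r.deadline):
        candidates = [(i, a) for i, a in enumerate(accels) if req.app in a.supports]
        if not candidates:
            continue                                      # no compatible accelerator configured
        idx, best = min(candidates, key=lambda ia: ia[1].busy_until)
        finish = best.busy_until + exec_ms
        if finish <= req.deadline:                        # admit only if responsive enough
            best.busy_until = finish
            schedule.append((req.app, idx, finish))
    return schedule

accels = [Accelerator(supports=["resnet", "mobilenet"]), Accelerator(supports=["mobilenet"])]
queue = [Request("mobilenet", 12.0), Request("resnet", 8.0), Request("mobilenet", 25.0)]
for app, acc_id, finish in plan_ahead(queue, accels):
    print(f"{app:9s} -> accelerator {acc_id}, finishes at t={finish:.1f} ms")
```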
{"title":"Dynamic Sharing in Multi-accelerators of Neural Networks on an FPGA Edge Device","authors":"Hsin-Yu Ting, Tootiya Giyahchi, A. A. Sani, E. Bozorgzadeh","doi":"10.1109/ASAP49362.2020.00040","DOIUrl":"https://doi.org/10.1109/ASAP49362.2020.00040","url":null,"abstract":"Edge computing can potentially provide abundant processing resources for compute-intensive applications while bringing services close to end devices. With the increasing demands for computing acceleration at the edge, FPGAs have been deployed to provide custom deep neural network accelerators. This paper explores a DNN accelerator sharing system at the edge FPGA device, that serves various DNN applications from multiple end devices simultaneously. The proposed SharedDNN/PlanAhead policy exploits the regularity among requests for various DNN accelerators and determines which accelerator to allocate for each request and in what order to respond to the requests that achieve maximum responsiveness for a queue of acceleration requests. Our results show overall 2. 20x performance gain at best and utilization improvement by reducing up to 27% of DNN library usage while staying within the requests’ requirements and resource constraints.","PeriodicalId":375691,"journal":{"name":"2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124278105","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}