Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00037
Moritz Bärthel, Jochen Rust, S. Paul
Sets-Of-Real-Numbers (SORN) arithmetic derives from the Unum type-II number format and provides high-throughput, low-complexity computation at the cost of very coarse resolution. This work presents a symbol detection unit for MIMO transmission with BPSK modulation. The unit consists of a SORN preprocessor that applies an exhaustive search method and a fixed-point module that processes the results from the SORN unit. Different SORN datatypes and hardware configurations are considered and evaluated through BER simulations and post-synthesis analyses.
Title: Combining Fixed-Point and SORN Arithmetic in a MIMO BPSK-Symbol Detection Architecture. Venue: 2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP).
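The exhaustive search named in the abstract can be illustrated with a minimal sketch. It assumes a real-valued channel matrix and ordinary floating-point arithmetic, not SORN arithmetic, and the function name, channel values, and noise values are hypothetical. It enumerates every BPSK candidate vector and keeps the one closest to the received signal.

```python
from itertools import product

def detect_bpsk_exhaustive(H, y):
    """Return (x, metric) where x in {-1, +1}^n minimises ||y - H x||^2.

    H is a list of rows (receive antennas), y the received vector.
    Plain floats are used here; this is not SORN arithmetic.
    """
    n_tx = len(H[0])
    best_x, best_metric = None, float("inf")
    # BPSK has two symbols per antenna, so there are 2**n_tx candidates.
    for candidate in product((-1, 1), repeat=n_tx):
        metric = 0.0
        for row, y_i in zip(H, y):
            estimate = sum(h * x for h, x in zip(row, candidate))
            metric += (y_i - estimate) ** 2
        if metric < best_metric:
            best_x, best_metric = candidate, metric
    return best_x, best_metric

# Hypothetical 2x2 channel with a small additive disturbance.
H = [[0.9, 0.2], [-0.3, 1.1]]
sent = (1, -1)
noise = (0.05, -0.04)
y = [sum(h * x for h, x in zip(row, sent)) + n for row, n in zip(H, noise)]
x_hat, metric = detect_bpsk_exhaustive(H, y)
```

The candidate count grows as 2^n with the number of transmit antennas, which is one reason a cheap, coarse preprocessing stage ahead of a more precise module can be attractive.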
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00014
Shuhei Yoshida, Yuta Ukon, S. Ohteru, H. Uzawa, N. Ikeda, K. Nitta
Network microbursts, which are bursts of traffic on a sub-millisecond time scale, have attracted attention because they cause network delay and packet loss. However, analyzing their causes poses two problems: how to capture the packets included in a microburst and how to identify the flows causing it. To resolve these problems, we propose a field-programmable gate array (FPGA)-based microburst analysis system. The system detects microbursts with dedicated hardware at sub-millisecond time resolution. It captures only the packets before and after a microburst detection, which is triggered by comparing the whole traffic against a static threshold. In addition, it identifies the flows causing microbursts by applying a dynamic threshold to each flow. Experimental results based on network trace data from a datacenter show that the proposed system captures only packets before and after microburst detection and correctly identifies the flow causing microbursts, even in a network with fluctuating bandwidth usage under practical traffic conditions. The proposed system is implemented on an Intel® PAC with Arria® 10 GX FPGA and consumes relatively small amounts of hardware resources: 51% of ALMs, 16% of registers, and 57% of block memories.
Title: FPGA-Based Network Microburst Analysis System with Flow Specification and Efficient Packet Capturing
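The two-threshold idea in the abstract (a static threshold on total traffic, a dynamic threshold per flow) can be sketched in software. The abstract does not say how the dynamic threshold is computed, so this sketch assumes a per-flow moving average scaled by a factor. The function name, window length, and all parameter values are hypothetical and do not describe the authors' hardware.

```python
from collections import defaultdict

def find_microbursts(packets, window_us=100, static_threshold=10_000,
                     factor=3.0, floor_bytes=1_000, alpha=0.25):
    """packets: iterable of (timestamp_us, flow_id, size_bytes).

    Returns a list of (window_index, total_bytes, suspect_flows).
    """
    # Count bytes per time window and per flow.
    windows = defaultdict(lambda: defaultdict(int))
    for ts_us, flow, size in packets:
        windows[ts_us // window_us][flow] += size

    baseline = {}  # assumed dynamic reference: per-flow moving average
    reports = []
    for w in sorted(windows):  # windows without packets are skipped
        per_flow = windows[w]
        total = sum(per_flow.values())
        if total > static_threshold:  # static threshold on whole traffic
            suspects = sorted(
                flow for flow, seen in per_flow.items()
                if seen > max(factor * baseline.get(flow, 0.0), floor_bytes)
            )
            reports.append((w, total, suspects))
        # Update every known flow's average, including flows that were silent.
        for flow in set(baseline) | set(per_flow):
            seen = per_flow.get(flow, 0)
            if flow in baseline:
                baseline[flow] = (1 - alpha) * baseline[flow] + alpha * seen
            else:
                baseline[flow] = float(seen)
    return reports

# Hypothetical trace: flow A is steady, flow B bursts in window 3.
packets = []
for w in range(3):
    packets.append((w * 100 + 10, "A", 1000))
    packets.append((w * 100 + 20, "B", 200))
packets.append((310, "A", 1000))
packets.extend((320 + i, "B", 1500) for i in range(8))
reports = find_microbursts(packets)
```

In this trace the heavier but steady flow A is not flagged, while flow B is, because each flow is compared against its own recent history instead of a single global limit.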
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00015
Seongyoung Kang, Jinyeong Moon, S. Jun
We present a case for FPGA-accelerated edge processing for low-power Internet-of-Things (IoT) devices, using time series similarity search as a driving application. As the data collection capabilities of low-power IoT devices increase, the primary constraint on their capacity is becoming the resource cost of wirelessly transferring collected data to a central repository. This work addresses this limitation by augmenting the IoT device with an inexpensive, power-efficient FPGA accelerator, which can perform fairly complex edge mining operations and drastically reduce the wireless data transfer requirements. This approach reduces the total power consumption of the device despite the added FPGA component, while also reducing the computation requirements at the central server. We use the Dynamic Time Warping (DTW) algorithm as an example workload. Using a low-cost Lattice iCE40 UltraPlus FPGA, we demonstrate that the FPGA-augmented mining algorithm can both support a significantly higher data collection rate and improve the computation power efficiency of the entire deployment by an order of magnitude.
Title: FPGA-Accelerated Time Series Mining on Low-Power IoT Devices
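For readers unfamiliar with the example workload, the following is a minimal software sketch of the textbook Dynamic Time Warping recurrence. It assumes an absolute-difference point cost and no warping-window constraint; it is not the authors' FPGA design, and the function name is hypothetical.

```python
def dtw_distance(a, b):
    """Classic DTW distance between two numeric sequences.

    Only two rows of the cost matrix are kept, so memory is O(len(b)).
    """
    n, m = len(a), len(b)
    inf = float("inf")
    prev = [0.0] + [inf] * m          # row 0: only the empty alignment costs 0
    for i in range(1, n + 1):
        curr = [inf] * (m + 1)
        for j in range(1, m + 1):
            cost = abs(a[i - 1] - b[j - 1])
            # Extend the cheapest of: insertion, deletion, match.
            curr[j] = cost + min(prev[j], curr[j - 1], prev[j - 1])
        prev = curr
    return prev[m]

same = dtw_distance([1, 2, 3], [1, 2, 3])
stretched = dtw_distance([1, 2, 3], [1, 2, 2, 3])
shifted = dtw_distance([0, 0], [1, 1])
```

Because DTW aligns sequences elastically, a series and a time-stretched copy of it have distance zero, which is what makes the measure useful for similarity search on sensor data.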
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00017
Zhigang Wei, Aman Arora, P. Patel, L. John
Deep Neural Networks (DNNs) are crucial components of machine learning in the big data era. Significant effort has been put into the hardware acceleration of convolution and fully-connected layers of neural networks, while comparatively little attention has been paid to the Softmax layer. Softmax is used in terminal classification layers in networks like ResNet, and is also used in intermediate layers in networks like the Transformer. As the speed of other DNN layers keeps improving, efficient and flexible designs for Softmax are required. Because Softmax can be implemented in hardware in several ways, we evaluate various softmax hardware designs and the trade-offs between them. To make the design space exploration more efficient, we also develop a parameterized generator which can produce softmax designs by varying multiple aspects of a base architecture. The aspects, or knobs, are parallelism, accuracy, storage and precision. The goal of the generator is to enable evaluation of trade-offs between area, delay, power and accuracy in the architecture of a softmax unit. We simulate and synthesize the generated designs and present results comparing them with the existing state of the art. Our exploration reveals that the design with a parallelism of 16 provides the best area-delay product among designs with parallelism ranging from 1 to 32. We also observe that look-up-table-based approximate LOG and EXP units yield almost the same accuracy as the full LOG and EXP units, while providing area and energy benefits. Additionally, providing local registers for intermediate values is seen to provide energy savings.
Title: Design Space Exploration for Softmax Implementations
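The abstract mentions LOG and EXP units without giving the datapath. One common formulation that uses both is the log-domain softmax, which replaces the final division by a logarithm and a subtraction. The sketch below compares it with the direct form; whether the generated designs use exactly this formulation is an assumption, and both function names are hypothetical.

```python
import math

def softmax_direct(xs):
    """Numerically stable softmax: subtract the maximum, exponentiate, divide."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def softmax_log_domain(xs):
    """Same result without a divider: exp(x - max - log(sum(exp(x - max))))."""
    m = max(xs)
    log_sum = math.log(sum(math.exp(x - m) for x in xs))
    return [math.exp(x - m - log_sum) for x in xs]

scores = [2.0, 1.0, 0.1, -3.0]
direct = softmax_direct(scores)
log_domain = softmax_log_domain(scores)
```

The two functions agree to floating-point rounding. In hardware, the log-domain form trades a divider for a LOG unit and a second EXP pass, which is the kind of area, delay, and accuracy trade-off a design space exploration would weigh.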
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00012
Matthew Naylor, S. Moore, A. Mokhov, David B. Thomas, J. Beaumont, Shane T. Fleming, A. T. Markettos, Thomas Bytheway, Andrew D. Brown
Barrier primitives provided by standard parallel programming APIs are the primary means by which applications implement global synchronisation. Typically these primitives are fully committed to synchronisation in the sense that, once a barrier is entered, synchronisation is the only way out. For message-passing applications, this raises the question of what happens when a message arrives at a thread that already resides in a barrier. Without a satisfactory answer, barriers do not interact with message-passing in any useful way. In this paper, we propose a new refutable barrier primitive that combines with message-passing to form a simple, expressive, efficient, well-defined API. It has clear semantics based on termination detection, and supports the development of both globally-synchronous and asynchronous parallel applications. To evaluate the new primitive, we implement it in a prototype large-scale message-passing machine with 49,152 RISC-V threads distributed over 48 FPGAs. We show that hardware support for the primitive leads to a highly efficient implementation, capable of synchronisation rates that are an order of magnitude higher than what is achievable in software. Using the primitive, we implement synchronous and asynchronous versions of a range of applications, observing that each version can have significant advantages over the other, depending on the application. Therefore, a barrier primitive supporting both styles can greatly assist the development of parallel programs.
Title: Termination detection for fine-grained message-passing architectures
Pub Date: 2020-07-01 DOI: 10.1109/asap49362.2020.00007
Lars Bauer, H. Blume, Byeong Kil Lee, T. Risset, M. Santambrogio
Kubilay Atasu, IBM Research – Zurich, Switzerland
Jason Bakos, University of South Carolina, USA
Lars Bauer, KIT, Germany
Holger Blume, Leibniz Universität Hannover, Germany
Christophe Bobda, University of Florida, USA
Benjamin Carrion Schaefer, The University of Texas at Dallas, USA
Anupam Chattopadhyay, Nanyang Technological University, Singapore
Sudipta Chattopadhyay, Singapore University of Technology and Design, Singapore
Thomas Chau, Samsung AI Centre, UK
Xiang Chen, George Mason University, USA
Zhe Chen, University of California, Los Angeles, USA
Ray Cheung, City University of Hong Kong, Hong Kong
Adrian Cristal, Barcelona Supercomputing Center, Spain
Steven Derrien, University of Rennes, France
Somdip Dey, University of Essex, UK
Ken Eguro, Microsoft, USA
Zhenman Fang, Simon Fraser University, Canada
Diana Göhringer, TU Dresden, Germany
Ann Gordon-Ross, University of Florida, USA
Xinfei Guo, NVIDIA, University of Virginia, USA
Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg, Germany
Yuko Hara-Azumi, Tokyo Institute of Technology, Japan
Martin Herbordt, Boston University, USA
H. Peter Hofstee, IBM Austin, USA
Ruirui (Raymond) Huang, Alibaba Cloud, USA
Paolo Ienne, EPFL, Switzerland
Sang-Woo Jun, University of California, Irvine, USA
Ryan Kastner, University of California, San Diego, USA
Dirk Koch, The University of Manchester, UK
Herman Lam, University of Florida, USA
Byeong Kil Lee, University of Colorado, USA
Jingwen Leng, Shanghai Jiao Tong University, China
Yun (Eric) Liang, Peking University, China
Xue (Shelley) Lin, Northeastern University, USA
Weiqiang Liu, Nanjing University of Aeronautics and Astronautics, China
Iakovos Mavroidis, FORTH, Greece
Nele Mentens, KU Leuven, Belgium
Simon Moore, University of Cambridge, UK
Roger Moussalli, Two Sigma, USA
Jean-Michel Muller, CNRS, Laboratoire LIP, France
Walid A. Najjar, University of California, Riverside, USA
Javier Navaridas, The University of Manchester, UK
Seda Ogrenci Memik, Northwestern University, USA
Marco Platzner, University of Paderborn, Germany
Viktor K. Prasanna, University of Southern California, USA
Sanjay Rajopadhye, Colorado State University, USA
Title: ASAP 2020 Committees
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00013
Riadh Ben Abdelhamid, Y. Yamaguchi, T. Boku
General-purpose processors offer the best programming flexibility to address a wide range of problems. Nonetheless, they still lag behind special-purpose processors when it comes to sustained computational performance. Here, we take the best of both worlds and propose a flexible, highly scalable, high-performance computing architecture designed with versatility in mind. The proposed architecture, code-named DRAGON, benefits from several forms of parallelism such as SIMD, VLIW, memory broadcasting and even vector processing.
Title: Condensing an overload of parallel computing ingredients into a single architecture recipe
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00035
Qian Xu, Guowei Sun, G. Qu
Approximate computing is a promising technique for improving the energy efficiency of error-resilient applications such as multimedia, signal processing and neural networks. A large amount of reported work concerns the design of approximate computation units that operate on truncated data under error constraints. However, this work mainly focuses on simple arithmetic operations, specifically addition and multiplication. In this paper, we study how to apply the truncation method to the floating-point logarithmic operation, which is becoming increasingly popular. We analyze the trade-off between the precision of a computation and the energy it requires, and derive a formula for the most energy-efficient implementation of the logarithm unit for a given error variance range. Based on this theoretical result, we propose BWOLF (Bit-Width Optimization for Logarithmic Function), which uses a sequential quadratic programming algorithm to determine how to truncate data (i.e., bit-width optimization) in a program with logarithm and other arithmetic operations such that the energy consumption is minimized under a fixed error budget. We evaluate the energy savings of BWOLF on two widely used applications: Kullback-Leibler divergence and Bayesian neural networks. The experimental results validate the correctness of our analysis and show significant energy savings over both full-precision computation and the uniform truncation method. The energy savings range from 27.18% to 95.92% for different error constraints.
Title: BWOLF: Bit-Width Optimization for Statistical Divergence with Logarithmic Functions
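The effect of mantissa truncation on a Kullback-Leibler divergence can be made concrete with a small sketch. It only emulates truncation in software by zeroing low-order mantissa bits of IEEE 754 double-precision values; it does not implement the BWOLF optimizer or its energy model, and the function names and the choice of where to truncate are assumptions.

```python
import math
import struct

def truncate_mantissa(x, kept_bits):
    """Zero all but the top `kept_bits` of the 52-bit float64 mantissa."""
    drop = 52 - kept_bits
    bits = struct.unpack("<Q", struct.pack("<d", x))[0]
    bits &= ~((1 << drop) - 1)        # clear the low `drop` mantissa bits
    return struct.unpack("<d", struct.pack("<Q", bits))[0]

def kl_divergence(p, q, kept_bits=52):
    """KL(p || q) with the log input and output truncated to `kept_bits`."""
    total = 0.0
    for p_i, q_i in zip(p, q):
        if p_i > 0.0:                 # terms with p_i == 0 contribute nothing
            ratio = truncate_mantissa(p_i / q_i, kept_bits)
            log_term = truncate_mantissa(math.log(ratio), kept_bits)
            total += p_i * log_term
    return total

p = [0.5, 0.3, 0.2]
q = [0.4, 0.4, 0.2]
exact = kl_divergence(p, q)                 # full 52-bit mantissa
coarse = kl_divergence(p, q, kept_bits=8)   # aggressive truncation
error = abs(coarse - exact)
```

Sweeping `kept_bits` and recording `error` gives the accuracy side of the precision-versus-energy trade-off; choosing a different width per operation under an error budget is the optimization problem the abstract describes.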
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00022
Cristian Sestito, F. Spagnolo, P. Corsonello, S. Perri
This paper presents an efficient hardware architecture that performs 2D dilated convolutions and is suitable for integration within modern heterogeneous embedded systems targeting semantic image segmentation. The proposed design supports multiple dilation rates. Moreover, it uses limited amounts of resources even when large convolution windows are processed. As a case study, the novel circuit has been integrated within a Xilinx Zynq-7000 FPSoC device to accelerate a state-of-the-art CNN model for medical image segmentation. The results demonstrate higher computational capability, reduced resource utilization and lower power consumption than competing designs in the literature.
Title: An Efficient Convolution Engine based on the À-trous Spatial Pyramid Pooling
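A 2D dilated (à-trous) convolution can be written in a few lines of software, which may help when reading the abstract. The sketch assumes the CNN convention (cross-correlation, no kernel flip), no padding, and stride 1. It is a reference model only, not the proposed circuit, and the function name is hypothetical.

```python
def dilated_conv2d(image, kernel, dilation=1):
    """Valid-mode 2D dilated convolution on lists of lists.

    Kernel taps are spaced `dilation` pixels apart, so a k x k kernel
    covers a window of side dilation * (k - 1) + 1.
    """
    k_h, k_w = len(kernel), len(kernel[0])
    span_h = dilation * (k_h - 1) + 1
    span_w = dilation * (k_w - 1) + 1
    out_h = len(image) - span_h + 1
    out_w = len(image[0]) - span_w + 1
    output = []
    for r in range(out_h):
        row = []
        for c in range(out_w):
            acc = 0
            for i in range(k_h):
                for j in range(k_w):
                    acc += kernel[i][j] * image[r + i * dilation][c + j * dilation]
            row.append(acc)
        output.append(row)
    return output

# 5x5 test image holding the values 0..24, and a 3x3 kernel of ones.
image = [[r * 5 + c for c in range(5)] for r in range(5)]
ones = [[1, 1, 1], [1, 1, 1], [1, 1, 1]]
dense = dilated_conv2d(image, ones, dilation=1)
sparse = dilated_conv2d(image, ones, dilation=2)
```

With a dilation of 2 the same 3x3 kernel spans a 5x5 window while still performing nine multiplications per output, which is how dilation enlarges the receptive field without enlarging the kernel.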
Pub Date: 2020-07-01 DOI: 10.1109/ASAP49362.2020.00021
Somdip Dey, A. Singh, D. Prasad, K. Mcdonald-Maier
This paper proposes a novel human-inspired methodology called IRON-MAN (Integrated RatiONal prediction and Motionless ANalysis of videos) for mobile multi-processor systems-on-chips (MPSoCs). The methodology integrates the analysis of previous image frames of a video to represent the analysis of the current frame, in order to perform Temporal Motionless Analysis of the Video (TMAV). This is the first work on TMAV using a Convolutional Neural Network (CNN) for scene prediction in MPSoCs. Experimental results show that our methodology outperforms the state of the art. We also introduce a metric named Energy Consumption per Training Image (ECTI) to assess the suitability of using a CNN model in mobile MPSoCs, with a focus on the energy consumption of the device.
Title: Temporal Motionless Analysis of Video using CNN in MPSoC