
2020 IEEE 31st International Conference on Application-specific Systems, Architectures and Processors (ASAP): Latest Papers

Combining Fixed-Point and SORN Arithmetic in a MIMO BPSK-Symbol Detection Architecture
Moritz Bärthel, Jochen Rust, S. Paul
Sets-Of-Real-Numbers (SORN) arithmetic derives from the Unum type-II number format and provides high-throughput, low-complexity computation at the cost of a very coarse resolution. This work presents a symbol detection unit for MIMO transmission with BPSK modulation, consisting of a SORN preprocessor that applies an exhaustive search method and a fixed-point module that processes the results from the SORN unit. Different SORN datatypes and hardware configurations are considered and evaluated through BER simulations and post-synthesis analyses.
DOI: https://doi.org/10.1109/ASAP49362.2020.00037
Citations: 0
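As a rough illustration of the exhaustive search mentioned in the abstract above, the sketch below enumerates every BPSK symbol vector and keeps the one closest to the received signal. It is a plain floating-point maximum-likelihood search and does not model SORN arithmetic or the paper's fixed-point stage; the function name `ml_detect_bpsk`, the real-valued channel matrix `H`, and the example values are all assumptions made for illustration.

```python
from itertools import product


def ml_detect_bpsk(H, y):
    """Exhaustive search over x in {-1, +1}^n minimizing ||y - H x||^2.

    H is a list of rows (real-valued channel matrix, an assumption),
    y is the received vector. Returns the best symbol vector as a tuple.
    """
    n = len(H[0])
    best, best_metric = None, float("inf")
    for x in product((-1, 1), repeat=n):  # all 2^n BPSK candidates
        metric = 0.0
        for row, yi in zip(H, y):
            estimate = sum(h * xi for h, xi in zip(row, x))
            metric += (yi - estimate) ** 2
        if metric < best_metric:
            best, best_metric = x, metric
    return best


# Hypothetical 2x2 channel; the transmitted vector (1, -1) gives y = [0.5, -0.8],
# and a slightly perturbed observation is still detected correctly.
H = [[1.0, 0.5], [0.2, 1.0]]
print(ml_detect_bpsk(H, [0.6, -0.7]))
```

The search cost grows as 2^n in the number of transmit antennas, which is one reason a low-complexity preprocessor is attractive in this setting.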
FPGA-Based Network Microburst Analysis System with Flow Specification and Efficient Packet Capturing
Shuhei Yoshida, Yuta Ukon, S. Ohteru, H. Uzawa, N. Ikeda, K. Nitta
Network microbursts, which are sub-millisecond bursts of traffic, have attracted attention because they cause network delay and packet loss. However, analyzing the causes of microbursts poses two problems: how to capture the packets included in a microburst and how to identify the flows causing it. To resolve these problems, we propose a field-programmable gate array (FPGA)-based microburst analysis system. This system detects microbursts with dedicated hardware at sub-millisecond time resolution. It captures only the packets before and after a microburst detection, which is triggered by a static threshold on the whole traffic. In addition, it identifies the flows causing microbursts by applying a dynamic threshold to each flow. Experimental results based on network trace data from a datacenter show that the proposed system captures only packets before and after microburst detection and correctly identifies the flow causing microbursts, even in a network with fluctuating bandwidth usage under practical traffic conditions. The proposed system is implemented on an Intel® PAC with Arria® 10 GX FPGA and consumes a relatively small amount of hardware resources: 51% of ALMs, 16% of registers, and 57% of block memories.
DOI: https://doi.org/10.1109/ASAP49362.2020.00014
Citations: 1
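The two-threshold idea in the abstract above can be sketched in software as follows. The static rule (total bytes per time window above a fixed limit) follows the abstract directly; the dynamic per-flow rule shown here (a flow is flagged when its bytes in the burst window exceed a multiple of its own earlier average) is only an assumed stand-in, since the abstract does not say how the dynamic threshold is computed. All function names, the packet tuple layout, and the `factor` parameter are hypothetical.

```python
from collections import defaultdict


def detect_microbursts(packets, window_us, static_threshold_bytes):
    """Return sorted window indices whose total bytes exceed the static threshold.

    packets: iterable of (timestamp_us, flow_id, size_bytes) tuples.
    """
    totals = defaultdict(int)
    for ts, _flow, size in packets:
        totals[ts // window_us] += size
    return sorted(w for w, b in totals.items() if b > static_threshold_bytes)


def flows_over_dynamic_threshold(packets, window_us, burst_window, factor=2.0):
    """Flag flows whose bytes in burst_window exceed factor x their earlier mean.

    The 'factor x historical mean' rule is an illustrative assumption.
    """
    per_flow = defaultdict(lambda: defaultdict(int))  # flow -> window -> bytes
    for ts, flow, size in packets:
        per_flow[flow][ts // window_us] += size
    flagged = []
    for flow, windows in per_flow.items():
        history = [b for w, b in windows.items() if w < burst_window]
        if not history:
            continue  # no baseline available for this flow
        baseline = sum(history) / len(history)
        if windows.get(burst_window, 0) > factor * baseline:
            flagged.append(flow)
    return sorted(flagged)


# Synthetic trace: flow "b" bursts in the third 100-microsecond window.
trace = [(0, "a", 100), (10, "b", 100), (100, "a", 100), (110, "b", 100),
         (200, "a", 100), (210, "b", 900), (220, "b", 900)]
bursts = detect_microbursts(trace, window_us=100, static_threshold_bytes=1000)
print(bursts, flows_over_dynamic_threshold(trace, 100, bursts[0]))
```

A hardware implementation would keep these counters in on-chip memory and update them per packet rather than scanning a stored trace; the sketch only shows the decision logic.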
FPGA-Accelerated Time Series Mining on Low-Power IoT Devices
Seongyoung Kang, Jinyeong Moon, S. Jun
We present a case for FPGA-accelerated edge processing for low-power Internet-of-Things (IoT) devices, using time series similarity search as a driving application. As the data collection capabilities of low-power IoT devices increase, the primary constraint on their capacity is becoming the resource requirements of wirelessly transferring collected data to a central repository. This work presents a solution to this limitation by augmenting the IoT device with an inexpensive, power-efficient FPGA accelerator, which can perform fairly complex edge mining operations and drastically reduce the wireless data transfer requirements. This approach reduces the total power consumption of the device despite the added FPGA component, while also reducing the computation requirements at the central server. We use the Dynamic Time Warping (DTW) algorithm as an example workload. Using a low-cost Lattice iCE40 UltraPlus FPGA, we demonstrate that the FPGA-augmented mining algorithm can both support a significantly higher data collection rate and improve the computation power efficiency of the entire deployment by an order of magnitude.
DOI: https://doi.org/10.1109/ASAP49362.2020.00015
Citations: 7
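For readers unfamiliar with the DTW workload named in the abstract above, the following is a minimal software sketch of the textbook dynamic-programming formulation, assuming an absolute-difference local cost and no warping-window constraint. It is not the paper's FPGA design; the function name `dtw_distance` is chosen here for illustration.

```python
def dtw_distance(a, b):
    """Classic O(len(a) * len(b)) dynamic time warping distance.

    cost[i][j] holds the cheapest cumulative cost of aligning a[:i] with b[:j].
    """
    inf = float("inf")
    n, m = len(a), len(b)
    cost = [[inf] * (m + 1) for _ in range(n + 1)]
    cost[0][0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])  # local cost (assumed metric)
            cost[i][j] = d + min(cost[i - 1][j],      # a[i-1] repeats
                                 cost[i][j - 1],      # b[j-1] repeats
                                 cost[i - 1][j - 1])  # both advance
    return cost[n][m]


# A stretched copy of a series aligns at zero cost.
print(dtw_distance([1, 2, 3], [1, 2, 2, 3]))  # 0.0
print(dtw_distance([0, 0], [1, 1]))           # 2.0
```

Each cell depends only on its three neighbours, which is what makes the recurrence a natural fit for a small streaming accelerator: only the previous row needs to be kept.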
Design Space Exploration for Softmax Implementations
Zhigang Wei, Aman Arora, P. Patel, L. John
Deep Neural Networks (DNNs) are crucial components of machine learning in the big data era. Significant effort has been put into the hardware acceleration of the convolution and fully-connected layers of neural networks, while comparatively little attention has been paid to the Softmax layer. Softmax is used in terminal classification layers in networks like ResNet, and also in intermediate layers in networks like the Transformer. As the speed of other DNN layers keeps improving, efficient and flexible designs for Softmax are required. Because there are several ways to implement Softmax in hardware, we evaluate various Softmax hardware designs and the trade-offs between them. To make the design space exploration more efficient, we also develop a parameterized generator that can produce Softmax designs by varying multiple aspects of a base architecture. The aspects, or knobs, are parallelism, accuracy, storage and precision. The goal of the generator is to enable evaluation of trade-offs between area, delay, power and accuracy in the architecture of a Softmax unit. We simulate and synthesize the generated designs and present results comparing them with the existing state of the art. Our exploration reveals that the design with a parallelism of 16 provides the best area-delay product among designs with parallelism ranging from 1 to 32. We also observe that look-up-table-based approximate LOG and EXP units yield almost the same accuracy as the full LOG and EXP units while providing area and energy benefits. Additionally, providing local registers for intermediate values is seen to save energy.
DOI: https://doi.org/10.1109/ASAP49362.2020.00017
Citations: 13
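The abstract above mentions LOG and EXP units. One common way to build Softmax from exactly those two operations is the log-domain form softmax(x)_i = exp(x_i - max(x) - ln(sum_j exp(x_j - max(x)))), which replaces the division with a subtraction. The sketch below shows that formulation in floating point; whether the paper's base architecture uses this exact decomposition is an assumption, and the function name is hypothetical.

```python
import math


def softmax_log_domain(xs):
    """Softmax computed with only max, subtract, EXP and LOG (no division)."""
    m = max(xs)                        # subtract the max for numerical stability
    shifted = [x - m for x in xs]
    log_sum = math.log(sum(math.exp(s) for s in shifted))  # one LOG per vector
    return [math.exp(s - log_sum) for s in shifted]        # one EXP per element


probs = softmax_log_domain([2.0, 1.0, 0.1])
print(probs, sum(probs))
```

Subtracting the maximum keeps every EXP input at or below zero, so the EXP output stays in (0, 1]; this bounded range is what makes a look-up-table approximation of the kind the abstract describes plausible.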
Termination detection for fine-grained message-passing architectures
Matthew Naylor, S. Moore, A. Mokhov, David B. Thomas, J. Beaumont, Shane T. Fleming, A. T. Markettos, Thomas Bytheway, Andrew D. Brown
Barrier primitives provided by standard parallel programming APIs are the primary means by which applications implement global synchronisation. Typically these primitives are fully committed to synchronisation in the sense that, once a barrier is entered, synchronisation is the only way out. For message-passing applications, this raises the question of what happens when a message arrives at a thread that already resides in a barrier. Without a satisfactory answer, barriers do not interact with message-passing in any useful way. In this paper, we propose a new refutable barrier primitive that combines with message-passing to form a simple, expressive, efficient, well-defined API. It has a clear semantics based on termination detection, and supports the development of both globally-synchronous and asynchronous parallel applications. To evaluate the new primitive, we implement it in a prototype large-scale message-passing machine with 49,152 RISC-V threads distributed over 48 FPGAs. We show that hardware support for the primitive leads to a highly-efficient implementation, capable of synchronisation rates that are an order-of-magnitude higher than what is achievable in software. Using the primitive, we implement synchronous and asynchronous versions of a range of applications, observing that each version can have significant advantages over the other, depending on the application. Therefore, a barrier primitive supporting both styles can greatly assist the development of parallel programs.
DOI: https://doi.org/10.1109/ASAP49362.2020.00012
Citations: 6
ASAP 2020 Committees
Lars Bauer, H. Blume, Byeong Kil Lee, T. Risset, M. Santambrogio
Kubilay Atasu, IBM Research – Zurich, Switzerland; Jason Bakos, University of South Carolina, USA; Lars Bauer, KIT, Germany; Holger Blume, Leibniz Universität Hannover, Germany; Christophe Bobda, University of Florida, USA; Benjamin Carrion Schaefer, The University of Texas at Dallas, USA; Anupam Chattopadhyay, Nanyang Technological University, Singapore; Sudipta Chattopadhyay, Singapore University of Technology and Design, Singapore; Thomas Chau, Samsung AI Centre, UK; Xiang Chen, George Mason University, USA; Zhe Chen, University of California, Los Angeles, USA; Ray Cheung, City University of Hong Kong, Hong Kong; Adrian Cristal, Barcelona Supercomputing Center, Spain; Steven Derrien, University of Rennes, France; Somdip Dey, University of Essex, UK; Ken Eguro, Microsoft, USA; Zhenman Fang, Simon Fraser University, Canada; Diana Göhringer, TU Dresden, Germany; Ann Gordon-Ross, University of Florida, USA; Xinfei Guo, NVIDIA, University of Virginia, USA; Frank Hannig, Friedrich-Alexander University Erlangen-Nürnberg, Germany; Yuko Hara-Azumi, Tokyo Institute of Technology, Japan; Martin Herbordt, Boston University, USA; H. Peter Hofstee, IBM Austin, USA; Ruirui (Raymond) Huang, Alibaba Cloud, USA; Paolo Ienne, EPFL, Switzerland; Sang-Woo Jun, University of California, Irvine, USA; Ryan Kastner, University of California, San Diego, USA; Dirk Koch, The University of Manchester, UK; Herman Lam, University of Florida, USA; Byeong Kil Lee, University of Colorado, USA; Jingwen Leng, Shanghai Jiao Tong University, China; Yun (Eric) Liang, Peking University, China; Xue (Shelley) Lin, Northeastern University, USA; Weiqiang Liu, Nanjing University of Aeronautics and Astronautics, China; Iakovos Mavroidis, FORTH, Greece; Nele Mentens, KU Leuven, Belgium; Simon Moore, University of Cambridge, UK; Roger Moussalli, Two Sigma, USA; Jean-Michel Muller, CNRS, Laboratoire LIP, France; Walid A. Najjar, University of California, Riverside, USA; Javier Navaridas, The University of Manchester, UK; Seda Ogrenci Memik, Northwestern University, USA; Marco Platzner, University of Paderborn, Germany; Viktor K. Prasanna, University of Southern California, USA; Sanjay Rajopadhye, Colorado State University, USA
DOI: https://doi.org/10.1109/asap49362.2020.00007
Citations: 0
Condensing an overload of parallel computing ingredients into a single architecture recipe
Riadh Ben Abdelhamid, Y. Yamaguchi, T. Boku
General-purpose processors offer the best programming flexibility to address a wide range of problems. Nonetheless, they still lag behind special-purpose processors when it comes to sustained computational performance. Here, we leverage the best of both worlds and propose a flexible, highly scalable, high-performance computing architecture with versatility in mind. The proposed architecture, code-named DRAGON, benefits from several forms of parallelism such as SIMD, VLIW, Memory Broadcasting and even vector processing.
DOI: https://doi.org/10.1109/ASAP49362.2020.00013
Citations: 4
BWOLF: Bit-Width Optimization for Statistical Divergence with -Logarithmic Functions
Qian Xu, Guowei Sun, G. Qu
Approximate computing is a promising technique for improving the energy efficiency of error-resilient applications such as multimedia, signal processing and neural networks. A large amount of reported work is on the design of approximate computation units with truncated data under error constraints. However, that work mainly focuses on simple arithmetic operations, specifically addition and multiplication. In this paper, we study how to apply the truncation method to the floating-point logarithmic operation, which is becoming increasingly popular. We analyze the trade-off between the precision of computation and the energy it requires, and derive a formula for the most energy-efficient implementation of the logarithm unit for a given error variance range. Based on this theoretical result, we propose BWOLF (Bit-Width Optimization for Logarithmic Function), which uses a sequential quadratic programming algorithm to determine how to truncate data (i.e., bit-width optimization) in a program with logarithm and other arithmetic operations such that the energy consumption is minimized under a fixed error budget. We evaluate the efficacy of BWOLF in energy saving on two widely used applications: Kullback-Leibler Divergence and Bayesian Neural Network. The experimental results validate the correctness of our analysis and show significant energy savings over both the full-precision computation and the uniform truncation method. The energy savings range from 27.18% to 95.92% for different error constraints.
在多媒体、信号处理和神经网络等抗错误性应用中,近似计算是一种很有前途的提高能源效率的技术。大量的研究工作是在误差约束下截断数据的近似计算单元的设计。然而,他们主要集中在简单的算术运算,更具体地说,是加法和乘法。本文研究了如何将截断法应用于日益流行的浮点对数运算。我们分析了计算精度和所需能量之间的权衡,并推导出给定误差方差范围内对数单位最节能实现的公式。基于这一理论结果,我们提出了BWOLF (Bit-Width optimization for Logarithmic Function),它使用顺序二次规划算法来确定在具有对数和其他算术运算的程序中截断数据的方式(即位宽优化),从而在固定误差预算下最小化能耗。本文从Kullback-Leibler散度和贝叶斯神经网络两种广泛应用的角度对BWOLF的节能效果进行了评价。实验结果验证了分析结果的正确性,并表明在全精度计算和均匀截断方法上都能显著节省能量。对于不同的误差约束,节能幅度在27.18% ~ 95.92%之间。
DOI: 10.1109/ASAP49362.2020.00035 (https://doi.org/10.1109/ASAP49362.2020.00035) · Published: 2020-07-01
Citations: 0
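The precision side of the truncation tradeoff described in the BWOLF abstract can be illustrated with a small sketch. This is not the authors' method or their energy model: it only assumes IEEE-754 double-precision inputs, and the helper names `truncate_mantissa` and `log_error` are hypothetical. It zeroes the low mantissa bits of the operand and measures how far the logarithm moves.

```python
import math
import struct


def truncate_mantissa(x: float, kept_bits: int) -> float:
    """Zero the low (52 - kept_bits) mantissa bits of an IEEE-754 double.

    Hypothetical helper for illustration; truncation here rounds toward zero.
    """
    if not 0 <= kept_bits <= 52:
        raise ValueError("kept_bits must be in [0, 52]")
    # Reinterpret the double as a 64-bit unsigned integer.
    (bits,) = struct.unpack("<Q", struct.pack("<d", x))
    drop = 52 - kept_bits
    # Mask that clears the `drop` least significant mantissa bits.
    mask = ~((1 << drop) - 1) & 0xFFFFFFFFFFFFFFFF
    (y,) = struct.unpack("<d", struct.pack("<Q", bits & mask))
    return y


def log_error(x: float, kept_bits: int) -> float:
    """Absolute error of log(x) when x is truncated to kept_bits mantissa bits."""
    return abs(math.log(truncate_mantissa(x, kept_bits)) - math.log(x))


if __name__ == "__main__":
    # Fewer kept bits means a cheaper operand but a larger logarithm error.
    for k in (4, 8, 16, 32, 52):
        print(k, log_error(3.14159, k))
```

For a positive normal input, keeping k mantissa bits bounds the relative truncation error by 2^-k, so the logarithm error shrinks roughly in step with each extra bit kept. A bit-width optimizer would weigh that error against an energy cost per bit, which this sketch does not model.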
An Efficient Convolution Engine based on the À-trous Spatial Pyramid Pooling
Cristian Sestito, F. Spagnolo, P. Corsonello, S. Perri
This paper presents an efficient hardware architecture that performs 2D dilated convolutions and is suitable for integration into modern heterogeneous embedded systems targeting semantic image segmentation. The proposed design supports multiple dilation rates and uses limited resources even when processing large convolution windows. As a case study, the circuit has been integrated into a Xilinx Zynq-7000 FPSoC device to accelerate a state-of-the-art CNN model for medical image segmentation. The results demonstrate higher computational capability, reduced resource utilization, and lower power consumption than competing designs in the literature.
DOI: 10.1109/ASAP49362.2020.00022 (https://doi.org/10.1109/ASAP49362.2020.00022) · Published: 2020-07-01
Citations: 1
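The 2D dilated (à-trous) convolution named in the abstract above can be sketched in software as follows. This is a plain reference loop under stated assumptions, not the hardware architecture from the paper: the function name `dilated_conv2d`, the nested-list data layout, and the "valid" (no padding) border handling are all choices made for illustration.

```python
def dilated_conv2d(image, kernel, rate=1):
    """Valid-mode 2D dilated cross-correlation on nested lists.

    `rate` is the dilation rate: kernel taps are spaced `rate` pixels apart,
    so rate=1 reduces to an ordinary (undilated) convolution window.
    """
    if rate < 1:
        raise ValueError("rate must be >= 1")
    kh, kw = len(kernel), len(kernel[0])
    ih, iw = len(image), len(image[0])
    # Effective window size covered by the dilated kernel.
    eff_h = (kh - 1) * rate + 1
    eff_w = (kw - 1) * rate + 1
    oh, ow = ih - eff_h + 1, iw - eff_w + 1
    if oh <= 0 or ow <= 0:
        raise ValueError("image is smaller than the dilated window")
    out = [[0] * ow for _ in range(oh)]
    for y in range(oh):
        for x in range(ow):
            acc = 0
            # Only kh * kw multiply-accumulates, whatever the rate.
            for i in range(kh):
                for j in range(kw):
                    acc += kernel[i][j] * image[y + i * rate][x + j * rate]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    img = [[r * 5 + c for c in range(5)] for r in range(5)]
    ones = [[1] * 3 for _ in range(3)]
    print(dilated_conv2d(img, ones, rate=1))
    print(dilated_conv2d(img, ones, rate=2))
```

Raising the rate widens the window a kernel covers (a 3x3 kernel at rate 2 spans 5x5 pixels) while the number of multiply-accumulate operations per output stays at kh * kw. That is a general property of dilated convolution; how the paper's circuit exploits it is not described in the abstract.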
Temporal Motionless Analysis of Video using CNN in MPSoC
Somdip Dey, A. Singh, D. Prasad, K. Mcdonald-Maier
This paper proposes a novel human-inspired methodology called IRON-MAN (Integrated RatiONal prediction and Motionless ANalysis of videos) for mobile multi-processor systems-on-chips (MPSoCs). The methodology integrates the analysis of previous image frames of a video to represent the analysis of the current frame, in order to perform Temporal Motionless Analysis of the Video (TMAV). This is the first work on TMAV using a Convolutional Neural Network (CNN) for scene prediction in MPSoCs. Experimental results show that our methodology outperforms the state of the art. We also introduce a metric named Energy Consumption per Training Image (ECTI) to assess the suitability of using a CNN model in mobile MPSoCs, with a focus on the energy consumption of the device.
DOI: 10.1109/ASAP49362.2020.00021 (https://doi.org/10.1109/ASAP49362.2020.00021) · Published: 2020-07-01
Citations: 6