Proceedings of the 49th Annual International Symposium on Computer Architecture: Latest Papers

uBrain
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527401
Di Wu, Jingjie Li, Zhewen Pan, Younghyun Kim, Joshua San Miguel
Brain-computer interfaces (BCIs) have been widely adopted to enhance human perception via brain signals with abundant spatial-temporal dynamics, such as electroencephalogram (EEG). In recent years, BCI algorithms have been moving from classical feature engineering to emerging deep neural networks (DNNs), making it possible to identify the spatial-temporal dynamics with improved accuracy. However, existing BCI architectures do not leverage such dynamics for hardware efficiency. In this work, we present uBrain, a unary computing BCI architecture for DNN models with cascaded convolutional and recurrent neural networks to achieve high task capability and hardware efficiency. uBrain co-designs the algorithm and hardware: the DNN architecture and the hardware architecture are optimized with customized unary operations and immediate signal processing after sensing, respectively. Experiments show that uBrain, with negligible accuracy loss, surpasses the CPU, systolic array and stochastic computing baselines in on-chip power efficiency by 9.0×, 6.2× and 2.0×.
Citations: 5
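The unary computing idea mentioned in the uBrain abstract can be illustrated with a minimal sketch. It assumes a generic deterministic unary scheme in which a value in [0, 1] is encoded as the fraction of ones in a bitstream and multiplication is a bitwise AND; the function names and the repeat/hold pairing are illustrative assumptions, not uBrain's actual operators.

```python
def thermometer(value, n):
    """Encode a value in [0, 1] as n bits with round(value * n) leading ones."""
    ones = round(value * n)
    return [1] * ones + [0] * (n - ones)


def unary_multiply(a, b, n=16):
    """Multiply a and b by AND-ing two unary streams of length n * n.

    Stream A repeats its n-bit code n times, while stream B holds each of
    its bits for n cycles, so every bit of A meets every bit of B once.
    """
    code_a = thermometer(a, n)
    code_b = thermometer(b, n)
    stream_a = code_a * n                                  # repeat the code
    stream_b = [bit for bit in code_b for _ in range(n)]   # hold each bit
    ones = sum(x & y for x, y in zip(stream_a, stream_b))  # one AND gate per cycle
    return ones / (n * n)


print(unary_multiply(0.5, 0.25))  # 0.125
```

The arithmetic unit is a single AND gate plus a counter, which is where the hardware savings of unary designs come from; the cost is a longer stream (here n * n cycles).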
EDAM
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527424
Robert Hanhan, Esteban Garzón, Zuher Jahshan, A. Teman, M. Lanuzza, Leonid Yavits
We propose a novel edit distance-tolerant content addressable memory (EDAM) for energy-efficient approximate search applications. Unlike state-of-the-art approximate search solutions that tolerate a certain Hamming distance between the query pattern and the stored data, EDAM tolerates edit distance, which makes it especially efficient in applications such as text processing and genome analysis. EDAM was designed using a commercial 65 nm 1.2 V CMOS technology and evaluated through extensive Monte Carlo simulations, while considering different process corners. Simulation results show that EDAM can achieve robust approximate search operation with a wide range of edit distance threshold levels. EDAM is functionally evaluated as a pathogen DNA detection and classification accelerator. EDAM achieves up to 1.7× higher F1 score for high-quality DNA reads and up to 19.55× higher F1 score for DNA reads with a 15% error rate, compared to the state-of-the-art DNA classification tool Kraken2. Simulated at 667 MHz, EDAM provides a 1,214× average speedup over Kraken2. This makes EDAM suitable for hardware acceleration of genomic surveillance of outbreaks, such as the ongoing Covid-19 pandemic.
Citations: 14
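The difference between Hamming-distance and edit-distance tolerance described in the EDAM abstract can be shown in software. The sketch below uses the textbook dynamic-programming (Levenshtein) edit distance; it only illustrates the matching criterion and says nothing about how the EDAM circuit evaluates it. The `matches` helper and the example strings are assumptions made for illustration.

```python
def hamming_distance(a, b):
    """Count positions where two equal-length strings differ."""
    if len(a) != len(b):
        raise ValueError("Hamming distance needs equal-length strings")
    return sum(x != y for x, y in zip(a, b))


def edit_distance(a, b):
    """Levenshtein distance: minimum insertions, deletions, substitutions."""
    prev = list(range(len(b) + 1))
    for i, x in enumerate(a, 1):
        curr = [i]
        for j, y in enumerate(b, 1):
            curr.append(min(prev[j] + 1,              # delete x
                            curr[j - 1] + 1,          # insert y
                            prev[j - 1] + (x != y)))  # substitute or match
        prev = curr
    return prev[-1]


def matches(query, stored, threshold):
    """Approximate match: accept if the edit distance is within the threshold."""
    return edit_distance(query, stored) <= threshold


stored = "ACGTACGT"
query = "CGTACGTA"  # same read with the first base dropped and one base appended
print(hamming_distance(query, stored))  # 8
print(edit_distance(query, stored))     # 2
```

A single dropped base shifts every later position, so the Hamming distance reaches the full string length while the edit distance stays at 2. This is why edit-distance tolerance suits DNA reads with insertion and deletion errors.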
Rethinking programmable earable processors
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527396
Nathaniel Bleier, Muhammad Husnain Mubarik, Srijan Chakraborty, S. Kishore, Rakesh Kumar
Earables such as earphones [15, 16, 73], hearing aids [28], and smart glasses [2, 14] are poised to be a prominent programmable computing platform in the future. In this paper, we ask the question: what kind of programmable hardware would be needed to support earable computing in the future? To understand hardware requirements, we propose EarBench, a suite of representative emerging earable applications with diverse sensor-based inputs and computation requirements. Our analysis of EarBench applications shows that, on average, there is a 13.54× to 3.97× performance gap between the computational needs of EarBench applications and the performance of the microprocessors that several of today's programmable earable SoCs are based on; more complex microprocessors have unacceptable energy efficiency for earable applications. Our analysis also shows that EarBench applications are dominated by a small number of digital signal processing (DSP) and machine learning (ML)-based kernels that have significant computational similarity. We propose SpEaC, a coarse-grained reconfigurable spatial architecture, as an energy-efficient programmable processor for earable applications. SpEaC targets earable applications efficiently using a) a reconfigurable fixed-point multiply-and-add augmented reduction tree-based substrate with support for vectorized complex operations that is optimized for the earable ML and DSP kernel code and b) a tightly coupled control core for executing other code (including non-matrix computation, or non-multiply or add operations in the earable DSP kernel code). Unlike other CGRAs that typically target general-purpose computations, the SpEaC substrate is optimized for energy-efficient execution of the earable kernels at the expense of generality. Across all our kernels, SpEaC outperforms programmable cores modeled after M4, M7, A53, and HiFi4 DSP by 99.3×, 32.5×, 14.8×, and 9.8× respectively. At 63 mW in 28 nm, the energy efficiency benefits are 1.55×, 9.04×, 68.3×, and 32.7× respectively; energy efficiency benefits are 15.7× to 1087× over a low power Mali T628 MP6 GPU.
Citations: 1
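The "fixed-point multiply-and-add augmented reduction tree" named in the abstract above can be pictured with a small software model. The sketch assumes Q15 fixed-point inputs and a plain pairwise adder tree; the format choice, the function names, and the final shift are illustrative assumptions rather than SpEaC's actual datapath.

```python
def to_q15(x):
    """Convert a real number in [-1, 1) to a saturated Q15 integer."""
    return max(-32768, min(32767, round(x * 32768)))


def tree_reduce(values):
    """Sum values level by level, as a pairwise adder tree would."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a zero input
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0] if level else 0


def q15_dot(a, b):
    """Dot product of two Q15 vectors: multiply leaves, then reduce."""
    products = [x * y for x, y in zip(a, b)]  # each product is in Q30
    return tree_reduce(products) >> 15        # rescale the sum back to Q15


a = [to_q15(0.5), to_q15(0.25)]
b = [to_q15(0.5), to_q15(0.5)]
print(q15_dot(a, b) / 32768)  # 0.375
```

A tree of depth log2(n) sums n products, which is the shape shared by the dot products at the core of many DSP and ML kernels.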
DIMMining
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527388
Guohao Dai, Zhenhua Zhu, Tianyu Fu, Chiyue Wei, Bangyan Wang, Xiangyu Li, Yuan Xie, Huazhong Yang, Yu Wang
Graph mining, which finds specific patterns in the graph, is becoming increasingly important in various domains. We point out that accelerating graph mining suffers from the following challenges: (1) Heavy comparison for pruning: The pruning technique is widely used to reduce search space in graph mining. It applies constraints on vertex indices and involves massive index comparisons. (2) Low parallelism of set operations: The typical graph mining algorithms can be expressed as a series of set operations between neighbors of vertices, which suffer from low parallelism if vertices are streamed to the computation units. (3) Heavy data transfer: Graph mining needs to transfer intermediate data two orders of magnitude larger than the original data volume between CPU and memory. To tackle these challenges, we propose DIMMining with four techniques from algorithm to architecture perspectives. The Index Pre-comparison scheme is proposed for efficient pruning. We introduce the self anchor and neighbor partition to enable pre-comparison for vertex indices. Thus, we can reduce comparisons during runtime. We propose a Flexible BCSR (Bitmap with Compressed Sparse Row) format to enable parallelism for set operations from the data structure perspective, which works on continuous vertices without memory space overheads. The Systolic Merge Array is designed to further explore the parallelism on discontinuous vertices from the architecture perspective. Then, we propose a DIMM-based Near-Memory-Computing architecture, which eliminates the large-volume data transfer between the computation and the memory. Extensive experimental results on real-world graphs show that DIMMining achieves 222.23× and 139.51× speedup compared with FPGAs and CPUs, and 3.61× speedup over the state-of-the-art graph mining architecture.
Citations: 18
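Two ideas from the DIMMining abstract, set operations on neighbor lists and index-based pruning, can be sketched with triangle counting. The example assumes plain integer bitmaps (one bit per vertex) and a simple `w < v < u` ordering constraint; it is not the paper's Flexible BCSR format or its Index Pre-comparison scheme, only an illustration of why bitmaps make set intersection a bit-parallel operation.

```python
def to_bitmap(neighbors):
    """Pack a neighbor list into an integer with bit v set for neighbor v."""
    bitmap = 0
    for v in neighbors:
        bitmap |= 1 << v
    return bitmap


def count_triangles(adjacency):
    """Count triangles once each by enforcing w < v < u on vertex indices."""
    bitmaps = {v: to_bitmap(ns) for v, ns in adjacency.items()}
    total = 0
    for u, neighbors in adjacency.items():
        for v in neighbors:
            if v >= u:
                continue                      # pruning: only consider v < u
            common = bitmaps[u] & bitmaps[v]  # set intersection as one AND
            common &= (1 << v) - 1            # pruning: keep only w < v
            total += bin(common).count("1")
    return total


k4 = {0: [1, 2, 3], 1: [0, 2, 3], 2: [0, 1, 3], 3: [0, 1, 2]}
print(count_triangles(k4))  # 4
```

The index constraint turns into a single mask on the bitmap instead of one comparison per candidate vertex, which is the kind of saving that motivates handling comparisons ahead of time.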
RACOD
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527383
Mohammad Bakhshalipour, Seyed Borna Ehsani, Mohamad Qadri, Dominic Guri, M. Likhachev, Phillip B. Gibbons
RACOD is an algorithm/hardware co-design for mobile robot path planning. It consists of two main components: CODAcc, a hardware accelerator for collision detection; and RASExp, an algorithm extension for runahead path exploration. CODAcc uses a novel MapReduce-style hardware computational model and massively parallelizes individual collision checks. RASExp predicts future path explorations and proactively computes their collision status ahead of time, thereby overlapping multiple collision detections. By affording multiple cheap CODAcc accelerators and overlapping collision detections using RASExp, RACOD significantly accelerates planning for mobile robots operating in arbitrary environments. Evaluations of popular benchmarks show up to 41.4× (self-driving cars) and 34.3× (pilotless drones) speedup with less than 0.3% area overhead. While the performance is maximized when CODAcc and RASExp are used together, they can also be used individually. To illustrate, we evaluate CODAcc alone in the context of a stationary robotic arm and show that it improves performance by 3.4× to 3.8×. Also, we evaluate RASExp alone on commodity many-core CPU and GPU platforms by implementing it purely in software and show that with 32/128 CPU/GPU threads, it accelerates the end-to-end planning time by 8.6×/2.9×.
Citations: 6
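The two ideas in the RACOD abstract, a map-then-reduce collision check and runahead evaluation of predicted positions, can be sketched on a 2D occupancy grid. Everything below is an illustrative assumption: the grid representation, the straight-line prediction, and the function names are not taken from the paper, and the sequential Python loops stand in for work that hardware or threads would overlap.

```python
def footprint_collides(grid, footprint, x, y):
    """Map each footprint cell to a blocked flag, then reduce with OR."""
    def cell_blocked(offset):
        cx, cy = x + offset[0], y + offset[1]
        outside = not (0 <= cy < len(grid) and 0 <= cx < len(grid[0]))
        return outside or grid[cy][cx] == 1
    return any(map(cell_blocked, footprint))  # map step, then OR-reduce


def runahead_checks(grid, footprint, x, y, heading, depth):
    """Precompute collision status for the next `depth` predicted positions."""
    dx, dy = heading
    cache = {}
    for step in range(1, depth + 1):
        pos = (x + dx * step, y + dy * step)
        cache[pos] = footprint_collides(grid, footprint, *pos)
    return cache


grid = [[0, 0, 0, 0],
        [0, 0, 1, 0],
        [0, 0, 0, 0]]
robot = [(0, 0)]  # single-cell footprint
print(runahead_checks(grid, robot, 0, 1, (1, 0), 3))
# {(1, 1): False, (2, 1): True, (3, 1): False}
```

A planner that later expands one of the cached positions can read the stored result instead of waiting for a fresh check, which is how runahead hides collision detection latency.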
Dynamic global adaptive routing in high-radix networks
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527389
Hans Kasan, G. Kim, Yung Yi, John Kim
Global adaptive routing is a critical component of high-radix networks in large-scale systems and is necessary to fully exploit the path diversity of a high-radix topology. The routing decision in global adaptive routing is made between minimal and non-minimal paths, is often based on local information (e.g., queue occupancy), and relies on "approximate" congestion information through backpressure. Different heuristic-based adaptive routing algorithms have been proposed for high-radix topologies; however, heuristic-based routing has performance trade-offs for different traffic patterns and leads to inefficient routing decisions. In addition, previously proposed global adaptive routing algorithms are static, as the same routing decision algorithm is used even if the congestion information changes. In this work, we propose a novel global adaptive routing scheme, which we refer to as dynamic global adaptive routing, that adjusts the routing decision algorithm through a dynamic bias based on the network traffic and congestion to maximize performance. In particular, we propose DGB, a Decoupled, Gradient descent-based Bias global adaptive routing algorithm. DGB introduces a dynamic bias to the global adaptive routing decision by leveraging gradient descent to dynamically adjust the adaptive routing bias based on the network congestion. In addition, the local and global congestion information are decoupled in the routing decision: global information is used for the dynamic bias, while local information is used in the routing decision to more accurately estimate the network congestion. Our evaluations show that DGB consistently outperforms previously proposed routing algorithms across a diverse range of traffic patterns and workloads. For asymmetric traffic patterns, DGB improves throughput by 65% compared to the state-of-the-art global adaptive routing algorithm while matching the performance for symmetric traffic patterns. For trace workloads, DGB provides an average performance improvement of 26%.
Citations: 2
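The structure described in the abstract above, a local queue-based choice between minimal and non-minimal paths plus a bias tuned by gradient descent, can be sketched as follows. This is a hypothetical model: the UGAL-style comparison, the use of measured latency as the cost, and the finite-difference gradient step are all assumptions chosen for illustration, not DGB's actual decision rule or update equation.

```python
def choose_minimal(q_min, hops_min, q_non, hops_non, bias):
    """Local decision: take the minimal path if its estimated delay,
    queue occupancy times hop count, is within `bias` of the alternative."""
    return q_min * hops_min <= q_non * hops_non + bias


def update_bias(bias, bias_prev, latency, latency_prev, learning_rate=0.5):
    """Global adjustment: one finite-difference gradient descent step
    on a measured cost (assumed here to be average packet latency)."""
    delta = bias - bias_prev
    if delta == 0:
        return bias + learning_rate  # no slope information yet, so perturb
    gradient = (latency - latency_prev) / delta
    return bias - learning_rate * gradient


# With no bias, the congested minimal path is avoided.
print(choose_minimal(q_min=4, hops_min=1, q_non=1, hops_non=2, bias=0))  # False
# Raising the bias from 1.0 to 2.0 lowered latency from 12 to 10, so keep going.
print(update_bias(bias=2.0, bias_prev=1.0, latency=10.0, latency_prev=12.0))  # 3.0
```

The two functions use different inputs on purpose: the per-packet choice sees only local queue state, while the bias update sees a network-wide measurement, mirroring the decoupling of local and global information that the abstract describes.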
ACT: designing sustainable computer systems with an architectural carbon modeling tool
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527408
Udit Gupta, Mariam Elgamal, G. Hills, Gu-Yeon Wei, Hsien-Hsin S. Lee, David Brooks, Carole-Jean Wu
Given the performance and efficiency optimizations realized by the computer systems and architecture community over the last decades, the dominating source of computing's carbon footprint is shifting from operational emissions to embodied emissions. These embodied emissions are due to hardware manufacturing and infrastructure-related activities. Despite the rising embodied emissions, there is a distinct lack of architectural modeling tools to quantify and optimize the end-to-end carbon footprint of computing. This work proposes ACT, an architectural carbon footprint modeling framework, to enable carbon characterization and sustainability-driven early design space exploration. Using ACT, we demonstrate that optimizing hardware for carbon yields distinct solutions compared to optimizing for performance and efficiency. We construct use cases, based on the three tenets of sustainable design (Reduce, Reuse, Recycle), to highlight future methods that enable strong performance and efficiency scaling in an environmentally sustainable manner.
Citations: 45
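The split between operational and embodied emissions in the ACT abstract can be made concrete with a simplified footprint model. The sketch assumes that operational carbon is energy times grid carbon intensity and that embodied carbon is amortized over the fraction of the hardware lifetime a workload uses; the function names and every number are made-up illustrations, not values or code from ACT.

```python
def operational_carbon(energy_kwh, grid_g_per_kwh):
    """Emissions from the electricity consumed while running."""
    return energy_kwh * grid_g_per_kwh


def amortized_embodied_carbon(embodied_g, runtime_hours, lifetime_hours):
    """Share of manufacturing emissions charged to this run."""
    return embodied_g * runtime_hours / lifetime_hours


def total_carbon(energy_kwh, grid_g_per_kwh, embodied_g, runtime_hours, lifetime_hours):
    """End-to-end footprint in grams of CO2: operational plus embodied share."""
    return (operational_carbon(energy_kwh, grid_g_per_kwh)
            + amortized_embodied_carbon(embodied_g, runtime_hours, lifetime_hours))


# Two hypothetical designs running the same 10-hour job on a 400 g/kWh grid.
big_chip = total_carbon(2, 400, 100_000, 10, 1_000)   # less energy, more silicon
small_chip = total_carbon(3, 400, 20_000, 10, 1_000)  # more energy, less silicon
print(big_chip, small_chip)  # 1800.0 1400.0
```

In this made-up comparison the design that uses less energy has the larger total footprint, which is the kind of divergence between carbon-optimal and efficiency-optimal choices that the abstract points to.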
Accelerating database analytic query workloads using an associative processor
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527435
Helena Caminal, Yannis Chronis, Tianshu Wu, J. Patel, José F. Martínez
Database analytic query workloads are heavy consumers of data-center cycles, and there is constant demand to improve their performance. Associative processors (APs) have re-emerged as an attractive architecture that offers very large data-level parallelism that can be used to implement a wide range of general-purpose operations. Associative processing is based primarily on efficient search and bulk update operations. Analytic query workloads benefit from data-parallel execution and often feature both search and bulk update operations. In this paper, we investigate how amenable APs are to improving the performance of analytic query workloads. For this study, we use the recently proposed Content-Addressable Processing Engine (CAPE) framework. CAPE is an AP core that is highly programmable via the RISC-V ISA with standard vector extensions. By mapping key database operators to CAPE and introducing AP-aware changes to the query optimizer, we show that CAPE is a good match for database analytic workloads. We also propose a set of database-aware microarchitectural changes to CAPE to further improve performance. Overall, CAPE achieves a 10.8× speedup on average (up to 61.1×) on the SSB benchmark (a suite of 13 queries) compared to an iso-area aggressive out-of-order processor with AVX-512 SIMD support.
{"title":"Accelerating database analytic query workloads using an associative processor","authors":"Helena Caminal, Yannis Chronis, Tianshu Wu, J. Patel, José F. Martínez","doi":"10.1145/3470496.3527435","DOIUrl":"https://doi.org/10.1145/3470496.3527435","url":null,"abstract":"Database analytic query workloads are heavy consumers of data-center cycles, and there is constant demand to improve their performance. Associative processors (AP) have re-emerged as an attractive architecture that offers very large data-level parallelism that can be used to implement a wide range of general-purpose operations. Associative processing is based primarily on efficient search and bulk update operations. Analytic query workloads benefit from data parallel execution and often feature both search and bulk update operations. In this paper, we investigate how amenable APs are to improving the performance of analytic query workloads. For this study, we use the recently proposed Content-Addressable Processing Engine (CAPE) framework. CAPE is an AP core that is highly programmable via the RISC-V ISA with standard vector extensions. By mapping key database operators to CAPE and introducing AP-aware changes to the query optimizer, we show that CAPE is a good match for database analytic workloads. We also propose a set of database-aware microarchitectural changes to CAPE to further improve performance. 
Overall, CAPE achieves a 10.8× speedup on average (up to 61.1×) on the SSB benchmark (a suite of 13 queries) compared to an iso-area aggressive out-of-order processor with AVX-512 SIMD support.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124340764","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
A synthesis framework for stitching surface code with superconducting quantum devices
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527381
Anbang Wu, Gushu Li, Hezi Zhang, G. Guerreschi, Yufei Ding, Yuan Xie
Quantum error correction (QEC) is the central building block of fault-tolerant quantum computation, but the design of QEC codes may not always match the underlying hardware. To tackle the discrepancy between the quantum hardware and QEC codes, we propose a synthesis framework that can implement and optimize the surface code on superconducting quantum architectures. In particular, we divide the surface code synthesis into three key subroutines. The first two optimize the mapping of data qubits and ancillary qubits, including syndrome qubits, onto the connectivity-constrained superconducting architecture, while the last subroutine optimizes the surface code execution by rescheduling syndrome measurements. Our experiments on mainstream superconducting architectures demonstrate the effectiveness of the proposed synthesis framework. Notably, the surface codes synthesized by the proposed automatic framework can achieve error correction capability comparable to, or even better than, that of manually designed QEC codes.
Citations: 9
Free atomics: hardware atomic operations without fences
Pub Date : 2022-06-11 DOI: 10.1145/3470496.3527385
Ashkan Asgharzadeh, J. M. Cebrian, Arthur Perais, S. Kaxiras, Alberto Ros
Atomic Read-Modify-Write (RMW) instructions are primitive synchronization operations implemented in hardware that provide the building blocks for higher-abstraction synchronization mechanisms to programmers. According to publicly available documentation, current x86 implementations serialize atomic RMW operations, i.e., the store buffer is drained before issuing atomic RMWs and subsequent memory operations are stalled until the atomic RMW commits. This serialization, carried out by memory fences, incurs a performance cost which is expected to increase with deeper pipelines. This work proposes Free atomics, a lightweight, speculative, deadlock-free implementation of atomic operations that removes the need for memory fences, thus improving performance, while preserving atomicity and consistency. Free atomics is, to the best of our knowledge, the first proposal to enable store-to-load forwarding for atomic RMWs. Free atomics only requires simple modifications and incurs a small area overhead (15 bytes). Our evaluation using gem5-20 shows that, for a 32-core configuration, Free atomics improves performance by 12.5%, on average, for a large range of parallel workloads and 25.2%, on average, for atomic-intensive parallel workloads over a fenced atomic RMW implementation.
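The serialized baseline the abstract describes (drain the store buffer, then perform the RMW) can be pictured with a toy model. This is a rough sketch under simplifying assumptions: `ToyCore` and its methods are hypothetical names, there is a single core and no timing, and it models only the fenced baseline, not the Free atomics mechanism itself.

```python
from collections import deque
from typing import Deque, Dict, Tuple


class ToyCore:
    """Single-core toy model with a store buffer (hypothetical, untimed)."""

    def __init__(self, memory: Dict[int, int]) -> None:
        self.memory = memory
        self.store_buffer: Deque[Tuple[int, int]] = deque()

    def store(self, addr: int, value: int) -> None:
        # Ordinary stores wait in the store buffer before reaching memory.
        self.store_buffer.append((addr, value))

    def drain(self) -> int:
        # Write every buffered store to memory in program order.
        drained = len(self.store_buffer)
        while self.store_buffer:
            addr, value = self.store_buffer.popleft()
            self.memory[addr] = value
        return drained

    def fenced_fetch_add(self, addr: int, delta: int) -> Tuple[int, int]:
        # Fence before the RMW: the store buffer is drained first, so the
        # atomic reads the newest value directly from memory.
        drained = self.drain()
        old = self.memory.get(addr, 0)
        self.memory[addr] = old + delta
        return old, drained


if __name__ == "__main__":
    mem = {0x10: 1}
    core = ToyCore(mem)
    core.store(0x10, 5)
    core.store(0x20, 7)
    old, drained = core.fenced_fetch_add(0x10, 1)
    print(old, drained, mem)
```

The count of drained stores returned by `fenced_fetch_add` stands in for the work an atomic must wait on in the fenced scheme; the abstract's proposal is to avoid that wait by letting the atomic obtain its value through store-to-load forwarding instead.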
Citations: 4