Compared to widely used superconducting qubits, neutral-atom quantum computing technology promises potentially better scalability, a flexible arrangement of qubits that allows higher operation parallelism, and more relaxed cooling requirements. The high performance computing (HPC) and architecture community is beginning to design new solutions that take advantage of neutral-atom quantum architectures and overcome their unique challenges. We propose Geyser, the first work to take advantage of the multi-qubit gates natively supported by neutral-atom quantum computers, by appropriately mapping quantum circuits to a three-qubit-friendly physical arrangement of qubits. Geyser then creates multiple logical blocks in the quantum circuit to exploit quantum parallelism and reduce the number of pulses needed to realize physical gates. These circuit blocks elegantly enable Geyser to compose equivalent circuits with three-qubit gates, even when the original program does not have any multi-qubit gates. Our evaluation shows that Geyser reduces the number of operation pulses by 25%-90% and improves the algorithm's output fidelity by 25-60 percentage points across different algorithms.
{"title":"Geyser","authors":"Tirthak Patel, Daniel Silver, Devesh Tiwari","doi":"10.1145/3470496.3527428","DOIUrl":"https://doi.org/10.1145/3470496.3527428","url":null,"abstract":"Compared to widely-used superconducting qubits, neutral-atom quantum computing technology promises potentially better scalability and flexible arrangement of qubits to allow higher operation parallelism and more relaxed cooling requirements. The high performance computing (HPC) and architecture community is beginning to design new solutions to take advantage of neutral-atom quantum architectures and overcome its unique challenges. We propose Geyser, the first work to take advantage of the multi-qubit gates natively supported by neutral-atom quantum computers by appropriately mapping quantum circuits to three-qubit-friendly physical arrangement of qubits. Then, Geyser creates multiple logical blocks in the quantum circuit to exploit quantum parallelism and reduce the number of pulses needed to realize physical gates. These circuit blocks elegantly enable Geyser to compose equivalent circuits with three-qubit gates, even when the original program does not have any multi-qubit gates. Our evaluation results show Geyser reduces the number of operation pulses by 25%-90% and improves the algorithm's output fidelity by 25%-60% points across different algorithms.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"15 2 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123514219","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The performance and efficiency of running modern Deep Neural Networks (DNNs) are heavily bounded by data movement. To mitigate data-movement bottlenecks, recent DNN inference accelerator designs widely adopt aggressive compression techniques and sparse-skipping mechanisms, which avoid transferring or computing on zero-valued weights or activations to save time and energy. However, such sparse-skipping logic requires large input buffers and incurs irregular data access patterns, precluding many energy-efficient data reuse opportunities and dataflows. In this work, we propose Cascading Structured Pruning (CSP), a technique that preserves significantly more data reuse opportunities for higher energy efficiency while maintaining performance comparable to recent sparse architectures such as SparTen. CSP has two components. At the algorithm level, CSP-A induces a predictable sparsity pattern that allows low-overhead compression of weight data and sequential access to both activation and weight data. At the architecture level, CSP-H leverages CSP-A's induced sparsity pattern with a novel dataflow that accesses unique activation data only once, removing the need for large input buffers. Each CSP-H processing element (PE) employs a novel accumulation-buffer design and a counter-based sparse-skipping mechanism to support the dataflow with minimal controller overhead. We verify our approach on several representative models. Our simulation results show that CSP achieves on average a 15× energy-efficiency improvement over SparTen with comparable or superior speedup under most evaluations.
{"title":"Cascading structured pruning: enabling high data reuse for sparse DNN accelerators","authors":"Edward Hanson, Shiyu Li, H. Li, Yiran Chen","doi":"10.1145/3470496.3527419","DOIUrl":"https://doi.org/10.1145/3470496.3527419","url":null,"abstract":"Performance and efficiency of running modern Deep Neural Networks (DNNs) are heavily bounded by data movement. To mitigate the data movement bottlenecks, recent DNN inference accelerator designs widely adopt aggressive compression techniques and sparse-skipping mechanisms. These mechanisms avoid transferring or computing with zero-valued weights or activations to save time and energy. However, such sparse-skipping logic involves large input buffers and irregular data access patterns, thus precluding many energy-efficient data reuse opportunities and dataflows. In this work, we propose Cascading Structured Pruning (CSP), a technique that preserves significantly more data reuse opportunities for higher energy efficiency while maintaining comparable performance relative to recent sparse architectures such as SparTen. CSP includes the following two components: At algorithm level, CSP-A induces a predictable sparsity pattern that allows for low-overhead compression of weight data and sequential access to both activation and weight data. At architecture level, CSP-H leverages CSP-A's induced sparsity pattern with a novel dataflow to access unique activation data only once, thus removing the demand for large input buffers. Each CSP-H processing element (PE) employs a novel accumulation buffer design and a counter-based sparse-skipping mechanism to support the dataflow with minimum controller overhead. We verify our approach on several representative models. Our simulated results show that CSP achieves on average 15× energy efficiency improvement over SparTen with comparable or superior speedup under most evaluations.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130722359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jonathan Lew, Y. Liu, Wenyi Gong, Negar Goli, R. D. Evans, Tor M. Aamodt
Deep Neural Networks (DNNs) are the state of the art in image, speech, and text processing. To address long training times and high energy consumption, custom accelerators can exploit sparsity, that is, zero-valued weights, activations, and gradients. Previously proposed sparse Convolutional Neural Network (CNN) accelerators support training with no more than one dynamically sparse convolution input. Among existing accelerator classes, the only ones supporting two-sided dynamic sparsity are outer-product-based accelerators. However, when a convolution is mapped onto an outer product, multiplications occur that do not correspond to any valid output. These Redundant Cartesian Products (RCPs) decrease energy efficiency and performance. We observe that in sparse training, up to 90% of computations are RCPs, resulting from the convolution of large matrices for weight updates during the backward pass of CNN training. In this work, we design a mechanism, ANT, to anticipate and eliminate RCPs, enabling more efficient sparse training when integrated with an outer-product accelerator. By anticipating over 90% of RCPs, ANT achieves a geometric-mean speedup of 3.71× over an SCNN-like accelerator [67] on 90% sparse training using DenseNet-121 [38], ResNet18 [35], VGG16 [73], Wide ResNet (WRN) [85], and ResNet-50 [35], with a 4.40× decrease in energy consumption and 0.0017 mm² of additional area. We extend ANT to sparse matrix multiplication, so that the same accelerator can anticipate RCPs in sparse fully-connected layers, transformers, and RNNs.
{"title":"Anticipating and eliminating redundant computations in accelerated sparse training","authors":"Jonathan Lew, Y. Liu, Wenyi Gong, Negar Goli, R. D. Evans, Tor M. Aamodt","doi":"10.1145/3470496.3527404","DOIUrl":"https://doi.org/10.1145/3470496.3527404","url":null,"abstract":"Deep Neural Networks (DNNs) are the state of art in image, speech, and text processing. To address long training times and high energy consumption, custom accelerators can exploit sparsity, that is zero-valued weights, activations, and gradients. Proposed sparse Convolution Neural Network (CNN) accelerators support training with no more than one dynamic sparse convolution input. Among existing accelerator classes, the only ones supporting two-sided dynamic sparsity are outer-product-based accelerators. However, when mapping a convolution onto an outer product, multiplications occur that do not correspond to any valid output. These Redundant Cartesian Products (RCPs) decrease energy efficiency and performance. We observe that in sparse training, up to 90% of computations are RCPs resulting from the convolution of large matrices for weight updates during the backward pass of CNN training. In this work, we design a mechanism, ANT, to anticipate and eliminate RCPs, enabling more efficient sparse training when integrated with an outer-product accelerator. By anticipating over 90% of RCPs, ANT achieves a geometric mean of 3.71× speed up over an SCNN-like accelerator [67] on 90% sparse training using DenseNet-121 [38], ResNet18 [35], VGG16 [73], Wide ResNet (WRN) [85], and ResNet-50 [35], with 4.40× decrease in energy consumption and 0.0017mm2 of additional area. We extend ANT to sparse matrix multiplication, so that the same accelerator can anticipate RCPs in sparse fully-connected layers, transformers, and RNNs.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131732243","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jian Chen, Xiaoyu Zhang, Tao Wang, Ying Zhang, Tao Chen, Jiajun Chen, Mingxu Xie, Qiang Liu
Network intrusion detection systems (IDS) are crucial for secure cloud computing, but they are also severely constrained by CPU computation capacity as network bandwidth increases. Therefore, hardware offloading is essential for IDS servers to support the ever-growing throughput demand of packet processing. Based on experience with large-scale IDS deployment, we find that existing hardware offloading solutions have fundamental limitations that prevent them from being massively deployed in production environments. In this paper, we present Fidas, an FPGA-based intrusion detection offload system that avoids the limitations of existing hardware solutions by comprehensively offloading the primary NIC, rule pattern matching, and traffic flow rate classification. The pattern matching module in Fidas uses a multi-level filter-based approach for efficient regex processing, and the flow rate classification module employs a novel dual-stack memory scheme to identify hot flows under volumetric attacks. Our evaluation shows that Fidas achieves state-of-the-art throughput in pattern matching and flow rate classification while freeing up processors for other security-related functionality. Fidas is deployed in the production data center and has been battle-tested for its performance, cost-effectiveness, and DevOps agility.
{"title":"Fidas","authors":"Jian Chen, Xiaoyu Zhang, Tao Wang, Ying Zhang, Tao Chen, Jiajun Chen, Mingxu Xie, Qiang Liu","doi":"10.1145/3470496.3533043","DOIUrl":"https://doi.org/10.1145/3470496.3533043","url":null,"abstract":"Network intrusion detection systems (IDS) are crucial for secure cloud computing, but they are also severely constrained by CPU computation capacity as the network bandwidth increases. Therefore, hardware offloading is essential for the IDS servers to support the ever-growing throughput demand for packet processing. Based on the experience of large-scale IDS deployment, we find the existing hardware offloading solutions have fundamental limitations that prevent them from being massively deployed in the production environment. In this paper, we present Fidas, an FPGA-based intrusion detection offload system that avoids the limitations of the existing hardware solutions by comprehensively offloading the primary NIC, rule pattern matching, and traffic flow rate classification. The pattern matching module in Fidas uses a multi-level filter-based approach for efficient regex processing, and the flow rate classification module employs a novel dual-stack memory scheme to identify the hot flows under volumetric attacks. Our evaluation shows that Fidas achieves the state-of-the-art throughput in pattern matching and flow rate classification while freeing up processors for other security-related functionalities. Fidas is deployed in the production data center and has been battle-tested for its performance, cost-effectiveness, and DevOps agility.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128741843","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivasa R. Sridharan, T. Krishna
Distributed training reduces DNN training time by splitting the task across multiple NPUs (e.g., GPUs/TPUs). However, distributed training adds communication overhead between the NPUs in order to synchronize gradients and/or activations, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge: if we simply apply today's collective-communication scheduling techniques to such hybrid environments, it is hard to keep all network dimensions busy and maximize network bandwidth (BW). We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication load across all dimensions, further improving network BW utilization. Our results show that, on average, Themis improves the network BW utilization of a single All-Reduce by 1.72× (2.70× max), and improves the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.
{"title":"Themis","authors":"Saeed Rashidi, William Won, Sudarshan Srinivasan, Srinivasa R. Sridharan, T. Krishna","doi":"10.1145/3470496.3527382","DOIUrl":"https://doi.org/10.1145/3470496.3527382","url":null,"abstract":"Distributed training is a solution to reduce DNN training time by splitting the task across multiple NPUs (e.g., GPU/TPU). However, distributed training adds communication overhead between the NPUs in order to synchronize the gradients and/or activation, depending on the parallelization strategy. In next-generation platforms for training at scale, NPUs will be connected through multidimensional networks with diverse, heterogeneous bandwidths. This work identifies a looming challenge of keeping all network dimensions busy and maximizing the network BW within the hybrid environment if we leverage scheduling techniques for collective communication on systems today. We propose Themis, a novel collective scheduling scheme that dynamically schedules collectives (divided into chunks) to balance the communication loads across all dimensions, further improving the network BW utilization. Our results show that on average, Themis can improve the network BW utilization of the single All-Reduce by 1.72× (2.70× max), and improve the end-to-end training iteration performance of real workloads such as ResNet-152, GNMT, DLRM, and Transformer-1T by 1.49× (2.25× max), 1.30× (1.78× max), 1.30× (1.77× max), and 1.25× (1.53× max), respectively.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114761716","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joseph Ravichandran, Weon Taek Na, Jay Lang, Mengjia Yan
The clock grid is a mainstream clock-network methodology for high-performance microprocessor and SoC designs. Clock skew, power usage, and robustness to PVT (process, voltage, temperature) variation are all important metrics for a high-quality clock grid design. The tree-driven grid is a typical clock-grid network: it includes a clock source, a buffered tree, leaf buffers, a mesh clock grid, local clock buffers, and latches, as shown in Fig. 1. For such a network, one big challenge is how to connect the leaf-level buffers of the global tree to a grid with nonuniform loads under tight slew and skew constraints. The choice of tapping points that connect the leaf buffers to the clock grid is critical to the quality of the clock design; good tapping points can minimize clock skew and reduce power. In this paper, we propose a new algorithm for selecting tapping points that builds the global tree as regularly and symmetrically as possible. Our experimental results show that the proposed algorithm efficiently reduces global clock skew, rising slew, maximum overshoot, and power, while avoiding local skew violations.
{"title":"PACMAN","authors":"Joseph Ravichandran, Weon Taek Na, Jay Lang, Mengjia Yan","doi":"10.1145/2633948.2633951","DOIUrl":"https://doi.org/10.1145/2633948.2633951","url":null,"abstract":"Clock grid is a mainstream clock network methodology for high performance microprocessor and SOC designs. Clock skew, power usage and robustness to PVT (power, voltage, temperature) are all important metrics for a high quality clock grid design. Tree-driven-grid clock network is a typical clock grid clock network. It includes a clock source, a buffered tree, leaf buffers, a mesh clock grid, local clock buffers, and latches as shown in Fig. 1. For such network, one big challenge is how to connect the leaf level buffers of the global tree to the grid with nonuniform loads under tight slew and skew constraints. The choice of tapping points that connect the leaf buffers to the clock grid are critical to the quality of the clock designs. Good tapping points can minimize the clock skew and reduce power. In this paper, we proposed a new algorithm to select the tapping points to build the global tree as regular and symmetric as possible. From our experimental results, the proposed algorithm can efficiently reduce global clock skew, rising slew, maximum overshoot, reduce power, and avoid local skew violation.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114768999","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jin Zhao, Yun Yang, Yu Zhang, Xiaofei Liao, Lin Gu, Ligang He, Bin He, Hai Jin, Haikun Liu, Xinyu Jiang, Hui Yu
Many solutions have recently been proposed to support the processing of streaming graphs. However, when each graph snapshot of a streaming graph is processed, the new states of the vertices affected by the graph updates are propagated irregularly along the graph topology. Despite years of research effort, existing approaches still suffer from serious redundant computation overhead and irregular memory accesses, which severely underutilize a many-core processor. To address these issues, this paper proposes TDGraph, a topology-driven programmable accelerator that is the first to augment many-core processors for high-performance streaming-graph processing. Specifically, we incorporate an efficient topology-driven incremental execution approach into the accelerator design for more regular state propagation and better data locality. TDGraph takes the vertices affected by graph updates as roots, prefetches other vertices along the graph topology, and synchronizes their incremental computations on the fly. In this way, most state propagations originating from vertices affected by different graph updates can be conducted together along the graph topology, which helps reduce redundant computation and data-access cost. Besides, through efficient coalescing of accesses to vertex states, TDGraph further improves the utilization of cache and memory bandwidth. We have evaluated TDGraph on a simulated 64-core processor. The results show that a state-of-the-art software system achieves speedups of 7.1×-21.4× after integration with TDGraph, while incurring only 0.73% area cost. Compared with four cutting-edge accelerators, i.e., HATS, Minnow, PHI, and DepGraph, TDGraph achieves speedups of 4.6×-12.7×, 3.2×-8.6×, 3.8×-9.7×, and 2.3×-6.1×, respectively.
{"title":"TDGraph","authors":"Jin Zhao, Yun Yang, Yu Zhang, Xiaofei Liao, Lin Gu, Ligang He, Bin He, Hai Jin, Haikun Liu, Xinyu Jiang, Hui Yu","doi":"10.1145/3470496.3527409","DOIUrl":"https://doi.org/10.1145/3470496.3527409","url":null,"abstract":"Many solutions have been recently proposed to support the processing of streaming graphs. However, for the processing of each graph snapshot of a streaming graph, the new states of the vertices affected by the graph updates are propagated irregularly along the graph topology. Despite the years' research efforts, existing approaches still suffer from the serious problems of redundant computation overhead and irregular memory access, which severely underutilizes a many-core processor. To address these issues, this paper proposes a topology-driven programmable accelerator TDGraph, which is the first accelerator to augment the many-core processors to achieve high performance processing of streaming graphs. Specifically, we propose an efficient topology-driven incremental execution approach into the accelerator design for more regular state propagation and better data locality. TDGraph takes the vertices affected by graph updates as the roots to prefetch other vertices along the graph topology and synchronizes the incremental computations of them on the fly. In this way, most state propagations originated from multiple vertices affected by different graph updates can be conducted together along the graph topology, which help reduce the redundant computations and data access cost. Besides, through the efficient coalescing of the accesses to vertex states, TDGraph further improves the utilization of the cache and memory bandwidth. We have evaluated TDGraph on a simulated 64-core processor. The results show that, the state-of-the-art software system achieves the speedup of 7.1~21.4 times after integrating with TDGraph, while incurring only 0.73% area cost. Compared with four cutting-edge accelerators, i.e., HATS, Minnow, PHI, and DepGraph, TDGraph gains the speedups of 4.6~12.7, 3.2~8.6, 3.8~9.7, and 2.3~6.1 times, respectively.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"506 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"120882129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Lichtenau, A. Buyuktosunoglu, Ramon Bertran Monfort, P. Figuli, C. Jacobi, N. Papandreou, H. Pozidis, A. Saporito, Andrew Sica, Elpida Tzortzatos
IBM Telum is the next-generation processor chip for IBM Z and LinuxONE systems. The Telum design is focused on enterprise-class workloads and achieves over 40% per-socket performance growth compared to IBM z15. IBM Telum is the first server-class chip with a dedicated on-chip AI accelerator, which enables clients to gain real-time insights from their data as it is being processed. Seamlessly infusing AI into all enterprise workloads is highly desirable, both to get real business insight on every transaction and to improve IT operations, security, and data privacy. While it would undeniably provide significant additional value, its application in practice is often hindered by low throughput if run on-platform, and by security concerns and inconsistent latency if run off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low-latency, high-throughput (over 200 TFLOPS in a 32-chip system) inference capacity usable by all threads. The accelerator is memory coherent and directly connected to the fabric, like any other general-purpose core, to support low-latency inference while meeting the system's transaction rate. A scalable architecture providing transparent access to AI accelerator functions via a non-privileged general-purpose core instruction further reduces software orchestration and library complexity and makes the AI functions extensible. On a global bank's credit card fraud detection model, the AI accelerator achieves a 22× latency improvement compared to a general-purpose core using its vector execution units. For the same model, the AI accelerator sustains 116k inferences per second with a latency of only 1.1 msec. As the system is scaled up from one chip to 32 chips, it performs more than 3.5 million inferences/sec while latency stays very low at only 1.2 msec. This paper briefly introduces the IBM Telum chip and then describes the integrated AI accelerator. IBM Telum's AI accelerator architecture, microarchitecture, integration into the system stack, performance, and power are covered in detail.
{"title":"AI accelerator on IBM Telum processor: industrial product","authors":"C. Lichtenau, A. Buyuktosunoglu, Ramon Bertran Monfort, P. Figuli, C. Jacobi, N. Papandreou, H. Pozidis, A. Saporito, Andrew Sica, Elpida Tzortzatos","doi":"10.1145/3470496.3533042","DOIUrl":"https://doi.org/10.1145/3470496.3533042","url":null,"abstract":"IBM Telum is the next generation processor chip for IBM Z and LinuxONE systems. The Telum design is focused on enterprise class workloads and it achieves over 40% per socket performance growth compared to IBM z15. The IBM Telum is the first server-class chip with a dedicated on-chip AI accelerator that enables clients to gain real time insights from their data as it is getting processed. Seamlessly infusing AI in all enterprise workloads is highly desirable to get real business insight on every transaction as well as to improve IT operation, security, and data privacy. While it would undeniably provide significant additional value, its application in practice is often accompanied by hurdles from low throughput if run on-platform to security concerns and inconsistent latency if run off-platform. The IBM Telum chip introduces an on-chip AI accelerator that provides consistent low latency and high throughput (over 200 TFLOPS in 32 chip system) inference capacity usable by all threads. The accelerator is memory coherent and directly connected to the fabric like any other general-purpose core to support low latency inference while meeting the system's transaction rate. A scalable architecture providing transparent access to AI accelerator functions via a non-privileged general-purpose core instruction further reduces software orchestration and library complexity as well as provides extensibility to the AI functions. On a global bank customer credit card fraud detection model, the AI accelerator achieves 22× speed up in latency compared to a general purpose core utilizing vector execution units. For the same model, the AI accelerator achieves 116k inferences every second with a latency of only 1.1 msec. As the system is scaled up from one chip to 32 chips, it performs more than 3.5 Million inferences/sec and the latency still stays very low at only 1.2 msec. This paper briefly introduces the IBM Telum chip and later describes the integrated AI accelerator. IBM Telum's AI accelerator architecture, microarchitecture, integration into the system stack, performance, and power are covered in detail.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"213 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122111348","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the presence of permafrost, the negative heat flux coming up from the cold frozen subsurface leads to a strongly negative winter equilibrium temperature (WEqT), typically below -2 °C, whereas on non-frozen soil the WEqT is close to 0 °C or only moderately negative (Haeberli 1973). The WEqT can therefore be a good indicator of permafrost occurrence and can help discriminate permafrost from non-permafrost areas, provided that the snow cover developed early in the winter and remained thick enough to isolate the soil surface from atmospheric influence.
{"title":"BTS","authors":"","doi":"10.1145/3470496.3527415","DOIUrl":"https://doi.org/10.1145/3470496.3527415","url":null,"abstract":"In presence of permafrost, the negative heat flux coming up from the cold frozen subsurface will lead to strongly negative WEqT (typically less than -2 °C), whereas on non frozen soil, the WEqT will be close to 0 °C or moderately negative (Haeberli 1973). Thus the WEqT can be a good indicator of permafrost occurrence and can help to discriminate permafrost from non-permafrost areas, provided that the snow cover developped early in the winter and remained sufficient to isolate the soil surface from atmospheric influence.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"11 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122178818","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sudhanshu Shukla, Sumeet Bandishte, Jayesh Gaur, S. Subramoney
The memory wall continues to limit the performance of modern out-of-order (OOO) processors, despite the expensive provisioning of large multi-level caches and advancements in memory prefetching. In this paper, we put forth an important observation: the memory wall is not monolithic, but is composed of many latency walls, one arising from each tier of the cache/memory hierarchy. Our results show that even though level-1 (L1) data cache latency is nearly 40× lower than main memory latency, mitigating it offers a performance opportunity very similar to that of the more widely studied main memory latency. This motivates our proposal, Register File Prefetch (RFP), which intelligently utilizes the existing OOO scheduling pipeline and the available L1 data cache/register file bandwidth to successfully prefetch 43.4% of load requests from the L1 cache into the register file. Simulation results on 65 diverse workloads show that this translates to a 3.1% performance gain over a baseline with parameters similar to the Intel Tiger Lake processor, which further increases to 5.7% for a futuristic up-scaled core. We also contrast register file prefetching with techniques like load value and address prediction, which enhance performance by speculatively breaking data dependencies. Our analysis shows that RFP is synergistic with value prediction, with the two features together delivering a 4.1% average performance improvement, significantly higher than the 2.2% gain obtained from value prediction alone.
{"title":"Register file prefetching","authors":"Sudhanshu Shukla, Sumeet Bandishte, Jayesh Gaur, S. Subramoney","doi":"10.1145/3470496.3527398","DOIUrl":"https://doi.org/10.1145/3470496.3527398","url":null,"abstract":"The memory wall continues to limit the performance of modern out-of-order (OOO) processors, despite the expensive provisioning of large multi-level caches and advancements in memory prefetching. In this paper, we put forth an important observation that the memory wall is not monolithic, but is constituted of many latency walls arising due to the latency of each tier of cache/memory. Our results show that even though level-1 (L1) data cache latency is nearly 40X lower than main memory latency, mitigating this latency offers a very similar performance opportunity as the more widely studied, main memory latency. This motivates our proposal Register File Prefetch (RFP) that intelligently utilizes the existing OOO scheduling pipeline and available L1 data cache/Register File bandwidth to successfully prefetch 43.4% of load requests from the L1 cache to the Register File. Simulation results on 65 diverse workloads show that this translates to 3.1% performance gain over a baseline with parameters similar to Intel Tiger Lake processor, which further increases to 5.7% for a futuristic up-scaled core. We also contrast and differentiate register file prefetching from techniques like load value and address prediction that enhance performance by speculatively breaking data dependencies. Our analysis shows that RFP is synergistic with value prediction, with both the features together delivering 4.1% average performance improvement, which is significantly higher than the 2.2% performance gain obtained from just doing value prediction.","PeriodicalId":337932,"journal":{"name":"Proceedings of the 49th Annual International Symposium on Computer Architecture","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130374551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}