Latest Publications in IEEE Transactions on Computers

Scalpel: High Performance Contention-Aware Task Co-Scheduling for Shared Cache Hierarchy
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-18 DOI: 10.1109/TC.2024.3500381
Song Liu;Jie Ma;Zengyuan Zhang;Xinhe Wan;Bo Zhao;Weiguo Wu
For scientific computing applications that consist of many loosely coupled tasks, efficient scheduling is critical to achieving high performance and good quality of service (QoS). One challenge for co-running tasks is frequent contention for the shared cache hierarchy of multi-core processors. Such contention significantly increases the cache miss rate and therefore degrades the performance of computational tasks. This paper presents Scalpel, a contention-aware task grouping and co-scheduling approach for efficient task scheduling on shared cache hierarchies. Scalpel uses the shared-cache access features of tasks to group them heuristically, reducing contention within groups by equalizing shared cache locality while maintaining load balance between groups. Building on this grouping, it applies a two-level scheduling strategy that assigns groups to processors and tasks to available cores in a timely manner, accounting for the impact of task scheduling on shared cache locality to minimize task execution time. Experiments show that, compared to several baseline approaches, Scalpel reduces the shared cache miss rate by up to 2.14× and improves execution time by up to 1.53× for scientific computing benchmarks.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 678-690
Citations: 0
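The abstract describes the grouping heuristic only at a high level. A minimal sketch of the idea — equalizing aggregate shared-cache locality across groups while keeping group sizes balanced — might look as follows; the `Task` profile, the `cache_refs` metric, and the serpentine assignment are illustrative assumptions, not Scalpel's actual algorithm:

```python
from dataclasses import dataclass

@dataclass
class Task:
    name: str
    cache_refs: float  # profiled shared-cache access intensity (hypothetical metric)

def group_tasks(tasks, n_groups):
    """Greedy heuristic: deal tasks, sorted by shared-cache pressure, in
    serpentine order (0,1,...,k-1,k-1,...,1,0,...) so every group ends up
    with roughly equal aggregate cache pressure and equal size."""
    ordered = sorted(tasks, key=lambda t: t.cache_refs, reverse=True)
    groups = [[] for _ in range(n_groups)]
    for i, t in enumerate(ordered):
        cycle, idx = i // n_groups, i % n_groups
        if cycle % 2 == 1:          # reverse direction on odd passes
            idx = n_groups - 1 - idx
        groups[idx].append(t)
    return groups
```

With six tasks of descending cache pressure 6..1 and two groups, the serpentine pass yields aggregate pressures of 11 and 10 — near-equal locality with equal group sizes, which is the property the paper's grouping aims for.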
On Task Mapping in Multi-chiplet Based Many-Core Systems to Optimize Inter- and Intra-chiplet Communications
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-18 DOI: 10.1109/TC.2024.3500354
Xiaohang Wang;Yifan Wang;Yingtao Jiang;Amit Kumar Singh;Mei Yang
Multi-chiplet system design, which integrates multiple chiplets/dielets within a single package, has emerged as a promising paradigm in the post-Moore era. This paper introduces a novel task mapping algorithm for multi-chiplet based many-core systems, addressing the unique challenges posed by intra- and inter-chiplet communications under power and thermal constraints. Traditional task mapping algorithms fail to account for the latency and bandwidth differences between these two kinds of communication, leading to sub-optimal performance in multi-chiplet systems. The proposed algorithm employs a two-step process: (1) task assignment to chiplets using binary linear programming, leveraging a totally unimodular constraint matrix, and (2) intra-chiplet mapping that minimizes communication latency while respecting both thermal and power constraints. This method strategically positions tasks with extensive inter-chiplet communication near interface nodes and centralizes those with predominantly intra-chiplet communication. Experimental results demonstrate that the proposed algorithm outperforms existing methods (DAR and IOA), with 37.5% and 24.7% reductions in execution time, respectively. Communication latency is also reduced by up to 43.2% and 32.9% compared to DAR and IOA. These findings affirm that the proposed task mapping algorithm aligns well with the characteristics of multi-chiplet based many-core systems and thus delivers improved performance.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 510-525
Citations: 0
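The intra-chiplet placement idea — boundary-heavy tasks near the interface node, chiplet-local tasks toward the center — can be sketched as a simple ranking heuristic. This is a hypothetical illustration of the stated strategy, not the paper's actual two-step algorithm (which uses binary linear programming for the chiplet-assignment step):

```python
def place_tasks(tasks, mesh, interface):
    """tasks: dict name -> (ext_traffic, int_traffic); mesh: list of (x, y)
    core coordinates on one chiplet; interface: (x, y) of the interface node.
    Rank tasks by the share of their traffic that crosses the chiplet
    boundary, and give the most boundary-heavy tasks the cores closest
    (Manhattan distance) to the interface node."""
    def dist(a, b):
        return abs(a[0] - b[0]) + abs(a[1] - b[1])
    cores = sorted(mesh, key=lambda c: dist(c, interface))
    ranked = sorted(tasks,
                    key=lambda t: tasks[t][0] / ((tasks[t][0] + tasks[t][1]) or 1),
                    reverse=True)
    return dict(zip(ranked, cores))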
Optimizing the Deployment of Tiny Transformers on Low-Power MCUs
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-18 DOI: 10.1109/TC.2024.3500360
Victor Jean-Baptiste Jung;Alessio Burrello;Moritz Scherer;Francesco Conti;Luca Benini
Transformer networks are rapidly becoming State of the Art (SotA) in many fields, such as Natural Language Processing (NLP) and Computer Vision (CV). As with Convolutional Neural Networks (CNNs), there is a strong push for deploying Transformer models at the extreme edge, ultimately fitting the tiny power budget and memory footprint of Micro-Controller Units (MCUs). However, early approaches in this direction are mostly ad-hoc, platform-specific, and model-specific. This work aims to enable and optimize the flexible, multi-platform deployment of encoder Tiny Transformers on commercial MCUs. We propose a complete framework to perform end-to-end deployment of Transformer models onto single- and multi-core MCUs. Our framework provides an optimized library of kernels to maximize data reuse and avoid unnecessary data marshaling operations in the crucial attention block. A novel Multi-Head Self-Attention (MHSA) inference schedule, named Fused-Weight Self-Attention (FWSA), is introduced, fusing the linear projection weights offline to further reduce the number of operations and parameters. Furthermore, to mitigate the memory peak reached by the computation of the attention map, we present a Depth-First Tiling (DFT) scheme for MHSA tailored to cache-less MCU devices that splits the computation of the attention map into successive steps, never materializing the whole matrix in memory. We evaluate our framework on three different MCU classes exploiting the ARM and RISC-V Instruction Set Architectures (ISAs), namely the STM32H7 (ARM Cortex-M7), the STM32L4 (ARM Cortex-M4), and GAP9 (RV32IMC-XpulpV2). We reach an average of 4.79× and 2.0× lower latency compared to the SotA libraries CMSIS-NN (ARM) and PULP-NN (RISC-V), respectively. Moreover, we show that our MHSA depth-first tiling scheme reduces the memory peak by up to 6.19×, while the fused-weight attention reduces the runtime by 1.53× and the number of parameters by 25%. Leveraging the optimizations proposed in this work, we run end-to-end inference of three SotA Tiny Transformers for three applications characterized by different input dimensions and network hyperparameters. We report significant improvements across the networks: for instance, when executing a transformer block for the task of radar-based hand-gesture recognition on GAP9, we achieve a latency of 0.14 ms and an energy consumption of 4.92 μJ, 2.32× lower than the SotA PULP-NN library on the same platform.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 526-541
Citations: 0
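One plausible reading of "fusing the linear projection weights offline" follows from the algebra of the attention-score computation: Q·Kᵀ = (x·Wq)(x·Wk)ᵀ = x·(Wq·Wkᵀ)·xᵀ, so the product Wf = Wq·Wkᵀ can be precomputed, replacing two run-time projections with one. A minimal pure-Python sketch (small list-of-lists matrices; this is a reconstruction of the identity, not the paper's kernel implementation):

```python
def matmul(A, B):
    """Naive list-of-lists matrix multiply."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(r) for r in zip(*A)]

def attention_scores_standard(x, Wq, Wk):
    # Q = x Wq, K = x Wk, scores = Q K^T  -> two projections at run time
    return matmul(matmul(x, Wq), transpose(matmul(x, Wk)))

def fuse_weights(Wq, Wk):
    # offline fusion: Wf = Wq Wk^T
    return matmul(Wq, transpose(Wk))

def attention_scores_fwsa(x, Wf):
    # scores = x Wf x^T  -> a single projection of x at run time
    return matmul(matmul(x, Wf), transpose(x))
```

Both paths produce identical scores; the saving comes from doing the Wq·Wkᵀ product once, offline, instead of two projections per inference.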
Evaluating GPU's Instruction-Level Error Characteristics Under Low Supply Voltages
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-18 DOI: 10.1109/TC.2024.3500366
Jingweijia Tan;Jiashuo Wang;Kaige Yan;Xiaohui Wei;Xin Fu
Supply voltage underscaling has been an effective approach to improving the energy efficiency of modern high-performance processors, such as GPUs. However, energy efficiency and reliability are two sides of a trade-off. Undervolting inevitably undermines reliability, since it eats into the voltage guardbands that chip manufacturers design to ensure correct operation under worst-case scenarios. To achieve optimal energy efficiency while maintaining sufficient reliability, it is necessary to deeply understand the error characteristics caused by undervolting. Unlike previous works, which focus mostly on the program level, we perform the first comprehensive instruction-level evaluation of voltage margins and error characteristics for GPU architectures. We systematically measure the error probability and error patterns of GPU instructions during undervolting. We also analyze the impact of locations (SMs, threads, and bits) and operand data values on the error characteristics. Based on our observations, we reduce the voltage to the minimum safe limit for different instructions, which achieves 18.37% energy savings, and we further propose an error detection strategy that reduces performance and energy overhead by 14.8% with a negligible 0.01% degradation in error detection rate.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 555-568
Citations: 0
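Picking a per-instruction minimum safe voltage from measured error probabilities can be sketched as below. The data shape and the zero-error threshold are illustrative assumptions; the paper's actual characterization methodology is far more involved:

```python
def min_safe_voltage(error_prob, threshold=0.0):
    """error_prob: dict mapping supply voltage (mV) -> measured error
    probability for one instruction. Walk downward from the nominal
    (highest) voltage and return the lowest voltage at which the error
    probability stays within `threshold`; stop at the first unsafe level
    so measurement noise below it cannot fake a safe setting."""
    safe = None
    for v in sorted(error_prob, reverse=True):
        if error_prob[v] <= threshold:
            safe = v
        else:
            break
    return safe  # None if even the nominal voltage shows errors
```

For example, measurements of {1000 mV: 0, 950 mV: 0, 900 mV: 0.02} give 950 mV as the minimum safe limit for that instruction.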
Energy-Efficient, Delay-Constrained Edge Computing of a Network of DNNs
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-18 DOI: 10.1109/TC.2024.3500368
Mehdi Ghasemi;Soroush Heidari;Young Geun Kim;Carole-Jean Wu;Sarma Vrudhula
This paper presents a novel approach for executing the inference of a network of pre-trained deep neural networks (DNNs) on commercial-off-the-shelf devices deployed at the edge. The problem is to partition the computation of the DNNs between an energy-constrained, performance-limited edge device E and an energy-unconstrained, higher-performance device C, referred to as the cloudlet, with the objective of minimizing the energy consumption of E subject to a deadline constraint. The proposed partitioning algorithm takes into account the performance profiles of executing DNNs on the devices, the power consumption profiles, and the variability in the delay of the wireless channel. The algorithm is demonstrated on a platform that consists of an NVIDIA Jetson Nano as the edge device E and a Dell workstation with a Titan Xp GPU as the cloudlet. Experimental results show significant improvements both in the energy consumption of E and in the processing delay of the application. Additionally, it is shown how the energy-optimal solution changes when the deadline constraint is altered. Moreover, the decision-making overhead of our proposed method is significantly lower than that of state-of-the-art Integer Linear Programming (ILP) solutions.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 569-581
Citations: 0
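For a single linear chain of layers, the core optimization — minimize edge energy subject to a deadline — reduces to choosing a cut point by exhaustive scan over profiled per-layer costs. A simplified sketch under that assumption (the paper handles a whole network of DNNs and wireless-delay variability; the cost vectors here are hypothetical):

```python
def best_split(edge_e, edge_t, cloud_t, tx_t, tx_e, deadline):
    """Choose the cut index k: layers [0, k) run on the edge device E,
    the activation at cut k is transmitted, layers [k, n) run on the
    cloudlet C. edge_e/edge_t/cloud_t are per-layer energy/time profiles
    (length n); tx_t/tx_e (length n+1) are transmission time/energy at
    each cut, with index n covering the all-on-edge case (often 0).
    Returns (min edge energy, k) over cuts meeting the deadline, or None."""
    n = len(edge_t)
    best = None
    for k in range(n + 1):
        t = sum(edge_t[:k]) + tx_t[k] + sum(cloud_t[k:])
        e = sum(edge_e[:k]) + tx_e[k]          # only E's energy matters
        if t <= deadline and (best is None or e < best[0]):
            best = (e, k)
    return best
```

In a two-layer toy profile, shipping raw input (cut 0) can blow the deadline on transmission time, while cutting after layer 1 meets it at lower edge energy than running everything locally — exactly the trade-off the deadline constraint arbitrates.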
Performance Characteristics and Guidelines of Offloading Middleboxes Onto BlueField-2 DPU
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-18 DOI: 10.1109/TC.2024.3500372
Fuliang Li;Qin Chen;Jiaxing Shen;Xingwei Wang;Jiannong Cao
With the rapid growth in data center network bandwidth far outpacing improvements in CPU performance, traditional software middleboxes running on servers have become inefficient. Emerging data processing units (DPUs) aim to address this by offloading network functions from the CPU. However, as DPUs are still a new technology, their capabilities for accelerating middleboxes lack comprehensive evaluation. This paper benchmarks and analyzes the performance of offloading middleboxes onto the NVIDIA BlueField-2 DPU. Three key DPU capabilities are explored: flow-table offloading, ARM-subsystem packet processing, and connection-tracking hardware offload. By applying these to implement representative middleboxes for firewalling, packet scheduling, and load balancing, their performance is characterized and compared to conventional CPU-based versions. Results reveal the high throughput of flow-table offloading for stateless firewalls, but also its limitations as pipeline depth increases. Packet scheduling on the ARM cores is shown to currently reduce performance versus CPU-based scheduling. Finally, while connection-tracking hardware offload boosts load balancer bandwidth, it also weakens connection-creation capability. Key lessons on efficient middlebox offloading strategies with DPUs are provided to guide further research and development. Overall, this paper offers useful benchmarking and analysis of emerging DPUs for accelerating middleboxes in modern data centers.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 609-622
Citations: 0
Scaling Persistent In-Memory Key-Value Stores Over Modern Tiered, Heterogeneous Memory Hierarchies
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-15 DOI: 10.1109/TC.2024.3500352
Miao Cai;Junru Shen;Yifan Yuan;Zhihao Qu;Baoliu Ye
Recent advances in ultra-fast non-volatile memories (e.g., 3D XPoint) and high-speed interconnect fabrics (e.g., RDMA) enable a high-performance tiered, heterogeneous memory system, effectively overcoming the cost, scaling, and capacity limitations in DRAM-based key-value stores. To fully unleash the performance potential of such memory systems, this paper presents BonsaiKV+, a key-value store that makes the best use of the different components in a modern RDMA-enabled heterogeneous memory system. The core of BonsaiKV+ is a tri-layer architecture that achieves efficient, elastic scaling up/out using a set of novel mechanisms and techniques—pipelined tiered indexing, NVM congestion control mechanisms, fine-grained data striping, and NUMA-aware data management—to leverage hardware strengths and tackle device deficiencies. We compare BonsaiKV+ with state-of-the-art key-value stores using a variety of YCSB workloads. Evaluation results demonstrate that BonsaiKV+ outperforms others by up to 7.30×, 18.89×, and 13.67× in read-, write-, and scan-intensive scenarios, respectively.
IEEE Transactions on Computers, vol. 74, no. 2, pp. 495-509
Citations: 0
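Fine-grained data striping, one of the techniques listed above, can be illustrated with a small round-robin sketch: a large value is cut into fixed-size strips that are dealt across memory devices so one write engages several DIMMs in parallel. The strip size, lane layout, and byte-level framing here are illustrative assumptions, not BonsaiKV+'s actual on-NVM format:

```python
def stripe(value: bytes, n_devices: int, unit: int = 8):
    """Split `value` into `unit`-byte strips and deal them round-robin
    across devices; returns one contiguous buffer per device."""
    lanes = [bytearray() for _ in range(n_devices)]
    for i in range(0, len(value), unit):
        lanes[(i // unit) % n_devices] += value[i:i + unit]
    return [bytes(lane) for lane in lanes]

def unstripe(lanes, total_len, unit: int = 8):
    """Reassemble the original value by reading strips back in the same
    round-robin order."""
    out = bytearray()
    offsets = [0] * len(lanes)
    i = 0
    while len(out) < total_len:
        d = i % len(lanes)
        out += lanes[d][offsets[d]:offsets[d] + unit]
        offsets[d] += unit
        i += 1
    return bytes(out)
```

Round-tripping any payload through `stripe`/`unstripe` is lossless; the point of the layout is that the strips on different devices can be written or read concurrently.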
SADIMM: Accelerating Sparse Attention Using DIMM-Based Near-Memory Processing
IF 3.6 CAS Tier 2 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date: 2024-11-15 DOI: 10.1109/TC.2024.3500362
Huize Li;Dan Chen;Tulika Mitra
The self-attention mechanism is the performance bottleneck of Transformer-based language models. In response, researchers have proposed sparse attention to expedite Transformer execution. However, sparse attention involves massive random access, rendering it a memory-intensive kernel. Memory-based architectures, such as near-memory processing (NMP), demonstrate notable performance enhancements in memory-intensive applications. Nonetheless, existing NMP-based sparse attention accelerators deliver suboptimal performance due to hardware and software challenges. On the hardware front, current solutions employ homogeneous logic integration, struggling to support the diverse operations in sparse attention. On the software side, token-based dataflow is commonly adopted, leading to load imbalance after the pruning of weakly connected tokens. To address these challenges, this paper introduces SADIMM, a hardware-software co-designed NMP-based sparse attention accelerator. In hardware, we propose a heterogeneous integration approach to efficiently support the various operations within the attention mechanism. This involves employing different logic units for different operations, thereby improving hardware efficiency. In software, we implement a dimension-based dataflow, dividing input sequences by model dimensions. This approach achieves load balancing after the pruning of weakly connected tokens. Compared to an NVIDIA RTX A6000 GPU, experimental results on BERT, BART, and GPT-2 models demonstrate that SADIMM achieves 48×, 35×, and 37× speedups and 194×, 202×, and 191× energy-efficiency improvements, respectively.
{"title":"SADIMM: Accelerating $underline{text{S}}$S―parse $underline{text{A}}$A―ttention Using $underline{text{DIMM}}$DIMM―-Based Near-Memory Processing","authors":"Huize Li;Dan Chen;Tulika Mitra","doi":"10.1109/TC.2024.3500362","DOIUrl":"https://doi.org/10.1109/TC.2024.3500362","url":null,"abstract":"Self-attention mechanism is the performance bottleneck of Transformer based language models. In response, researchers have proposed sparse attention to expedite Transformer execution. However, sparse attention involves massive random access, rendering it as a memory-intensive kernel. Memory-based architectures, such as <i>near-memory processing</i> (NMP), demonstrate notable performance enhancements in memory-intensive applications. Nonetheless, existing NMP-based sparse attention accelerators face suboptimal performance due to hardware and software challenges. On the hardware front, current solutions employ homogeneous logic integration, struggling to support the diverse operations in sparse attention. On the software side, token-based dataflow is commonly adopted, leading to load imbalance after the pruning of weakly connected tokens. To address these challenges, this paper introduces SADIMM, a hardware-software co-designed NMP-based sparse attention accelerator. In hardware, we propose a heterogeneous integration approach to efficiently support various operations within the attention mechanism. This involves employing different logic units for different operations, thereby improving hardware efficiency. In software, we implement a dimension-based dataflow, dividing input sequences by model dimensions. This approach achieves load balancing after the pruning of weakly connected tokens. 
Compared to NVIDIA RTX A6000 GPU, the experimental results on BERT, BART, and GPT-2 models demonstrate that SADIMM achieves 48<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula>, 35<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula>, 37<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> speedups and 194<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula>, 202<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula>, 191<inline-formula><tex-math>$\boldsymbol{\times}$</tex-math></inline-formula> energy efficiency improvement, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"542-554"},"PeriodicalIF":3.6,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
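The load-imbalance argument in the SADIMM abstract can be illustrated with a small simulation. The sketch below (plain Python; the workload numbers and function names are hypothetical, not SADIMM's actual implementation) partitions a pruned attention workload first by tokens and then by model dimensions, and compares the per-unit load imbalance:

```python
def token_based_loads(retained, n_units):
    """Token-based dataflow: each compute unit owns a contiguous chunk of
    tokens; its load is the attention work left for those tokens after
    weakly connected tokens are pruned."""
    loads = [0] * n_units
    chunk = len(retained) // n_units
    for i, work in enumerate(retained):
        loads[min(i // chunk, n_units - 1)] += work
    return loads

def dimension_based_loads(retained, n_units):
    """Dimension-based dataflow: each unit processes one slice of the model
    dimension for ALL tokens, so every unit sees the same total work."""
    total = sum(retained)
    return [total // n_units] * n_units

seq_len, units = 1024, 8
# Hypothetical pruning outcome: 64 globally attended tokens at the start of
# the sequence keep 256 connections each; the rest keep only 4.
retained = [256] * 64 + [4] * (seq_len - 64)

tb = token_based_loads(retained, units)
db = dimension_based_loads(retained, units)
imbalance_tb = max(tb) / (sum(tb) / units)  # >= 1.0; larger means worse
imbalance_db = max(db) / (sum(db) / units)  # exactly 1.0: perfectly balanced
print(f"token-based imbalance:     {imbalance_tb:.2f}")
print(f"dimension-based imbalance: {imbalance_db:.2f}")
```

Because the surviving heavy tokens cluster in a few token chunks, the token-based split leaves one unit with several times the average load, while the dimension-based split stays balanced by construction.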
Mitigation of Phase Transitions in Self-Organizing NoC for Stable Queueing Dynamics
IF 3.6 CAS Tier 2, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-15 DOI: 10.1109/TC.2024.3500373
Sneha Agarwal;Keshav Goel;Mitali Sinha;Sujay Deb
Most complex cooperative systems, such as networks on chip (NoCs), possess self-organizing properties and exhibit fluctuations in data traffic with similar statistical characteristics across multiple timescales, known as scaling behavior. Abrupt transitions in the scaling behavior of these fluctuations, caused by spikes in data traffic, network congestion, etc., indicate instability in the queueing dynamics of NoC routers. This instability hampers the predictability of real-time flow control mechanisms, leading to unpredictable delays and communication failures. Detecting and mitigating these instabilities, or phase transitions, is crucial in domains requiring stability and real-time control, such as aviation and healthcare. In this paper, we propose a real-time monitoring and characterization strategy for data traffic from influential routers to identify and mitigate impending instabilities before their onset. Leveraging the self-organization characteristic of NoCs, we propose to implement targeted mitigation on influential nodes to achieve network-wide effects. We demonstrate the effectiveness of our strategy on various benchmarks by comparing traffic analysis plots before and after mitigation. Our results show that the proposed phase transition mitigation improves network performance by an average of 39.6% and buffer utilization by an average of 4.62%.
{"title":"Mitigation of Phase Transitions in Self-Organizing NoC for Stable Queueing Dynamics","authors":"Sneha Agarwal;Keshav Goel;Mitali Sinha;Sujay Deb","doi":"10.1109/TC.2024.3500373","DOIUrl":"https://doi.org/10.1109/TC.2024.3500373","url":null,"abstract":"Most complex cooperative systems, such as networks on chip (NoCs), possess self-organizing properties and exhibit fluctuations in data traffic with similar statistical characteristics across multiple timescales, a.k.a., scaling behavior. Abrupt transitions in the scaling behavior of these fluctuations, caused by spikes in data traffic, network congestion, etc., indicate instability in the queueing dynamics of NoC routers. This instability hampers the predictability of real-time flow control mechanisms, leading to unpredictable delays and communication failures. Detecting and mitigating these instabilities or phase transitions is crucial in domains requiring stability and real-time control, such as aviation and healthcare. In this paper, we propose a real-time monitoring and characterization strategy for data traffic from influential routers to identify and mitigate impending instabilities before their onset. Leveraging the self-organization characteristic of NoCs, we propose to implement targeted mitigation on influential nodes to achieve network-wide effects. We demonstrate the effectiveness of our strategy on various benchmarks by comparing traffic analysis plots before and after mitigation. 
Our results show that the proposed phase transition mitigation improves the network performance by an average of 39.6% and buffer utilization by an average of 4.62%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"623-636"},"PeriodicalIF":3.6,"publicationDate":"2024-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106761","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
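The idea of an abrupt transition in scaling behavior can be made concrete with a toy monitor. The sketch below is an assumed aggregated-variance estimator of a Hurst-like scaling exponent, not the paper's actual characterization strategy: it estimates the exponent per window of a traffic trace and flags the first window where the exponent jumps.

```python
import math
import random

def scaling_exponent(series, scales=(1, 2, 4, 8)):
    """Aggregated-variance estimate of a Hurst-like scaling exponent:
    average the series over blocks of several sizes and fit log(variance)
    against log(block size). For a self-similar process,
    var(X^(m)) ~ m^(2H-2), so H = 1 + slope / 2."""
    xs, ys = [], []
    for m in scales:
        agg = [sum(series[i:i + m]) / m
               for i in range(0, len(series) - m + 1, m)]
        mean = sum(agg) / len(agg)
        var = sum((a - mean) ** 2 for a in agg) / len(agg)
        if var > 0:
            xs.append(math.log(m))
            ys.append(math.log(var))
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    slope = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
             / sum((x - mx) ** 2 for x in xs))
    return 1 + slope / 2

def detect_transition(series, window=256, threshold=0.2):
    """Flag the start of the first window whose exponent jumps by more
    than `threshold` relative to the previous window."""
    prev = None
    for start in range(0, len(series) - window + 1, window):
        h = scaling_exponent(series[start:start + window])
        if prev is not None and abs(h - prev) > threshold:
            return start
        prev = h
    return None

random.seed(1)
calm = [random.gauss(0, 1) for _ in range(512)]  # uncorrelated traffic, H near 0.5
walk, x = [], 0.0
for _ in range(512):                             # strongly correlated burst, H near 1
    x += random.gauss(0, 1)
    walk.append(x)
onset = detect_transition(calm + walk)
print("phase transition detected near sample:", onset)
```

A real monitor would run online over router queue occupancies rather than offline over a synthetic trace, but the estimator captures the essential signal: the exponent stays flat while fluctuations are statistically similar across timescales and jumps when that scaling breaks.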
CPI: A Collaborative Partial Indexing Design for Large-Scale Deduplication Systems
IF 3.6 CAS Tier 2, Computer Science, Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-08 DOI: 10.1109/TC.2024.3485238
Yixun Wei;Zhichao Cao;David H. C. Du
Data deduplication relies on a chunk index to identify the redundancy of incoming chunks. As backup data scales, it is impractical to maintain the entire chunk index in memory. Consequently, an index lookup needs to search a portion of the on-storage index, causing a dramatic regression in index lookup throughput. Existing studies propose to search a subset of the whole index (a partial index) to limit storage I/Os and guarantee high index lookup throughput. However, several core factors of partial index design are not fully exploited. In this paper, we first comprehensively investigate the trade-offs of using different meta-groups, sampling methods, and meta-group selection policies for a partial index. We then propose a Collaborative Partial Index (CPI), which takes advantage of two meta-groups, recipe-segment and container-catalog, to achieve more efficient and effective unique chunk identification. CPI further introduces a hook-entry sharing technique and a two-stage eviction policy to reduce memory usage without hurting the deduplication ratio. According to our evaluation, with the same constraints on memory usage and storage I/O, CPI achieves a 1.21x-2.17x higher deduplication ratio than state-of-the-art partial indexing schemes. Alternatively, CPI achieves 1.8x-4.98x higher index lookup throughput than the others when the same deduplication ratio is achieved. Compared with full indexing, CPI's maximum deduplication ratio is only 4.07% lower, but its throughput is 37.1x-122.2x that of full indexing, depending on the storage I/O constraints in our evaluation cases.
{"title":"CPI: A Collaborative Partial Indexing Design for Large-Scale Deduplication Systems","authors":"Yixun Wei;Zhichao Cao;David H. C. Du","doi":"10.1109/TC.2024.3485238","DOIUrl":"https://doi.org/10.1109/TC.2024.3485238","url":null,"abstract":"Data deduplication relies on a chunk index to identify the redundancy of incoming chunks. As backup data scales, it is impractical to maintain the entire chunk index in memory. Consequently, an index lookup needs to search the portion of the on-storage index, causing a dramatic regression of index lookup throughput. Existing studies propose to search a subset of the whole index (partial index) to limit the storage I/Os and guarantee a high index lookup throughput. However, several core factors of designing partial indexing are not fully exploited. In this paper, we first comprehensively investigate the trade-offs of using different meta-groups, sampling methods, and meta-group selection policies for a partial index. We then propose a Collaborative Partial Index (CPI) which takes advantage of two meta-groups including recipe-segment and container-catalog to achieve more efficient and effective unique chunk identification. CPI further introduces a hook-entry sharing technology and a two-stage eviction policy to reduce memory usage without hurting the deduplication ratio. According to evaluation, with the same constraints of memory usage and storage I/O, CPI achieves a 1.21x-2.17x higher deduplication ratio than the state-of-the-art partial indexing schemes. Alternatively, CPI achieves 1.8X-4.98x higher index lookup throughput than others when the same deduplication ratio is achieved. 
Compared with full indexing, CPI's maximum deduplication ratio is only 4.07% lower but its throughput is 37.1x - 122.2x of that of full indexing depending on different storage I/O constraints in our evaluation cases.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 2","pages":"483-494"},"PeriodicalIF":3.6,"publicationDate":"2024-11-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143106578","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
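The general partial-indexing idea behind CPI can be sketched in a few lines. The class below is a minimal illustration, not the actual CPI design: it keeps only sampled in-memory "hook" fingerprints plus a bounded segment cache with LRU eviction, and all class and method names are hypothetical.

```python
from collections import OrderedDict

class PartialIndex:
    """Generic partial-index sketch: only every k-th fingerprint of a
    segment is kept in memory as a 'hook'; a hook hit fetches the owning
    segment's full fingerprint list into a small LRU cache, so duplicates
    near the hook are found without further storage I/O."""

    def __init__(self, sample_rate=4, cache_segments=2):
        self.sample_rate = sample_rate
        self.hooks = {}             # sampled fingerprint -> segment id
        self.segments = {}          # segment id -> fingerprints ("on storage")
        self.cache = OrderedDict()  # segment id -> set of fingerprints (LRU)
        self.cache_segments = cache_segments
        self.storage_reads = 0

    def add_segment(self, seg_id, fingerprints):
        self.segments[seg_id] = list(fingerprints)
        for i, fp in enumerate(fingerprints):
            if i % self.sample_rate == 0:       # keep only sampled hooks
                self.hooks[fp] = seg_id

    def _load(self, seg_id):
        if seg_id not in self.cache:
            self.storage_reads += 1             # one storage I/O per fetch
            self.cache[seg_id] = set(self.segments[seg_id])
            if len(self.cache) > self.cache_segments:
                self.cache.popitem(last=False)  # evict least recently used
        self.cache.move_to_end(seg_id)

    def is_duplicate(self, fp):
        for fps in self.cache.values():         # cached segments: no I/O
            if fp in fps:
                return True
        if fp in self.hooks:                    # hook hit: fetch the segment
            self._load(self.hooks[fp])
            return True
        return False                            # may miss a real duplicate

idx = PartialIndex()
idx.add_segment("seg0", [f"fp{i}" for i in range(8)])
assert idx.is_duplicate("fp0")  # hook hit: loads seg0 from "storage"
assert idx.is_duplicate("fp3")  # non-hook, but served from the cached segment
print("storage reads:", idx.storage_reads)
```

A lookup of a non-hook fingerprint whose segment is not cached returns False even when the chunk exists, which is exactly the trade the abstract describes: a slightly lower deduplication ratio in exchange for bounded memory and storage I/O.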