Sanle Zhao;Yujuan Tan;Zhaoyang Zeng;Jing Yu;Zhuoxin Bai;Ao Ren;Xianzhang Chen;Duo Liu
Shared cache systems have become increasingly crucial, especially in cloud services, where the Miss Ratio Curve (MRC) is a widely used tool for evaluating cache performance. The MRC depicts the relationship between the cache miss ratio and the cache size, indicating how cache performance trends with varying cache sizes. Recent advancements have enabled efficient MRC construction for stack replacement policies. For non-stack policies, miniature simulation downsizes the actual cache size and data stream through spatially hashed sampling, providing a general method for MRC construction. However, this approach still faces significant challenges. Firstly, constructing an MRC requires numerous mini-caches to obtain miss ratios, consuming significant cache resources and incurring tremendous memory and computing overhead. Secondly, it cannot adapt to dynamic I/O workloads, resulting in less precise MRCs. To address these issues, we propose LAShards, a low-overhead and self-adaptive MRC construction method for non-stack replacement policies. The key idea behind LAShards is to exploit the locality and burstiness in access patterns. It can statically reduce memory usage and dynamically adapt to workloads. Compared to previous works, LAShards reduces memory usage by up to 20× and increases throughput by up to 10×.
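For readers unfamiliar with miniature simulation, the Python sketch below illustrates the spatially hashed sampling it builds on: a key is kept only if its hash falls below the sampling rate, the kept keys drive a mini-cache scaled down by the same rate, and the mini-cache's miss ratio estimates one point of the MRC. This is a minimal illustration of the general technique, not LAShards itself; the hash function, the 1% sampling rate, and the LRU stand-in policy are assumptions made for the example.

import hashlib
from collections import OrderedDict

def spatial_sample(key, rate):
    # Hash the key to a value in [0, 1); keep the key only if it falls under the sampling rate.
    h = int(hashlib.md5(str(key).encode()).hexdigest(), 16)
    return (h % 10**6) / 10**6 < rate

def mini_cache_miss_ratio(trace, full_cache_size, rate):
    # Downsize both the reference stream (by spatial sampling) and the cache (by the same rate).
    mini_size = max(1, int(full_cache_size * rate))
    cache, hits, refs = OrderedDict(), 0, 0
    for key in trace:
        if not spatial_sample(key, rate):
            continue
        refs += 1
        if key in cache:
            hits += 1
            cache.move_to_end(key)          # LRU update (stand-in replacement policy for the example)
        else:
            if len(cache) >= mini_size:
                cache.popitem(last=False)   # evict the least-recently-used key
            cache[key] = True
    return 1.0 - hits / refs if refs else 0.0

# One point of an MRC: estimated miss ratio at a 10,000-entry cache, sampling 1% of the keys.
trace = [i % 3000 for i in range(100000)]
print(mini_cache_miss_ratio(trace, full_cache_size=10000, rate=0.01))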
{"title":"LAShards: Low-Overhead and Self-Adaptive MRC Construction for Non-Stack Algorithms","authors":"Sanle Zhao;Yujuan Tan;Zhaoyang Zeng;Jing Yu;Zhuoxin Bai;Ao Ren;Xianzhang Chen;Duo Liu","doi":"10.1109/TC.2025.3590811","DOIUrl":"https://doi.org/10.1109/TC.2025.3590811","url":null,"abstract":"Shared cache systems have become increasingly crucial, especially in cloud services, where the Miss Ratio Curve (MRC) is a widely used tool for evaluating cache performance. The MRC depicts the relationship between the cache miss ratio and cache size, indicating how cache performance trends with varying cache sizes. Recent advancements have enabled efficient MRC construction for stack replacement policies. For non-stack policies, miniature simulation downsizes the actual cache size and data stream through spatially hashed sampling, providing a general method for MRC construction. However, this approach still faces significant challenges. Firstly, constructing an MRC requires numerous mini-caches to obtain miss ratios, consuming significant cache resources, leading to tremendous memory and computing overhead. Secondly, it cannot adapt to the dynamic I/O workloads, resulting in less precise MRC. To address these issues, we propose LAShards, a low-overhead and self-adaptive MRC construction method for non-stack replacement policies. The key idea behind LAShards is to exploit the locality and burstiness in access patterns. It can statically reduce memory usage and dynamically adapt to workloads. Compared to previous works, LAShards can save up to <inline-formula><tex-math>$20boldsymbol{times}$</tex-math></inline-formula> of memory resources, and increase throughput by up to <inline-formula><tex-math>$10boldsymbol{times}$</tex-math></inline-formula>.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3490-3503"},"PeriodicalIF":3.8,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061865","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chong Wang;Wanyi Fu;Jiangwei Zhang;Shiyao Li;Rui Hou;Jian Yang;Yu Wang
The rapid advancement of Transformer-based large language models (LLMs) is presenting significant challenges for their deployment, primarily due to their enormous parameter sizes and intermediate results, which create a bottleneck in memory capacity for effective inference. Compared to traditional DRAM, Non-Volatile Memory (NVM) technologies such as Resistive Random-Access Memory (RRAM) and Phase-Change Memory (PCM) offer higher integration density, making them promising alternatives. However, before NVM can be widely adopted, its reliability issues, particularly manufacturing defects and endurance faults, must be addressed. In response to the limited memory capacity and reliability challenges of deploying LLMs in NVM, we introduce a novel low-overhead weight-level map, named Wolf. Wolf not only integrates the addresses of faulty weights to support efficient fault tolerance but also includes the addresses of outlier weights in LLMs. This allows for tensor-wise segmented quantization of both outliers and regular weights, enabling lower-bitwidth quantization. The Wolf framework uses a Bloom Filter-based map to efficiently manage outliers and faults. By employing shared hashes for outliers and faults and specific hashes for faults, Wolf significantly reduces the area overhead. Building on Wolf, we propose a novel fault tolerance method that resolves the observed issue of clustering critical incorrect outliers and fully leverages the inherent resilience of LLMs to improve fault tolerance capabilities. As a result, Wolf achieves segment-wise INT4 quantization with enhanced accuracy. Moreover, Wolf can adeptly handle Bit Error Rates as high as 1×10^-2 without compromising accuracy, in stark contrast to the state-of-the-art approach where accuracy declines by more than 20%.
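As a rough illustration of the map structure described above, the sketch below keeps one bit array in which a set of shared hash positions marks an address as "outlier or fault" and a second, fault-specific set of positions marks it as "fault". It is a toy Bloom-filter map written for this summary, not the Wolf implementation; the bit-array size, hash counts, and hash construction are arbitrary assumptions.

import hashlib

class SharedBloomMap:
    """Toy weight-level map: shared hashes mark 'outlier or fault', extra hashes mark 'fault'."""
    def __init__(self, m=1 << 16, k_shared=3, k_fault=2):
        self.m, self.k_shared, self.k_fault = m, k_shared, k_fault
        self.bits = bytearray(m)

    def _positions(self, addr, k, salt):
        # Derive k bit positions for this weight address from a salted hash.
        for i in range(k):
            h = hashlib.sha256(f"{salt}:{i}:{addr}".encode()).digest()
            yield int.from_bytes(h[:8], "little") % self.m

    def add_outlier(self, addr):
        for p in self._positions(addr, self.k_shared, "shared"):
            self.bits[p] = 1

    def add_fault(self, addr):
        self.add_outlier(addr)  # faults also set the shared positions
        for p in self._positions(addr, self.k_fault, "fault"):
            self.bits[p] = 1

    def is_outlier_or_fault(self, addr):
        return all(self.bits[p] for p in self._positions(addr, self.k_shared, "shared"))

    def is_fault(self, addr):
        return self.is_outlier_or_fault(addr) and all(
            self.bits[p] for p in self._positions(addr, self.k_fault, "fault"))

m = SharedBloomMap()
m.add_outlier(0x1A2B)   # address of an outlier weight
m.add_fault(0x3C4D)     # address of a faulty weight
print(m.is_outlier_or_fault(0x1A2B), m.is_fault(0x1A2B))  # True False (barring false positives)
print(m.is_fault(0x3C4D))                                  # True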
{"title":"WOLF: Weight-Level OutLier and Fault Integration for Reliable LLM Deployment","authors":"Chong Wang;Wanyi Fu;Jiangwei Zhang;Shiyao Li;Rui Hou;Jian Yang;Yu Wang","doi":"10.1109/TC.2025.3587957","DOIUrl":"https://doi.org/10.1109/TC.2025.3587957","url":null,"abstract":"The rapid advancement of Transformer-based large language models (LLMs) is presenting significant challenges for their deployment, primarily due to their enormous parameter sizes and intermediate results, which create a bottleneck in memory capacity for effective inference. Compared to traditional DRAM, Non-Volatile Memory (NVM) technologies such as Resistive Random-Access Memory (RRAM) and Phase-Change Memory (PCM) offer higher integration density, making them promising alternatives. However, before NVM can be widely adopted, its reliability issues, particularly manufacturing defects and endurance faults, must be addressed. In response to the limited memory capacity and reliability challenges of deploying LLMs in NVM, we introduce a novel low-overhead weight-level map, named <small>Wolf</small>. <small>Wolf</small> not only integrates the addresses of faulty weights to support efficient fault tolerance but also includes the addresses of outlier weights in LLMs. This allows for tensor-wise segmented quantization of both outliers and regular weights, enabling lower-bitwidth quantization. The <small>Wolf</small> framework uses a Bloom Filter-based map to efficiently manage outliers and faults. By employing shared hashes for outliers and faults and specific hashes for faults, <small>Wolf</small> significantly reduces the area overhead. Building on <small>Wolf</small>, we propose a novel fault tolerance method that resolves the observed issue of clustering critical incorrect outliers and fully leverages the inherent resilience of LLMs to improve fault tolerance capabilities. As a result, <small>Wolf</small> achieves segment-wise INT4 quantization with enhanced accuracy. Moreover, <small>Wolf</small> can adeptly handle Bit Error Rates as high as <inline-formula><tex-math>$1 {boldsymbol{times}} 10^{-2}$</tex-math></inline-formula> without compromising accuracy, in stark contrast to the state-of-the-art approach where accuracy declines by more than 20%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3390-3403"},"PeriodicalIF":3.8,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061844","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Salonik Resch;Hüsrev Cılasun;Zamshed I. Chowdhury;Masoud Zabihi;Yang Lv;Jian-Ping Wang;Sachin S. Sapatnekar;Ismail Akturk;Ulya R. Karpuzcu
Beyond edge devices can function off the power grid and without batteries, making them suitable for deployment in hard-to-reach environments. As the energy budget is extremely tight, energy-hungry long-distance communication required for offloading computation or reporting results to a server becomes a significant limitation. Based on the observation that the energy required for communication decreases with shorter distances, this paper makes a case for the deployment of secure beyond edge miniservers. These are strategically positioned, lightweight local servers designed to support beyond edge devices without compromising the privacy of sensitive information. We demonstrate that even for relatively small-scale representative computations – which are more likely to fit into the tight power budget of a beyond edge device for local processing – deploying a beyond edge miniserver can lead to higher performance. To this end, we consider representative deployment scenarios of practical importance, including but not limited to agricultural systems or building structures, where beyond edge miniservers enable highly energy-efficient real-time data processing.
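To make the distance argument concrete, the snippet below compares per-bit transmit energy under the classic first-order radio model, in which the amplifier term grows with distance raised to a path-loss exponent. The exponent and energy constants are illustrative assumptions taken from that textbook model, not measurements from the paper.

def tx_energy_per_bit(distance_m, alpha=2.0, e_elec=50e-9, e_amp=100e-12):
    # First-order radio model: fixed electronics cost plus an amplifier cost growing as d^alpha.
    return e_elec + e_amp * distance_m ** alpha

# Offloading to a nearby miniserver (50 m) vs. a far-away server (5 km):
near, far = tx_energy_per_bit(50), tx_energy_per_bit(5000)
print(f"{far / near:.0f}x more energy per bit for the long-range link")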
{"title":"The Case for Secure Miniservers Beyond the Edge","authors":"Salonik Resch;Hüsrev Cılasun;Zamshed I. Chowdhury;Masoud Zabihi;Yang Lv;Jian-Ping Wang;Sachin S. Sapatnekar;Ismail Akturk;Ulya R. Karpuzcu","doi":"10.1109/TC.2025.3589691","DOIUrl":"https://doi.org/10.1109/TC.2025.3589691","url":null,"abstract":"<italic>Beyond edge devices</i> can function off the power grid and without batteries, making them suitable for deployment in hard-to-reach environments. As the energy budget is extremely tight, energy-hungry long-distance communication required for offloading computation or reporting results to a server becomes a significant limitation. Based on the observation that the energy required for communication decreases with shorter distances, this paper makes a case for the deployment of <italic>secure beyond edge miniservers</i>. These are strategically positioned, lightweight local servers designed to support beyond edge devices without compromising the privacy of sensitive information. We demonstrate that even for relatively small scale representative computations – which are more likely to fit into the tight power budget of a beyond edge device for local processing – deploying a beyond edge miniserver can lead to higher performance. To this end, we consider representative deployment scenarios of practical importance, including but not limited to agricultural systems or building structures, where beyond edge miniservers enable highly energy-efficient real-time data processing.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3448-3461"},"PeriodicalIF":3.8,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061808","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The design and optimization of network topologies play a critical role in ensuring the performance and efficiency of high-performance computing (HPC) systems. Traditional topology designs often fall short in satisfying the stringent requirements of HPC environments, particularly with respect to fault tolerance, latency, and bandwidth. To address these limitations, we propose a novel class of hierarchical networks, termed Hypercube-Structured Hierarchical Networks (HHNs). This architecture generalizes and extends existing architectures such as half hypercube networks and complete cubic networks, while also introducing previously unexplored hierarchical designs. HHNs exhibit several advantages, particularly in high-performance computing. Most notably, their high connectivity enables efficient parallel data processing, and their hierarchical structure supports scalability to accommodate growing computational demands. Furthermore, we present a unicast routing strategy and a broadcast algorithm for HHNs. A fault-tolerant algorithm is also designed based on the construction of disjoint paths. Experimental evaluations demonstrate that HHNs consistently outperform mainstream architectures in critical performance metrics, including scalability, latency, and robustness to failures.
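As background for the unicast routing discussed above, the sketch below shows the standard hypercube shortest-path routine (flip the differing address bits dimension by dimension) that cube-structured hierarchical topologies typically build on. It is a generic textbook building block, shown only for orientation, not the HHN routing algorithm itself.

def hypercube_route(src, dst, n_dims):
    """Classic hypercube routing: correct the differing address bits one dimension at a time.
    The path length equals the Hamming distance between the two node addresses."""
    path, cur = [src], src
    diff = src ^ dst
    for d in range(n_dims):
        if diff >> d & 1:
            cur ^= 1 << d        # traverse the link along dimension d
            path.append(cur)
    return path

print(hypercube_route(0b0000, 0b1011, 4))  # [0, 1, 3, 11]: three hops for Hamming distance 3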
{"title":"A Highly Reliable Multiplexing Scheme in Hypercube-Structured Hierarchical Networks","authors":"Xuanli Liu;Zhenjiang Dong;Weibei Fan;Mengjie Lv;Xueli Sun;Jin Qi;Sun-Yuan Hsieh","doi":"10.1109/TC.2025.3589732","DOIUrl":"https://doi.org/10.1109/TC.2025.3589732","url":null,"abstract":"The design and optimization of network topologies play a critical role in ensuring the performance and efficiency of high-performance computing (HPC) systems. Traditional topology designs often fall short in satisfying the stringent requirements of HPC environments, particularly with respect to fault tolerance, latency, and bandwidth. To address these limitations, we propose a novel class of hierarchical networks, termed Hypercube-Structured Hierarchical Networks (HHNs). This architecture generalizes and extends existing architectures such as half hypercube networks and complete cubic networks, while also introducing previously unexplored hierarchical designs. HHNs exhibit several advantages, particularly in high-performance computing. Most notably, their high connectivity enables efficient parallel data processing, and their hierarchical structure supports scalability to accommodate growing computational demands. Furthermore, we present a unicast routing strategy and a broadcast algorithm for HHNs. A fault-tolerant algorithm is also designed based on the construction of disjoint paths. Experimental evaluations demonstrate that HHNs consistently outperform mainstream architectures in critical performance metrics, including scalability, latency, and robustness to failures.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3462-3475"},"PeriodicalIF":3.8,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061886","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optical Data Center Networks (ODCNs) are high-performance interconnect architectures for parallel and distributed computing, providing higher bandwidth and lower power consumption. However, current optical DCNs struggle to achieve both high scalability and incremental scalability simultaneously. In this paper, we propose an extended Exchanged hyperCube, denoted ExCube, a highly scalable network architecture for optical data centers. Firstly, we detail the addressing scheme and construction method for ExCube, which offers flexible scalability modes, including exponential, linear, and composite scalability, to meet diverse scalability requirements. In particular, the diameter of ExCube remains unchanged as its size increases linearly, indicating superior incremental scalability. Secondly, an efficient routing algorithm with linear time complexity is presented to determine the shortest path between any two different ToRs in ExCube. Additionally, we propose a per-flow scheduling algorithm based on disjoint paths to further enhance the performance of ExCube. The optical devices in ExCube are identical to those in existing optical DCNs, such as WaveCube and OSA, facilitating its construction. Experimental results demonstrate that ExCube outperforms WaveCube in terms of throughput and reduces data transmission time by 5%-35%. Further analysis reveals that ExCube maintains performance comparable to WaveCube across several critical metrics, including low diameter and link complexity. Compared with advanced networks, ExCube reduces overall cost and energy consumption by 36.7% and 46.5%, respectively.
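To illustrate the flavor of per-flow scheduling over disjoint paths, the sketch below assigns each arriving flow to the least-loaded of its precomputed disjoint candidate paths. The path set, the bottleneck-load metric, and the greedy choice are assumptions made for this example; ExCube's actual scheduler and path construction are defined in the paper.

def schedule_flow(candidate_paths, link_load, demand):
    """Assign one flow to the candidate path with the smallest bottleneck load,
    then charge the flow's demand to every link on the chosen path."""
    def bottleneck(path):
        return max(link_load.get(link, 0.0) for link in zip(path, path[1:]))
    best = min(candidate_paths, key=bottleneck)
    for link in zip(best, best[1:]):
        link_load[link] = link_load.get(link, 0.0) + demand
    return best

load = {}
paths = [[0, 1, 3], [0, 2, 3]]                 # two disjoint ToR-level paths between ToR 0 and ToR 3
print(schedule_flow(paths, load, demand=0.4))  # first flow takes the first path
print(schedule_flow(paths, load, demand=0.4))  # second flow avoids the now-loaded path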
{"title":"A Highly Scalable Network Architecture for Optical Data Centers","authors":"Weibei Fan;Yao Pan;Fu Xiao;Pinchang Zhang;Lei Han;Sun-Yuan Hsieh","doi":"10.1109/TC.2025.3589688","DOIUrl":"https://doi.org/10.1109/TC.2025.3589688","url":null,"abstract":"Optical Data Center Networks (ODCNs) are high-performance interconnect architectures in parallel and distributed computing, providing higher bandwidth and lower power consumption. However, current optical DCNs struggle to achieve both high scalability and incremental scalability simultaneously. In this paper, we propose an extended <italic>Ex</i>changed hyper<italic>Cube</i>, denoted by ExCube, which is a highly scalable network architecture for optical data centers. Firstly, we detail the address scheme and constructing method for ExCube, including exponential, linear, and composite scalability, which can adapt to different scalability requirements. ExCube boasts flexible scalability modes, including exponential, linear, and composite scalability, meeting diverse scalability requirements. In particular, the diameter of ExCube remains unchanged as its size increases linearly, indicating superior incremental scalability. Secondly, an efficient routing algorithm with linear time complexity is presented to determine the shortest path between any two different ToRs in ExCube. Additionally, we propose a per-flow scheduling algorithm based on the disjoint paths to enhance the performance of ExCube. The optical devices in ExCube are identical to those in existing optical DCNs, such as WaveCube and OSA, facilitating its construction. Experimental results demonstrate that ExCube outperforms WaveCube in terms of throughput and reduces data transmission time by 5%-35%. Further analysis reveals that ExCube maintains comparable performance to WaveCube across several critical metrics, including low diameter and link complexity. Compared with advanced networks, the overall cost-effectiveness and energy efficiency of ExCube have been reduced by 36.7% and 46.5%, respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3433-3447"},"PeriodicalIF":3.8,"publicationDate":"2025-07-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present AdaptDQC, an adaptive compiler framework for optimizing distributed quantum computing (DQC) under diverse performance metrics and inter-chip communication (ICC) architectures. AdaptDQC leverages a novel spatial-temporal graph model to describe quantum circuits, model ICC architectures, and quantify critical performance metrics in DQC systems, yielding a systematic and adaptive approach to constructing circuit-partitioning and chip-mapping strategies that admit hybrid ICC architectures and are optimized against various objectives. Experimental results on a collection of benchmarks show that AdaptDQC outperforms state-of-the-art compiler frameworks: It reduces, on average, the communication cost by up to 35.4% and the latency by up to 38.4%.
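A concrete, if simplified, view of the partitioning objective: given a circuit's two-qubit gates and a qubit-to-chip mapping, count the gates whose operands land on different chips, since those are the ones that require inter-chip communication. The sketch below shows only this basic cost; AdaptDQC's spatial-temporal graph model and hybrid ICC handling are much richer and are not reproduced here.

def inter_chip_gates(two_qubit_gates, chip_of):
    """Count two-qubit gates whose operand qubits are mapped to different chips."""
    return sum(1 for q0, q1 in two_qubit_gates if chip_of[q0] != chip_of[q1])

gates = [(0, 1), (1, 2), (2, 3), (0, 3)]       # two-qubit gate endpoints of a toy 4-qubit circuit
mapping_a = {0: "A", 1: "A", 2: "B", 3: "B"}   # qubits 0,1 on chip A; qubits 2,3 on chip B
mapping_b = {0: "A", 1: "B", 2: "A", 3: "B"}
print(inter_chip_gates(gates, mapping_a))      # 2 crossing gates
print(inter_chip_gates(gates, mapping_b))      # 4 crossing gates: a worse partition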
{"title":"AdaptDQC: Adaptive Distributed Quantum Computing With Quantitative Performance Analysis","authors":"Debin Xiang;Liqiang Lu;Siwei Tan;Xinghui Jia;Zhe Zhou;Guangyu Sun;Mingshuai Chen;Jianwei Yin","doi":"10.1109/TC.2025.3586027","DOIUrl":"https://doi.org/10.1109/TC.2025.3586027","url":null,"abstract":"We present AdaptDQC, an adaptive compiler framework for optimizing distributed quantum computing (DQC) under diverse performance metrics and inter-chip communication (ICC) architectures. AdaptDQC leverages a novel spatial-temporal graph model to describe quantum circuits, model ICC architectures, and quantify critical performance metrics in DQC systems, yielding a systematic and adaptive approach to constructing circuit-partitioning and chip-mapping strategies that admit hybrid ICC architectures and are optimized against various objectives. Experimental results on a collection of benchmarks show that AdaptDQC outperforms state-of-the-art compiler frameworks: It reduces, on average, the communication cost by up to 35.4% and the latency by up to 38.4%.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3277-3290"},"PeriodicalIF":3.8,"publicationDate":"2025-07-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11080164","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Shiyan Yi;Yudi Qiu;Guohao Xu;Lingfei Lu;Xiaoyang Zeng;Yibo Fan
Graph Attention Network (GAT) has gained widespread adoption thanks to its exceptional performance in processing non-Euclidean graphs. The critical components of a GAT model are aggregation and attention, which generate numerous main-memory accesses and occupy a significant share of inference time. Recently, much research has proposed near-memory processing (NMP) architectures to accelerate aggregation. However, graph attention requires additional operations distinct from aggregation, making previous NMP architectures less suitable for supporting GAT, as they typically target aggregation-only workloads. In this paper, we propose GATe, a practical and efficient GAT accelerator with an NMP architecture. To the best of our knowledge, this is the first work to accelerate both attention and aggregation computation on DIMMs. We unify feature vector accesses to eliminate the two repetitive memory accesses to source nodes caused by the sequential phase-by-phase execution of attention and aggregation. Next, we refine the computation flow to reduce data dependencies in concatenation and softmax, which lowers on-chip memory usage and communication overhead. Additionally, we introduce a novel sharding method that enhances the data reusability of high-degree nodes. Experiments show that GATe achieves substantial speedups in the GAT attention and aggregation phases of up to 6.77× and 2.46×, with averages of 3.69× and 2.24×, respectively, compared to the state-of-the-art NMP works GNNear and GraNDe.
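The numpy sketch below spells out the two phases of standard single-head GAT math, attention scoring followed by aggregation, and makes visible that both phases read the transformed source-node features, which is the duplicated memory traffic that unified feature-vector access targets. It is a plain software restatement of GAT, not the GATe accelerator; the shapes and edge list are made up for the example.

import numpy as np

def gat_layer(h, edges, W, a, slope=0.2):
    """Single-head GAT over an edge list (src, dst), executed in two phases as on a
    conventional system: attention scores first, then aggregation. Both phases touch
    the transformed source features z[s]."""
    z = h @ W                                              # transformed node features
    # Phase 1: per-edge unnormalized attention with LeakyReLU.
    scores = {}
    for s, d in edges:
        e = np.concatenate([z[d], z[s]]) @ a
        scores[(s, d)] = e if e > 0 else slope * e
    out = np.zeros_like(z)
    for d in set(dd for _, dd in edges):
        nbrs = [s for s, dd in edges if dd == d]
        ex = np.exp([scores[(s, d)] for s in nbrs])
        alpha = ex / ex.sum()                              # softmax over d's in-neighbors
        # Phase 2: aggregation re-reads every source feature z[s].
        out[d] = sum(w * z[s] for w, s in zip(alpha, nbrs))
    return out

h = np.random.rand(4, 8)                                   # 4 nodes, 8 input features
W = np.random.rand(8, 8)
a = np.random.rand(16)                                     # attention vector over concatenated features
print(gat_layer(h, [(0, 1), (2, 1), (3, 1)], W, a).shape)  # (4, 8)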
{"title":"GATe: Efficient Graph Attention Network Acceleration With Near-Memory Processing","authors":"Shiyan Yi;Yudi Qiu;Guohao Xu;Lingfei Lu;Xiaoyang Zeng;Yibo Fan","doi":"10.1109/TC.2025.3588317","DOIUrl":"https://doi.org/10.1109/TC.2025.3588317","url":null,"abstract":"Graph Attention Network (GAT) has gained widespread adoption thanks to its exceptional performance in processing non-Euclidean graphs. The critical components of a GAT model involve aggregation and attention, which cause numerous main-memory access, occupying significant inference time. Recently, much research has proposed near-memory processing (NMP) architectures to accelerate aggregation. However, graph attention requires additional operations distinct from aggregation, making previous NMP architectures less suitable for supporting GAT, as they typically target aggregation-only workloads. In this paper, we propose GATe, a practical and efficient <u>GAT</u> acc<u>e</u>lerator with NMP architecture. To the best of our knowledge, this is the first time that accelerates both attention and aggregation computation on DIMM. We unify feature vector access to eliminate the two repetitive memory accesses to source nodes caused by the sequential phase-by-phase execution of attention and aggregation. Next, we refine the computation flow to reduce data dependencies in concatenation and softmax, which lowers on-chip memory usage and communication overhead. Additionally, we introduce a novel sharding method that enhances data reusability of high-degree nodes. Experiments show that GATe achieves substantial speedup of GAT attention and aggregation phases up to 6.77<inline-formula><tex-math>${boldsymboltimes}$</tex-math></inline-formula> and 2.46<inline-formula><tex-math>${boldsymboltimes}$</tex-math></inline-formula>, with average to 3.69<inline-formula><tex-math>${boldsymboltimes}$</tex-math></inline-formula> and 2.24<inline-formula><tex-math>${boldsymboltimes}$</tex-math></inline-formula>, respectively, compared to state-of-the-art NMP works GNNear and GraNDe.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3419-3432"},"PeriodicalIF":3.8,"publicationDate":"2025-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061822","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Direct current (DC) analysis lies at the heart of integrated circuit design in seeking DC operating points. Although pseudo-transient analysis (PTA) methods have been widely used for DC analysis in both industry and academia, their initial parameters and stepping strategy require expert knowledge and laborious tuning to deliver efficient performance, which hinders their further application. In this paper, we leverage the latest advancements in machine learning to deploy PTA with more efficient setups for different problems. More specifically, active learning, which automatically draws knowledge from other circuits, is used to provide suitable initial parameters for the PTA solver, which are then calibrated on the fly to further accelerate the simulation process using TD3-based reinforcement learning (RL). To expedite model convergence, we introduce dual agents and a public sampling buffer in our RL method to enhance sample utilization. To further improve the learning efficiency of the RL agent, we incorporate imitation learning to improve the reward function and introduce supervised learning to provide a better dual-agent rotation strategy. We make the proposed algorithm a general, out-of-the-box SPICE-like solver and assess it on a variety of circuits, demonstrating up to a 3.10× reduction in Newton-Raphson (NR) iterations for the initial stage and 285.71× for the RL stage.
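The skeleton below sketches only the agent-solver interaction implied above: the agent observes the last Newton-Raphson (NR) iteration count, picks the next pseudo-transient step size, and is rewarded for keeping NR work low. Both the solver and the policy are stubs written for this summary (the TD3 actor-critic networks, dual agents, and replay buffer are omitted), so the numbers mean nothing beyond illustrating the control loop.

import random

def pta_solver_step(step_size):
    """Toy stand-in for one pseudo-transient step of a SPICE-like engine: returns the
    number of NR iterations the step took. Overly aggressive steps fail to converge
    and cost the full iteration limit. Purely illustrative, not a real solver."""
    return 30 if step_size > 4.0 else max(2, int(3 + 2 * step_size + random.random() * 2))

def actor_stub(state):
    """Placeholder for the trained TD3 actor: maps (last NR count, current step size)
    to the next step size. The real actor is a learned network; this rule keeps the
    example runnable."""
    last_nr, step = state
    return step * 1.5 if last_nr < 8 else step * 0.5

step, last_nr, total_nr, rewards = 0.5, 5, 0, []
for _ in range(20):                        # one episode of 20 pseudo-transient steps
    step = actor_stub((last_nr, step))
    last_nr = pta_solver_step(step)
    rewards.append(-last_nr)               # RL objective: minimize NR iterations per step
    total_nr += last_nr
print("total NR iterations over the episode:", total_nr)
print("mean reward:", sum(rewards) / len(rewards))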
{"title":"ML-PTA: A Two-Stage ML-Enhanced Framework for Accelerating Nonlinear DC Circuit Simulation With Pseudo-Transient Analysis","authors":"Zhou Jin;Wenhao Li;Haojie Pei;Xiaru Zha;Yichao Dong;Xiang Jin;Xiao Wu;Dan Niu;Wei W. Xing","doi":"10.1109/TC.2025.3587470","DOIUrl":"https://doi.org/10.1109/TC.2025.3587470","url":null,"abstract":"Direct current (DC) analysis lies at the heart of integrated circuit design in seeking DC operating points. Although pseudo-transient analysis (PTA) methods have been widely used in DC analysis in both industry and academia, their initial parameters and stepping strategy require expert knowledge and labor tuning to deliver efficient performance, which hinders their further applications. In this paper, we leverage the latest advancements in machine learning to deploy PTA with more efficient setups for different problems. More specifically, active learning, which automatically draws knowledge from other circuits, is used to provide suitable initial parameters for PTA solver, and then calibrate on-the-fly to further accelerate the simulation process using TD3-based reinforcement learning (RL). To expedite model convergence, we introduce dual agents and a public sampling buffer in our RL method to enhance sample utilization. To further improve the learning efficiency of the RL agent, we incorporate imitation learning to improve reward function and introduce supervised learning to provide a better dual-agent rotation strategy. We make the proposed algorithm a general out-of-the-box SPICE-like solver and assess it on a variety of circuits, demonstrating up to 3.10<inline-formula><tex-math>$boldsymboltimes$</tex-math></inline-formula> reduction in NR iterations for the initial stage and 285.71<inline-formula><tex-math>$boldsymboltimes$</tex-math></inline-formula> for the RL stage.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3319-3331"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061859","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Balancing energy efficiency and high performance in embedded systems requires fine-tuning hardware and software components to co-optimize their interaction. In this work, we address the automated optimization of memory usage through a compiler toolchain that leverages DMA-aware precision tuning and mathematical function memorization. The proposed solution extends the llvm infrastructure, employing the taffo plugins for precision tuning, with the SeTHet extension for DMA-aware precision tuning and luTHet for automated, DMA-aware mathematical function memorization. We performed an experimental assessment on hero, a heterogeneous platform employing risc-v cores as a parallel accelerator. Our solution enables speedups ranging from 1.5× to 51.1× on AxBench benchmarks that employ trigonometrical functions and 4.23×-48.4× on Polybench benchmarks over the baseline hero platform.
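As an illustration of what mathematical function memorization means in practice, the sketch below replaces sin() with a small precomputed lookup table plus linear interpolation. The table size and interpolation scheme are assumptions for the example; choosing such parameters and placing the table in DMA-managed local memory is the kind of work the described toolchain automates.

import math

TABLE_SIZE = 256
SIN_LUT = [math.sin(2 * math.pi * i / TABLE_SIZE) for i in range(TABLE_SIZE)]  # built once, ahead of time

def sin_lut(x):
    """Approximate sin(x) from a 256-entry lookup table with linear interpolation
    between the two nearest table entries (one period of sin covers the whole table)."""
    t = (x / (2 * math.pi)) % 1.0 * TABLE_SIZE
    i = int(t)
    frac = t - i
    lo, hi = SIN_LUT[i], SIN_LUT[(i + 1) % TABLE_SIZE]
    return lo + frac * (hi - lo)

print(abs(sin_lut(1.234) - math.sin(1.234)))   # error on the order of 1e-4 for this table size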
{"title":"Synergistic Memory Optimisations: Precision Tuning in Heterogeneous Memory Hierarchies","authors":"Gabriele Magnani;Daniele Cattaneo;Lev Denisov;Giuseppe Tagliavini;Giovanni Agosta;Stefano Cherubin","doi":"10.1109/TC.2025.3586025","DOIUrl":"https://doi.org/10.1109/TC.2025.3586025","url":null,"abstract":"Balancing energy efficiency and high performance in embedded systems requires fine-tuning hardware and software components to co-optimize their interaction. In this work, we address the automated optimization of memory usage through a compiler toolchain that leverages DMA-aware precision tuning and mathematical function memorization. The proposed solution extends the <small>llvm</small> infrastructure, employing the <small>taffo</small> plugins for precision tuning, with the <small>SeTHet</small> extension for DMA-aware precision tuning and <small>luTHet</small> for automated, DMA-aware mathematical function memorization. We performed an experimental assessment on <small>hero</small>, a heterogeneous platform employing <small>risc-v</small> cores as a parallel accelerator. Our solution enables speedups ranging from <inline-formula><tex-math>$1.5boldsymbol{times}$</tex-math></inline-formula> to <inline-formula><tex-math>$51.1boldsymbol{times}$</tex-math></inline-formula> on AxBench benchmarks that employ trigonometrical functions and <inline-formula><tex-math>$4.23-48.4boldsymbol{times}$</tex-math></inline-formula> on Polybench benchmarks over the baseline <small>hero</small> platform.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 9","pages":"3168-3180"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831908","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xiaosong Peng;Laurence T. Yang;Xiaokang Wang;Debin Liu;Jie Li
Canonical Polyadic decomposition (CPD) obtains a low-rank approximation of high-order multidimensional tensors as the summation of a sequence of rank-one tensors, greatly reducing storage and computation overhead. It is increasingly being used in the lightweight design of artificial intelligence and big data processing. Existing CPD techniques exhibit inherent limitations in simultaneously achieving high accuracy and high efficiency. In this paper, a heterogeneous computing method for CPD is proposed to optimize computing efficiency with guaranteed convergence accuracy. Specifically, a quasi-convex decomposition loss function is constructed and the extreme points of the Kruskal matrix rows are solved. Further, the massively parallelizable operators in the algorithm are extracted, a software-hardware integrated scheduling method is designed, and the deployment of CPD on heterogeneous computing platforms is achieved. Finally, the memory access strategy is optimized to improve memory access efficiency. We tested the algorithm on real-world and synthetic sparse tensor datasets; numerical experimental results show that, compared with the state-of-the-art method, the proposed method achieves higher convergence accuracy and computing efficiency. Compared to the standard CPD parallel library, the method achieves efficiency improvements of tens to hundreds of times while maintaining the same accuracy.
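For readers who want the decomposition itself to be concrete, the sketch below implements textbook CP decomposition by alternating least squares (CP-ALS) on a dense 3-way tensor with numpy and recovers a synthetic rank-3 tensor, i.e., the "summation of rank-one tensors" mentioned above. It is the plain baseline, with none of the paper's quasi-convex loss, heterogeneous scheduling, or sparse-tensor optimizations.

import numpy as np

def khatri_rao(U, V):
    # Column-wise Kronecker product: row (i * V.shape[0] + j) holds U[i, :] * V[j, :].
    return (U[:, None, :] * V[None, :, :]).reshape(-1, U.shape[1])

def unfold(X, mode):
    # Mode-n matricization matching the Khatri-Rao ordering used above.
    return np.moveaxis(X, mode, 0).reshape(X.shape[mode], -1)

def cp_als(X, rank, iters=50, seed=0):
    """Alternating least squares for a 3-way CPD: cycle through the factor matrices,
    solving a linear least-squares problem for each while the other two are fixed."""
    rng = np.random.default_rng(seed)
    A, B, C = (rng.standard_normal((s, rank)) for s in X.shape)
    for _ in range(iters):
        A = unfold(X, 0) @ khatri_rao(B, C) @ np.linalg.pinv((B.T @ B) * (C.T @ C))
        B = unfold(X, 1) @ khatri_rao(A, C) @ np.linalg.pinv((A.T @ A) * (C.T @ C))
        C = unfold(X, 2) @ khatri_rao(A, B) @ np.linalg.pinv((A.T @ A) * (B.T @ B))
    return A, B, C

# Recover a synthetic rank-3 tensor built as a sum of three rank-one tensors.
rng = np.random.default_rng(1)
A0, B0, C0 = rng.random((6, 3)), rng.random((7, 3)), rng.random((8, 3))
X = np.einsum('ir,jr,kr->ijk', A0, B0, C0)
A, B, C = cp_als(X, rank=3)
approx = np.einsum('ir,jr,kr->ijk', A, B, C)
print(np.linalg.norm(X - approx) / np.linalg.norm(X))  # relative error, small for an exact low-rank tensor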
{"title":"A High-Efficiency Parallel Mechanism for Canonical Polyadic Decomposition on Heterogeneous Computing Platform","authors":"Xiaosong Peng;Laurence T. Yang;Xiaokang Wang;Debin Liu;Jie Li","doi":"10.1109/TC.2025.3587623","DOIUrl":"https://doi.org/10.1109/TC.2025.3587623","url":null,"abstract":"Canonical Polyadic decomposition (CPD) obtains the low-rank approximation for high-order multidimensional tensors through the summation of a sequence of rank-one tensors, greatly reducing storage and computation overhead. It is increasingly being used in the lightweight design of artificial intelligence and big data processing. The existing CPD technology exhibits inherent limitations in simultaneously achieving high accuracy and high efficiency. In this paper, a heterogeneous computing method for CPD is proposed to optimize computing efficiency with guaranteed convergence accuracy. Specifically, a quasi-convex decomposition loss function is constructed and the extreme points of the Kruskal matrix rows have been solved. Further, the massively parallelized operators in the algorithm are extracted, a software-hardware integrated scheduling method is designed, and the deployment of CPD on heterogeneous computing platforms is achieved. Finally, the memory access strategy is optimized to improve memory access efficiency. We tested the algorithm on real-world and synthetic sparse tensor datasets, numerical experimental results show that compared with the state-of-the-art method, the proposed method has a higher convergence accuracy and computing efficiency. Compared to the standard CPD parallel library, the method achieves efficiency improvements of tens to hundreds of times while maintaining the same accuracy.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 10","pages":"3377-3389"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145061974","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}