
Latest Publications: IEEE Transactions on Computers

A High-Efficiency Parallel Mechanism for Canonical Polyadic Decomposition on Heterogeneous Computing Platform
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587623
Xiaosong Peng;Laurence T. Yang;Xiaokang Wang;Debin Liu;Jie Li
Canonical Polyadic decomposition (CPD) obtains a low-rank approximation of high-order multidimensional tensors as the sum of a sequence of rank-one tensors, greatly reducing storage and computation overhead. It is increasingly used in the lightweight design of artificial intelligence and in big data processing. Existing CPD techniques exhibit inherent limitations in simultaneously achieving high accuracy and high efficiency. In this paper, a heterogeneous computing method for CPD is proposed to optimize computing efficiency while guaranteeing convergence accuracy. Specifically, a quasi-convex decomposition loss function is constructed and the extreme points of the Kruskal matrix rows are solved. Further, the massively parallelizable operators in the algorithm are extracted, a software-hardware integrated scheduling method is designed, and CPD is deployed on heterogeneous computing platforms. Finally, the memory access strategy is optimized to improve memory access efficiency. We tested the algorithm on real-world and synthetic sparse tensor datasets; numerical results show that, compared with state-of-the-art methods, the proposed method achieves higher convergence accuracy and computing efficiency. Compared to a standard CPD parallel library, the method achieves efficiency improvements of tens to hundreds of times while maintaining the same accuracy.
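For orientation, the kernel that heterogeneous CPD schemes accelerate is the alternating update of the Kruskal factor matrices. The sketch below shows one plain CP-ALS sweep for a dense 3-way tensor in NumPy; it is a minimal illustration of that baseline kernel, not the paper's quasi-convex loss or its software-hardware scheduling, and the dense representation and function names are our own.

```python
# A minimal sketch of one CP-ALS sweep for a dense 3-way tensor in NumPy;
# the classical baseline kernel, not the paper's quasi-convex variant.
import numpy as np

def khatri_rao(a, b):
    """Column-wise Kronecker product of factor matrices (I x R, J x R)."""
    r = a.shape[1]
    return (a[:, None, :] * b[None, :, :]).reshape(-1, r)

def cp_als_sweep(x, factors):
    """Update each Kruskal factor once by alternating least squares."""
    r = factors[0].shape[1]
    for n in range(3):
        others = [factors[m] for m in range(3) if m != n]  # increasing mode order
        xn = np.moveaxis(x, n, 0).reshape(x.shape[n], -1)  # mode-n unfolding
        kr = khatri_rao(others[0], others[1])              # matches unfolding layout
        gram = np.ones((r, r))
        for f in others:
            gram *= f.T @ f                                # Hadamard of Gram matrices
        # The MTTKRP (xn @ kr) dominates the cost; it is the natural
        # target for the massive parallelization the paper describes.
        factors[n] = xn @ kr @ np.linalg.pinv(gram)
    return factors

rng = np.random.default_rng(0)
x = rng.normal(size=(6, 5, 4))
factors = [rng.normal(size=(d, 3)) for d in x.shape]
for _ in range(20):
    factors = cp_als_sweep(x, factors)
```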
Citations: 0
Reliable and Efficient Multi-Path Transmission Based on Disjoint Paths in Data Center Networks
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587618
Weibei Fan;Yao Pan;Fu Xiao;Mengjie Lv;Lei Han;Shui Yu
Multi-path transmission enables load balancing and improves network performance in data center networks (DCNs). However, the uneven distribution of network traffic in data centers increases the possibility of network congestion and makes traditional traffic engineering methods inefficient. In this paper, we present a reliable and efficient Disjoint paths based Multi-Path Transmission scheme (DMPT) that distributes requests through topology awareness. Firstly, we propose disjoint-path construction algorithms with rigorous theoretical proofs, targeting the differing transmission requirements of DCNs. Secondly, we offer an optimal solution to the disjoint multi-path selection problem that trades off link load against transmission time. Furthermore, DMPT can split a flow over multiple transmission paths based on link status. Finally, extensive experiments evaluate DMPT on EHDC, a novel DCN topology based on the exchanged hypercube. The experimental results show that DMPT reduces the average running time by 18.6% and produces paths whose average length is close to optimal. Furthermore, it achieves significant improvements in balancing network link traffic and facilitating deployment, reflecting the practical advantages of topology-aware multiplexing.
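The primitive underlying disjoint-path schemes is extracting multiple paths that share no links between two hosts. The sketch below is a generic greedy version (take a shortest path, remove its edges, repeat); it is illustrative only, not DMPT's proof-backed construction on the exchanged hypercube, and a max-flow formulation would find the exact maximum number of disjoint paths where greedy removal may fall short.

```python
# Greedy edge-disjoint path extraction: BFS a shortest path, consume its
# links, repeat. A toy illustration, not DMPT's EHDC construction.
from collections import deque

def bfs_path(adj, src, dst):
    """Shortest path by BFS; returns a node list or None."""
    prev = {src: None}
    queue = deque([src])
    while queue:
        u = queue.popleft()
        if u == dst:
            path = []
            while u is not None:
                path.append(u)
                u = prev[u]
            return path[::-1]
        for v in adj[u]:
            if v not in prev:
                prev[v] = u
                queue.append(v)
    return None

def edge_disjoint_paths(edges, src, dst):
    adj = {}
    for u, v in edges:
        adj.setdefault(u, set()).add(v)
        adj.setdefault(v, set()).add(u)
    if src not in adj or dst not in adj:
        return []
    paths = []
    while (p := bfs_path(adj, src, dst)) is not None:
        paths.append(p)
        for a, b in zip(p, p[1:]):  # consume the path's links
            adj[a].discard(b)
            adj[b].discard(a)
    return paths

edges = [(0, 1), (1, 3), (0, 2), (2, 3), (0, 3)]
print(edge_disjoint_paths(edges, 0, 3))  # three link-disjoint paths from 0 to 3
```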
Citations: 0
Hybrid Redundancy for Reliable Task Offloading in Collaborative Edge Computing
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587620
Hao Guo;Lei Yang;Qingfeng Zhang;Jiannong Cao
Collaborative edge computing enables task execution on the computing resources of geo-distributed edge nodes. A key challenge in this field is reliable task offloading: deciding whether to execute tasks locally or delegate them to neighboring nodes while ensuring that tasks complete successfully. Reliable task offloading is essential for preventing task failures and maintaining optimal system performance. Existing works commonly rely on task redundancy strategies, such as active or passive redundancy. However, these approaches lack adaptive redundancy mechanisms that respond to changes in the network environment, potentially wasting resources through excessive redundancy or failing tasks through insufficient redundancy. In this work, we introduce a novel approach called Hybrid Redundancy for Task Offloading (HRTO) to optimize task latency and reliability. Specifically, HRTO utilizes deep reinforcement learning (DRL) to learn a task offloading policy that maximizes task success rates. With this policy, edge nodes dynamically adjust task redundancy levels based on real-time network load conditions and, in case of task failure, assess whether a task instance needs to be re-executed. Extensive experiments on real-world network topologies and a Kubernetes-based testbed evaluate the effectiveness of HRTO, showing a 14.6% increase in success rate over the benchmarks.
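To see why adaptive redundancy beats a fixed level, consider the smallest replica count that meets a reliability target when each replica succeeds independently with probability p. The closed form, the independence assumption, and the numbers below are purely illustrative; HRTO learns this trade-off with DRL rather than computing it analytically.

```python
# Back-of-envelope: smallest replica count k with 1 - (1-p)^k >= target,
# assuming independent per-replica success probability p (illustration only).
def min_replicas(p_success: float, target: float, k_max: int = 8) -> int:
    for k in range(1, k_max + 1):
        if 1.0 - (1.0 - p_success) ** k >= target:
            return k
    return k_max

# Under light load a replica rarely fails, so few copies suffice; under
# heavy load the same target needs more copies:
print(min_replicas(0.99, 0.999))  # -> 2
print(min_replicas(0.80, 0.999))  # -> 5
```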
协作边缘计算支持在地理分布式边缘节点的计算资源上执行任务。如何在保证任务可靠性的前提下,决定任务是在本地执行还是委托给邻近节点,从而实现可靠的任务卸载是该领域的关键挑战之一。实现可靠的任务卸载对于防止任务失败和保持最佳系统性能至关重要。现有的工作通常依赖于任务冗余策略,如主动冗余或被动冗余。然而,这些方法缺乏自适应冗余机制来应对网络环境的变化,可能导致冗余过多造成资源浪费或冗余不足导致任务失败。在这项工作中,我们引入了一种称为任务卸载混合冗余(HRTO)的新方法来优化任务延迟和可靠性。具体来说,HRTO利用深度强化学习(DRL)来学习任务卸载策略,以最大限度地提高任务成功率。通过该策略,边缘节点可以根据实时网络负载情况动态调整任务冗余级别,同时评估任务失败时是否需要重新执行任务实例。在真实网络拓扑和基于kubernetes的测试平台上进行的大量实验评估了HRTO的有效性,显示成功率比基准测试提高了14.6%。
{"title":"Hybrid Redundancy for Reliable Task Offloading in Collaborative Edge Computing","authors":"Hao Guo;Lei Yang;Qingfeng Zhang;Jiannong Cao","doi":"10.1109/TC.2025.3587620","DOIUrl":"https://doi.org/10.1109/TC.2025.3587620","url":null,"abstract":"Collaborative edge computing enables task execution on the computing resources of geo-distributed edge nodes. One of the key challenges in this field is to realize reliable task offloading by deciding whether to execute tasks locally or delegate them to neighboring nodes while ensuring task reliability. Achieving reliable task offloading is essential for preventing task failures and maintaining optimal system performance. Existing works commonly rely on task redundancy strategies, such as active or passive redundancy. However, these approaches lack adaptive redundancy mechanisms to respond to changes in the network environment, potentially resulting in resource wastage from excessive redundancy or task failures due to insufficient redundancy. In this work, we introduce a novel approach called Hybrid Redundancy for Task Offloading (HRTO) to optimize task latency and reliability. Specifically, HRTO utilizes deep reinforcement learning (DRL) to learn a task offloading policy that maximizes task success rates. With this policy, edge nodes dynamically adjust task redundancy levels based on real-time network load conditions and meanwhile assess whether the task instance is necessary for re-execution in case of task failure. Extensive experiments on real-world network topologies and a Kubernetes-based testbed evaluate the effectiveness of HRTO, showing a 14.6% increase in success rate over the benchmarks.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"74 9","pages":"3238-3250"},"PeriodicalIF":3.8,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144814149","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Scavenger+: Revisiting Space-Time Tradeoffs in Key-Value Separated LSM-Trees
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587513
Jianshun Zhang;Fang Wang;Jiaxin Ou;Yi Wang;Ming Zhao;Sheng Qiu;Junxun Huang;Baoquan Li;Peng Fang;Dan Feng
Key-Value Stores (KVS) based on log-structured merge-trees (LSM-trees) are widely used in storage systems but face significant challenges, such as the high write amplification caused by compaction. KV-separated LSM-trees address write amplification but introduce significant space amplification, a critical concern in cost-sensitive scenarios. Garbage collection (GC) can reduce space amplification, but existing strategies are often inefficient and fail to account for workload characteristics. Moreover, current KV-separated LSM-trees overlook the space amplification caused by the index LSM-tree. In this paper, we systematically analyze the sources of space amplification in KV-separated LSM-trees and propose Scavenger+, which achieves a better performance-space trade-off. Scavenger+ introduces (1) an I/O-efficient garbage collection scheme to reduce I/O overhead, (2) a space-aware compaction strategy based on compensated size to mitigate index-induced space amplification, and (3) a dynamic GC scheduler that adapts to system load to make better use of CPU and storage resources. Extensive experiments demonstrate that Scavenger+ significantly improves write performance and reduces space amplification compared to state-of-the-art KV-separated LSM-trees, including BlobDB, Titan, and TerarkDB.
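For intuition, value-log GC in KV-separated stores boils down to choosing which segment to relocate next. A common heuristic, sketched below with invented field names, scores segments by garbage reclaimed per byte of live data rewritten; Scavenger+'s actual policy additionally folds in I/O efficiency, compensated size, and system load, none of which this toy models.

```python
# Toy value-log GC victim selection: prefer the segment that reclaims the
# most dead space per byte of live data that must be rewritten.
# Field names are invented for illustration; not Scavenger+'s scoring.
from dataclasses import dataclass

@dataclass
class Segment:
    seg_id: int
    total_bytes: int   # segment size on disk
    dead_bytes: int    # bytes of overwritten/deleted values

    @property
    def score(self) -> float:
        live = self.total_bytes - self.dead_bytes
        return self.dead_bytes / max(live, 1)  # reclaimed per byte rewritten

def pick_gc_victim(segments):
    return max(segments, key=lambda s: s.score)

segs = [Segment(0, 64 << 20, 8 << 20),
        Segment(1, 64 << 20, 48 << 20),
        Segment(2, 64 << 20, 20 << 20)]
print(pick_gc_victim(segs).seg_id)  # -> 1: mostly garbage, cheap to rewrite
```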
Citations: 0
Automatic Generation of System-Level Test for Un-Core Logic of Large Automotive SoC
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587515
Francesco Angione;Paolo Bernardi;Giusy Iaria;Claudia Bertani;Vincenzo Tancorre
Traditional structural tests are powerful automatic approaches for capturing faulty behavior in integrated circuits. Besides the ease of generating test patterns, structural methods are known to cover a vast but incomplete spectrum of the possible faults in a System-on-Chip (SoC). A new step, called System-Level Test (SLT), has been added to the manufacturing test flow to fill the gaps left by structural tests; it resembles the final workload and environment. This work illustrates how to build an automated generation engine that synthesizes SLT programs to attack structural-test weaknesses from both a holistic and an analytical perspective. The methodology targets the crossbar module, one of the most critical areas of the SoC, while simultaneously creating a ripple effect across the un-core logic. Experimental results are reported on an automotive SoC manufactured by STMicroelectronics.
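As a toy illustration of the flavor of such generation, the snippet below emits a schedule of concurrent master-to-slave transactions so that every crossbar pairing is exercised repeatedly. The topology names and the shuffling strategy are placeholders; this is not the generation engine described in the paper.

```python
# Illustrative crossbar stress schedule: exercise every master/slave
# pairing several times in randomized order (placeholder names).
import itertools
import random

def crossbar_schedule(masters, slaves, rounds=4, seed=0):
    rng = random.Random(seed)
    pairs = list(itertools.product(masters, slaves))
    schedule = []
    for _ in range(rounds):
        rng.shuffle(pairs)       # vary the interleaving each round
        schedule.extend(pairs)
    return schedule

for m, s in crossbar_schedule(["cpu0", "cpu1", "dma"], ["sram", "flash"])[:6]:
    print(f"{m} -> {s}")
```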
Citations: 0
Towards a Unified Framework for Modeling and Analyzing User-Defined Online Non-Preemptive Scheduling Policies
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587514
Pourya Gohari;Jeroen Voeten;Mitra Nasri
This paper presents a unified formal framework, called ReTA, that allows users to define scheduling problems in a user-friendly domain-specific language (DSL) and automatically obtain job response times in return. ReTA supports user-defined online scheduling policies (beyond work-conserving or priority-based scheduling) for heterogeneous computing resource types with multiple instances per type (e.g., multiple CPU cores, GPUs, DSPs, and FPGAs on a single chip), thus supporting global, partitioned, and clustered scheduling. The current version of ReTA focuses on non-preemptive periodic tasks, as these are susceptible to scheduling anomalies and hence harder to analyze. ReTA performs response-time analysis by constructing a timed labeled transition system (TLTS) from the domain model as the basis for a reachability analysis enriched with efficient state-space reduction techniques. Our empirical evaluations show that ReTA identifies up to 50 times more schedulable task sets than fixed-point iteration-based analyses. With a runtime on the order of a few minutes, ReTA produces highly accurate results two orders of magnitude faster than an exact Timed-Automata-based analysis in UPPAAL (e.g., for systems with 16 cores and 32 tasks).
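For context, the quantity such a framework computes is the worst-case response time of each job under a user-defined policy. The toy simulator below measures observed response times for non-preemptive fixed-priority scheduling of periodic tasks on one core. Unlike ReTA's exhaustive reachability analysis over a timed labeled transition system, simulating a single trace is generally not a safe bound, so treat this purely as an illustration of the metric.

```python
# Toy discrete-time simulator: non-preemptive fixed-priority scheduling of
# periodic tasks on one core, reporting observed worst-case response times.
from dataclasses import dataclass

@dataclass
class Task:
    period: int
    wcet: int
    prio: int  # lower value = higher priority

def simulate(tasks, horizon):
    ready, running, finish_at = [], None, 0
    wcrt = [0] * len(tasks)
    for t in range(horizon):
        for i, tk in enumerate(tasks):
            if t % tk.period == 0:
                ready.append((tk.prio, t, i))       # release a job at time t
        if running is not None and t == finish_at:
            running = None                          # current job completes
        if running is None and ready:
            ready.sort()                            # highest priority first
            _, release, i = ready.pop(0)
            running, finish_at = i, t + tasks[i].wcet
            wcrt[i] = max(wcrt[i], finish_at - release)  # runs to completion
    return wcrt

print(simulate([Task(period=5, wcet=2, prio=0),
                Task(period=7, wcet=3, prio=1)], horizon=35))  # -> [4, 5]
```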
Citations: 0
Trajectory Optimization and Power Allocation for Multi-UAV Wireless Networks: A Communication-Based Multi-Agent Deep Reinforcement Learning Approach
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-10 · DOI: 10.1109/TC.2025.3587976
Zimeng Yuan;Yuanguo Bi;Yanbo Fan;Yuheng Liu;Lianbo Ma;Liang Zhao;Qiang He
Uncrewed Aerial Vehicles (UAVs) play a crucial role in next-generation mobile communication systems, serving as aerial base stations when ground base stations fail to meet coverage requirements. However, trajectory planning and power allocation for collaborative UAVs acting as Aerial Base Stations (UAV-ABSs) face several challenges, including energy limitations, flight time constraints, high optimization complexity due to dynamic environment interactions, and insufficient decision-making information. To address these challenges, this paper proposes a multi-agent reinforcement learning algorithm, the Communication Actor Centralized Attention Critic (CATEN), to jointly optimize the flight trajectories and power allocation strategies of UAV-ABSs. The algorithm aims to maximize the number of users whose Quality of Service (QoS) requirements are met while minimizing UAV-ABS energy consumption. To achieve this, an information sharing mechanism is first designed to improve collaboration efficiency among UAV-ABSs; it leverages distributed storage, intelligent scheduling of UAV-ABS interaction experiences, and gating units to enhance information screening and fusion. Second, a multi-head attention critic network is proposed to capture correlations among UAV-ABSs across different subspaces, allowing the network to prioritize value information, reduce redundancy, and strengthen UAV-ABS collaboration and decision-making. Simulation results demonstrate that CATEN serves more users at lower energy consumption than existing algorithms, exhibiting good robustness and adaptability in dynamic environments.
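The building block behind a multi-head attention critic like CATEN's is ordinary scaled dot-product attention over per-agent embeddings. A single-head NumPy sketch is shown below; the single head and the shapes are simplifications of the multi-head, multi-subspace critic described above.

```python
# Single-head scaled dot-product attention over agent embeddings: each
# agent's query attends to the other agents' keys and mixes their values.
# A simplified sketch, not CATEN's full critic.
import numpy as np

def attention(query, keys, values):
    """query: (d,); keys, values: (n_agents, d) -> attended context (d,)."""
    d = query.shape[-1]
    scores = keys @ query / np.sqrt(d)   # relevance of each agent
    weights = np.exp(scores - scores.max())
    weights /= weights.sum()             # softmax over agents
    return weights @ values              # weighted mix of agent values

rng = np.random.default_rng(0)
q = rng.normal(size=4)
ks, vs = rng.normal(size=(3, 4)), rng.normal(size=(3, 4))
print(attention(q, ks, vs))
```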
Citations: 0
UKFaaS: Lightweight, High-Performance and Secure FaaS Communication With Unikernel
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-04 · DOI: 10.1109/TC.2025.3586031
Zhenqian Chen;Yuchun Zhan;Peng Hu;Xinkui Zhao;Muyu Yang;Siwei Tan;Lufei Zhang;Liqiang Lu;Jianwei Yin;Zuoning Chen
Unikernels are a promising runtime for serverless computing thanks to their lightweight and isolated architecture, offering applications a secure and efficient environment. However, popular serverless frameworks like Knative introduce heavyweight sidecar components to assist function-instance deployment in a non-intrusive manner; the sidecar not only hinders the throughput of unikernel function services but also consumes excessive memory. Moreover, the intricate network communication pathways among services pose significant challenges for deploying unikernels in production serverless environments. Shared-memory communication on the same server can resolve the communication bottleneck of unikernel-based function instances, but malicious programs on the server can render the shared memory untrustworthy, which limits the deployment of such techniques. We propose UKFaaS, a lightweight, high-performance serverless framework. UKFaaS leverages the advantages of customized operating systems through unikernels and non-intrusively integrates sidecar functionality into the unikernel, avoiding the overhead of sidecar request forwarding. Additionally, UKFaaS implements VMFUNC-based data communication between unikernels on the same server, eliminating VM-Exit bottlenecks in remote procedure calls (RPCs) without relying on memory sharing. Preliminary experimental results indicate that UKFaaS achieves 1.8× to 3.5× the requests per second (RPS) of the advanced serverless systems FaasFlow, UaaF, and Nightcore on the Google Online Boutique microservice benchmark.
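A back-of-envelope model shows why the extra sidecar hop caps throughput: every request pays the forwarding latency on top of the handler time, so per-instance RPS is bounded by their sum. All numbers below are invented for illustration and are not measurements from the paper.

```python
# Toy latency-budget model of the sidecar tax on per-instance throughput.
# Latencies are invented for illustration only.
def rps(handler_us: float, forward_us: float = 0.0) -> float:
    """Requests per second when each request costs handler + forwarding time."""
    return 1e6 / (handler_us + forward_us)

print(f"direct : {rps(200):.0f} req/s")                   # in-unikernel dispatch
print(f"sidecar: {rps(200, forward_us=150):.0f} req/s")   # extra proxy hop
```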
Citations: 0
COFFA: A Co-Design Framework for Fused-Grained Reconfigurable Architecture Towards Efficient Irregular Loop Handling
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-02 · DOI: 10.1109/TC.2025.3585345
Yuan Dai;Xuchen Gao;Yunhui Qiu;Jingyuan Li;Yuhang Cao;Yiqing Mao;Sichao Chen;Wenbo Yin;Wai-Shing Luk;Lingli Wang
Coarse-Grained Reconfigurable Architectures (CGRAs) have emerged as competitive accelerators due to their high flexibility and energy efficiency. However, most CGRAs are effective for computation-intensive applications with regular loops but struggle with irregular loops containing control flow. These loops introduce fine-grained logic operations that are costly to execute on a CGRA's coarse-grained arithmetic units. Handling such logic operations efficiently calls for Boolean algebra optimization, which can improve logic density and reduce logic depth; unfortunately, no previous work has incorporated it into the compilation flow to support irregular loops efficiently. We propose COFFA, an open-source framework for a heterogeneous architecture comprising a RISC-V CPU and a fused-grained reconfigurable accelerator that integrates coarse-grained arithmetic units and fine-grained logic units, along with flexible IO units and distributed interconnects. As a software/hardware co-design framework, COFFA provides a powerful compiler that extracts and optimizes fine-grained logic operations from irregular loops, performs coarse-grained arithmetic and memory optimizations, and offloads the loops to the accelerator. Across various challenging benchmarks with irregular loops, COFFA achieves significant performance and energy-efficiency improvements over an in-order RISC-V CPU, an out-of-order RISC-V CPU, and a recent FPGA, respectively. Moreover, compared with the state-of-the-art CGRAs UE-CGRA and Hycube, COFFA achieves 2.5× and 3.5× performance gains, respectively.
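One concrete payoff of the logic-depth reduction that Boolean algebra optimization enables: a chain of two-input gates computing an n-input AND has linear depth, while the balanced tree computing the same function has logarithmic depth, which matters when control-flow predicates map onto fine-grained logic units. The snippet below merely tabulates that gap; it is a toy, not COFFA's optimizer.

```python
# Depth of an n-input AND as a gate chain vs. a balanced gate tree.
import math

def chain_depth(n_inputs: int) -> int:
    return n_inputs - 1                      # a AND b AND c AND ... left to right

def tree_depth(n_inputs: int) -> int:
    return math.ceil(math.log2(n_inputs))    # pairwise, then pairwise again

for n in (4, 8, 16):
    print(f"{n:>2} inputs: chain depth {chain_depth(n)}, tree depth {tree_depth(n)}")
```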
Citations: 0
FlashDecoding++Next: High Throughput LLM Inference With Latency and Memory Optimization
IF 3.8 · CAS Tier 2 (Computer Science) · Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-07-02 · DOI: 10.1109/TC.2025.3585339
Guohao Dai;Ke Hong;Qiuli Mao;Xiuhong Li;Jiaming Xu;Haofeng Huang;Hongtu Xia;Xuefei Ning;Shengen Yan;Yun Liang;Yu Wang
As the Large Language Model (LLM) becomes increasingly important in various domains, the performance of LLM inference is crucial to massive LLM applications. However, centering on computational efficiency and memory utilization, the following challenges remain unsolved in achieving high-throughput LLM inference: (1) Synchronous partial softmax update. The softmax operation requires a synchronous update across partial softmax results, adding roughly 20% overhead to the attention computation in LLMs. (2) Under-utilized computation of flat GEMM. The matrices in LLM-inference GEMMs tend to be flat, leading to under-utilized computation and 50% performance loss after zero-padding in previous designs (e.g., cuBLAS, CUTLASS). (3) Memory redundancy caused by activations. Dynamic allocation of activations during inference leads to redundant storage of useless variables, bringing 22% more memory consumption. We present FlashDecoding++Next, a high-throughput inference engine supporting mainstream LLMs and hardware backends. To tackle the above challenges, FlashDecoding++Next proposes: (1) Asynchronous softmax with a unified maximum. FlashDecoding++Next introduces a unified-maximum technique for the different partial softmax computations to avoid synchronization. Based on this, a fine-grained pipeline is proposed, yielding 1.18× and 1.14× speedups for the prefill and decode phases of LLM inference, respectively. (2) Flat GEMM optimization with double buffering. FlashDecoding++Next points out that flat GEMMs with different shapes face varied bottlenecks; techniques like double buffering then deliver up to 52% speedup for the flat GEMM operation. (3) Buffer reusing and unified memory management. FlashDecoding++Next reuses pre-allocated activation buffers throughout the inference process to remove redundancy and unifies the management of different types of storage to further exploit reuse opportunities. The memory optimization enables up to 1.57× longer sequences to be processed. FlashDecoding++Next demonstrates remarkable throughput improvement, delivering up to 68.88× higher throughput than the HuggingFace [1] implementation. On average, FlashDecoding++Next achieves 1.25× and 1.46× higher throughput than vLLM [2] and TensorRT-LLM [3] on mainstream LLMs.
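The unified-maximum trick rests on a simple identity: softmax is invariant to subtracting any shared constant φ from all scores, so if φ upper-bounds the scores well enough to avoid overflow, each partial softmax block can be computed independently and merged without a synchronizing max pass. A NumPy sketch of the idea under that assumption (φ chosen as a fixed bound; the block layout is simplified) is shown below.

```python
# Partial softmax with one shared constant phi instead of a synchronized
# running max: blocks are independent and merge with a single division.
# phi must upper-bound the scores to avoid overflow; a simplified sketch.
import numpy as np

def softmax_unified_max(scores, block, phi):
    num_parts, den_parts = [], []
    for start in range(0, len(scores), block):   # each block is independent
        s = scores[start:start + block]
        e = np.exp(s - phi)                      # shared constant, no sync
        num_parts.append(e)
        den_parts.append(e.sum())
    return np.concatenate(num_parts) / sum(den_parts)

x = np.random.default_rng(0).normal(size=10)
ref = np.exp(x - x.max())
ref /= ref.sum()
assert np.allclose(softmax_unified_max(x, block=4, phi=6.0), ref)
```

Because the result is exact for any valid φ, the only cost of the asynchronous variant is numerical headroom, which is why a bound that is not the true maximum still works.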
Citations: 0