
Latest Publications in IEEE Transactions on Parallel and Distributed Systems

PreTrans: Enabling Efficient CGRA Multi-Task Context Switch Through Config Pre-Mapping and Data Transceiving
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-09-02 | DOI: 10.1109/TPDS.2025.3604815
Yufei Yang;Chenhao Xie;Liansheng Liu;Xiyuan Peng;Yu Peng;Hailong Yang;Depei Qian
Dynamic resource allocation guarantees the performance of CGRA multi-tasking, but it exposes the CGRA architecture to a wide range of incompatible contexts (configs and data). Traditional context-switch approaches, including online config transformation and data reloading, can significantly block a task from processing inputs under a new resource allocation decision, limiting task throughput. Online config transformation can be avoided if compatible configs are prepared through offline pre-mapping, but traditional CGRA mappers need days to achieve comprehensive pre-mapping of acceptable quality. Likewise, online data reloading can be eliminated through memory sharing, but the traditional arbiter-based approach struggles to trade off physical complexity against memory access parallelism. PreTrans is the first system design to achieve efficient CGRA multi-task context switching. First, PreTrans avoids online config transformation with a software incremental pre-mapper, which reuses previously finished pre-mapping results to dramatically accelerate the pre-mapping of subsequent resource allocation decisions with negligible loss in mapping quality. Second, PreTrans replaces the traditional arbiter with a hardware data transceiver that better supports the memory sharing that eliminates data reloading, allowing each tile to possess an individual memory that maximizes access parallelism without significant physical overhead. The overall evaluation demonstrates that PreTrans achieves a 1.13×–2.46× throughput improvement in pipelined and parallel multi-task scenarios and reaches the target throughput immediately after a new resource allocation decision takes effect. An ablation study further shows that the pre-mapper is more than three orders of magnitude faster than a traditional CGRA mapper while retaining more than 99% of the optimal mapping quality, and that the data transceiver introduces only 9.02% hardware area overhead on a 16 × 16 CGRA.
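To make the incremental pre-mapping idea concrete, here is a minimal Python sketch of reusing a previous allocation's placements and mapping only the remainder. The function names (greedy_place, incremental_premap) and the round-robin placement rule are illustrative assumptions, not the paper's mapper.

```python
# Sketch: reuse placements from an earlier resource-allocation decision and
# map only the operations that are not yet placed on the new tile region.

def greedy_place(ops, tiles, fixed=None):
    """Place each op on a tile round-robin, keeping any still-valid fixed placements."""
    placement = dict(fixed or {})
    free_tiles = list(tiles)
    i = 0
    for op in ops:
        if op in placement and placement[op] in tiles:
            continue                      # reuse placement from a previous run
        placement[op] = free_tiles[i % len(free_tiles)]
        i += 1
    return placement

def incremental_premap(ops, new_tiles, prev_placement):
    # Keep every previous placement that still falls inside the new region,
    # then place only the remaining ops; this is what makes pre-mapping a
    # follow-up allocation decision cheap compared to mapping from scratch.
    kept = {op: t for op, t in prev_placement.items() if t in new_tiles}
    return greedy_place(ops, new_tiles, fixed=kept)

ops = ["ld0", "mul0", "add0", "st0"]
first = greedy_place(ops, tiles=[(0, 0), (0, 1)])
second = incremental_premap(ops, new_tiles=[(0, 0), (0, 1), (1, 0)], prev_placement=first)
print(first, second, sep="\n")
```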
Citations: 0
A Sparse Function Prediction Approach for Cold Start Optimization and User Satisfaction Guarantee in Serverless
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-25 | DOI: 10.1109/TPDS.2025.3602440
Wang Zhang;Yuyang Zhu;Zhan Shi;Manyu Dang;Yutong Wu;Fang Wang;Dan Feng
Serverless computing relies on keeping functions alive or pre-warming them before invocation to mitigate the cold start problem, stemming from the overhead of initializing function startup environments. However, under constrained cloud resources, accurately predicting the invocation patterns of sparse functions remains challenging. This limits the formulation of effective pre-warm and keep-alive strategies, leading to frequent cold starts and degraded user satisfaction. To address these challenges, we propose SPFaaS, a hybrid framework based on sparse function prediction. To enhance the learnability of sparse function invocation data, SPFaaS takes into account the characteristics of cloud service workloads along with the features of pre-warm and keep-alive strategies, transforming function invocation records into probabilistic data. It captures the underlying periodicity and temporal dependencies in the data through multiple rounds of sampling and the combined use of Gated Recurrent Units and Temporal Convolutional Networks for accurate prediction. Based on the final prediction outcome and real-time system states, SPFaaS determines adaptive pre-warm and keep-alive strategies for each function. Experiments conducted on two real-world serverless clusters demonstrate that SPFaaS outperforms state-of-the-art methods in reducing cold starts and improving user satisfaction.
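A toy sketch of the kind of probabilistic transformation described above, assuming a simple fixed look-ahead window; the actual SPFaaS transformation and its GRU/TCN predictor are not reproduced here.

```python
# Turn a sparse per-minute invocation series into per-minute probabilities,
# which is the general idea of making sparse functions learnable. The window
# length and the definition of the probability are assumptions for illustration.

def to_probabilities(invocations, window=5):
    """For each minute, the fraction of the next `window` minutes with >= 1 call."""
    probs = []
    for t in range(len(invocations)):
        future = invocations[t:t + window]
        probs.append(sum(1 for c in future if c > 0) / len(future))
    return probs

counts = [0, 0, 3, 0, 0, 0, 1, 0, 0, 0, 0, 2]   # sparse invocation record
print(to_probabilities(counts))                 # would feed the sequence predictor
```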
Citations: 0
Mapping Large-Scale Spiking Neural Network on Arbitrary Meshed Neuromorphic Hardware
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-25 | DOI: 10.1109/TPDS.2025.3601993
Ouwen Jin;Qinghui Xing;Zhuo Chen;Ming Zhang;De Ma;Ying Li;Xin Du;Shuibing He;Shuiguang Deng;Gang Pan
Neuromorphic hardware systems—designed as 2D-mesh structures with parallel neurosynaptic cores—have proven highly efficient at executing large-scale spiking neural networks (SNNs). A critical challenge, however, lies in mapping neurons efficiently to these cores. While existing approaches work well with regular, fully functional mesh structures, they falter in real-world scenarios where hardware has irregular shapes or non-functional cores caused by defects or resource fragmentation. To address these limitations, we propose a novel mapping method based on an innovative space-filling curve: the Adaptive Locality-Preserving (ALP) curve. Using a unique divide-and-conquer construction algorithm, the ALP curve ensures adaptability to meshes of any shape while maintaining crucial locality properties—essential for efficient mapping. Our method demonstrates exceptional computational efficiency, making it ideal for large-scale deployments. These distinctive characteristics enable our approach to handle complex scenarios that challenge conventional methods. Experimental results show that our method matches state-of-the-art solutions in regular-shape mapping while achieving significant improvements in irregular scenarios, reducing communication overhead by up to 57.1%.
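The divide-and-conquer flavor of such a curve can be illustrated with a simplified recursive ordering over only the functional cores of an irregular mesh; this stand-in is not the ALP construction itself, just a sketch under the same constraints (arbitrary shape, defective cores skipped).

```python
# Recursively split the core set along the longer bounding-box axis and
# concatenate the two orderings, so nearby curve positions tend to be nearby cores.

def order_cores(cores):
    if len(cores) <= 1:
        return list(cores)
    xs = [x for x, _ in cores]
    ys = [y for _, y in cores]
    axis = 0 if (max(xs) - min(xs)) >= (max(ys) - min(ys)) else 1
    ordered = sorted(cores, key=lambda c: c[axis])
    mid = len(ordered) // 2
    return order_cores(ordered[:mid]) + order_cores(ordered[mid:])

# 3x3 mesh with a defective core at (1, 1): it simply never enters the list.
functional = [(x, y) for x in range(3) for y in range(3) if (x, y) != (1, 1)]
print(order_cores(functional))
```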
Citations: 0
EdgeAIBus: AI-Driven Joint Container Management and Model Selection Framework for Heterogeneous Edge Computing
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-25 | DOI: 10.1109/TPDS.2025.3602521
Babar Ali;Muhammed Golec;Sukhpal Singh Gill;Felix Cuadrado;Steve Uhlig
Containerized Edge computing offers lightweight, reliable, and quick solutions for latency-critical Machine Learning (ML) and Deep Learning (DL) applications. Existing solutions that consider multiple Quality of Service (QoS) parameters either overlook the intricate relations among these parameters or impose significant scheduling overhead. Furthermore, reactive decision-making can damage Edge servers at peak load, incurring escalated costs and wasted computation. Resource provisioning, scheduling, and ML model selection substantially influence energy consumption, user-perceived accuracy, and delay-oriented Service Level Agreement (SLA) violations. Addressing these contrasting objectives and QoS requirements simultaneously while avoiding server faults is highly challenging in the exposed, heterogeneous, and resource-constrained Edge continuum. In this work, we propose the EdgeAIBus framework, which offers a novel joint container management and ML model selection algorithm based on the Importance Weighted Actor-Learner Architecture to optimize energy, accuracy, and SLA violations while avoiding server faults. First, the Patch Time Series Transformer (PatchTST) is used to predict the CPU usage of Edge servers, chosen for its 8.51% Root Mean Squared Error and 5.62% Mean Absolute Error. Leveraging pipelined predictions, EdgeAIBus performs consolidation, resource oversubscription, and ML/DL model switching with possible migrations to conserve energy, maximize utilization and user-perceived accuracy, and reduce SLA violations. Simulation results show that EdgeAIBus oversubscribes cluster-wide CPU by 110% with real usage reaching up to 70%, conserves 14 CPU cores, and incurs less than 1% SLA violations with a 2.54% drop in inference accuracy compared with the industry-led Model Switching Balanced load and Google Kubernetes Optimized schedulers. Google Kubernetes Engine experiments demonstrate 80% oversubscription, conservation of 14 CPU cores, 1% SLA violations, and a 3.81% accuracy loss against the same counterparts. Finally, experiments in a constrained setting show that PatchTST and EdgeAIBus can produce decisions within 100 ms on a device with one CPU core and 1 GB of memory.
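As a rough illustration of how a CPU-usage forecast can drive oversubscription and model switching, the sketch below applies assumed thresholds and model variants; it is not the framework's learned policy, only a hand-written stand-in.

```python
# Given predicted CPU usage per Edge server, decide whether to oversubscribe
# and whether to fall back to a lighter (less accurate) model variant.
# Thresholds, capacities, and model names are illustrative assumptions.

MODELS = [("resnet50", 1.00), ("mobilenet", 0.85)]   # (variant, relative accuracy)

def plan(predicted_cpu, capacity=1.0, safety=0.7):
    decisions = {}
    for server, usage in predicted_cpu.items():
        headroom = capacity * safety - usage
        model = MODELS[0] if headroom > 0.1 else MODELS[1]
        decisions[server] = {
            "oversubscribe": headroom > 0.2,   # room to pack more containers
            "model": model[0],
        }
    return decisions

forecast = {"edge-1": 0.35, "edge-2": 0.68}   # fraction of CPU predicted busy
print(plan(forecast))
```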
Citations: 0
MemTunnel: A CXL-Based Rack-Scale Host Memory Pooling Architecture for Cloud Service
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-20 | DOI: 10.1109/TPDS.2025.3598190
Tianchan Guan;Yijin Guan;Zhaoyang Du;Jiacheng Ma;Boyu Tian;Zhao Wang;Teng Ma;Zheng Liu;Yang Kong;Yuan Xie;Mingyu Gao;Guangyu Sun;Hongzhong Zheng;Dimin Niu
Memory underutilization poses a significant challenge in cloud services, leading to performance inefficiencies and resource wastage. The tightly coupled computing and memory resources in cloud servers are identified as the root cause of this problem. To address this issue, memory pooling has been the subject of extensive research for decades, providing centralized or distributed shared memory pools as flexible memory resources for various applications running on different servers. However, existing memory disaggregation solutions sacrifice memory resources, add extra hardware (such as memory boxes/blades/drives), and degrade memory performance to achieve flexibility. To overcome these limitations, this paper proposes MemTunnel, a rack-scale host memory pooling architecture that provides a low-cost memory pooling solution based on Compute Express Link (CXL). MemTunnel is the first hardware and software architecture to offer symmetric, memory-semantic memory pooling over CXL, with an FPGA-based platform to demonstrate its feasibility in a real implementation. MemTunnel is orthogonal to the existing CXL-based memory pool and provides an additional layer of abstraction for memory disaggregation. Evaluation results show that MemTunnel achieves comparable performance to the existing CXL-based memory pool for a single machine and provides better rack-scale performance with minor hardware overheads.
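The resource-sharing idea can be pictured with a toy allocator that spills allocations into peer hosts' spare memory; real MemTunnel operates at the CXL hardware/software layer, so the sketch below is only an analogy with assumed host names and sizes.

```python
# Symmetric host memory pooling in miniature: every host exposes its spare
# DRAM, and an allocation falls back to a peer's spare memory when the local
# host runs out. No extra memory boxes or blades are involved.

class Host:
    def __init__(self, name, spare_gb):
        self.name, self.spare_gb = name, spare_gb

def allocate(size_gb, local, peers):
    """Allocate from local spare memory first, then borrow from peers."""
    plan = []
    for host in [local] + peers:
        if size_gb == 0:
            break
        take = min(size_gb, host.spare_gb)
        if take > 0:
            host.spare_gb -= take
            plan.append((host.name, take))
            size_gb -= take
    if size_gb > 0:
        raise MemoryError("rack-wide spare memory exhausted")
    return plan

rack = [Host("host-A", spare_gb=4), Host("host-B", spare_gb=16)]
print(allocate(10, local=rack[0], peers=rack[1:]))   # [('host-A', 4), ('host-B', 6)]
```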
Citations: 0
Exploiting Fine-Grained Task-Level Parallelism for Variant Calling Acceleration
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-19 | DOI: 10.1109/TPDS.2025.3600285
Menghao Guo;Longlong Chen;Yichi Zhang;Hongyi Guan;Shaojun Wei;Jianfeng Zhu;Leibo Liu
Variant calling, which identifies genomic differences relative to a reference genome, is critical for understanding disease mechanisms, identifying therapeutic targets, and advancing precision medicine. However, as two critical stages in this process, serial processing in local assembly and the computational dependencies in Pair-HMM make variant calling highly time-consuming. Moreover, optimizing only one of these stages often shifts the performance bottleneck to the other. This article observes that the similarity between reads allows parallel processing in the local assembly and that alignment information from the local assembly can significantly diminish the burdensome computations in Pair-HMM. Accordingly, this article co-optimizes the software and hardware for both steps to achieve the best performance. First, we collect $k$-mer locations in each read during the local assembly process and utilize the similarity between reads to make it parallel. Second, we propose the mPair-HMM algorithm, leveraging location information to split a Pair-HMM computation task into multiple independent sub-tasks, improving the computation’s parallelism. To fully exploit the parallelism stemming from the novel algorithms, we propose an end-to-end accelerator VCAx for variant calling that accelerates both stages in collaboration. Evaluation results demonstrate that our implementation achieves up to a 7× speedup over the GPU baseline for local assembly and a 3.16× performance improvement compared to the state-of-the-art ASIC implementation for Pair-HMM.
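A simplified sketch of the two ingredients mentioned above: collecting k-mer locations per read, and using shared k-mers as anchors that split one long alignment into independent sub-tasks that can be scored in parallel. The anchor-splitting rule here is an assumption for illustration, not the actual mPair-HMM decomposition.

```python
# Index k-mer positions in each sequence, then cut the (read, haplotype) pair
# at positions where both share a k-mer, yielding independent segments.

def kmer_locations(seq, k=4):
    index = {}
    for i in range(len(seq) - k + 1):
        index.setdefault(seq[i:i + k], []).append(i)
    return index

def split_by_anchors(read, hap, k=4):
    """Return (read_segment, hap_segment) pairs between shared-k-mer anchors."""
    read_idx, hap_idx = kmer_locations(read, k), kmer_locations(hap, k)
    anchors = sorted((read_idx[kmer][0], hap_idx[kmer][0])
                     for kmer in read_idx if kmer in hap_idx)
    tasks, pr, ph = [], 0, 0
    for ar, ah in anchors:
        if ar > pr and ah > ph:
            tasks.append((read[pr:ar], hap[ph:ah]))
            pr, ph = ar, ah
    tasks.append((read[pr:], hap[ph:]))
    return tasks

print(split_by_anchors("ACGTTTGACCA", "ACGTAAGACCA"))
```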
Citations: 0
Rethinking Virtual Machines Live Migration for Memory Disaggregation
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-18 | DOI: 10.1109/TPDS.2025.3597149
Xingzi Yu;Xingguo Jia;Jin Zhang;Yun Wang;Senhao Yu;Zhengwei Qi
Resource underutilization has troubled data centers for several decades. On the CPU front, live migration plays a crucial role in reallocating CPU resources. Nevertheless, contemporary Virtual Machine (VM) live migration methods are burdened by substantial resource consumption. In terms of memory management, disaggregated memory offers an effective solution to enhance memory utilization, but leaves a gap in addressing CPU underutilization. Our findings highlight a considerable opportunity to optimize live migration in the context of disaggregated memory systems. We introduce Anemoi, a resource management system that seamlessly integrates VM live migration with memory disaggregation to address the aforementioned gap. In the context of disaggregated memory, remote memory becomes accessible from destination nodes, effectively eliminating the need for extensive network transmission of memory pages, and thereby significantly reducing migration time. In addition, we propose using memory replicas as an optimization to the live migration system. To mitigate the overhead of potential excessive memory consumption, we develop a dedicated compression algorithm. Our evaluations demonstrate that Anemoi leads to a notable 69% reduction in network bandwidth utilization and an impressive 83% reduction in migration time compared to traditional VM live migration. Additionally, our compression algorithm achieves an outstanding space-saving rate of 83.6%.
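The core saving can be illustrated with a toy migration planner that copies only the VM's locally cached pages and merely remaps pages that already live in the remote memory pool; the page classification below is assumed for illustration and is not Anemoi's actual mechanism.

```python
# Pages resident in the disaggregated pool are reachable from the destination
# node, so only 'local' pages need to cross the network during live migration.

def migration_plan(pages):
    """pages: dict page_id -> 'local' or 'remote' (i.e., in the memory pool)."""
    transfer = [p for p, loc in pages.items() if loc == "local"]
    remap = [p for p, loc in pages.items() if loc == "remote"]
    return {"copy_over_network": transfer, "remap_on_destination": remap}

vm_pages = {0: "remote", 1: "remote", 2: "local", 3: "remote", 4: "local"}
print(migration_plan(vm_pages))   # only 2 of the 5 pages are actually transferred
```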
Citations: 0
RL-Based Hybrid CPU Scaling for Soft Deadline Constrained Tasks in Container Clouds
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-08 | DOI: 10.1109/TPDS.2025.3597195
Yepeng Zhang;Haitao Zhang;Huadong Ma
Existing CPU scaling approaches have limitations that can lead to inefficient resource allocation and increased penalty costs for tasks with soft deadlines running in container clouds. First, quota allocation based approaches overlook the gap between the obtainable CPU time and allocated quota, causing inefficient CPU utilization and unexpected task behaviors. Second, core allocation based approaches ignore workload dynamics within decision intervals, potentially increasing contention for CPU time among tasks on the same core. Third, existing approaches lack strategies to allocate more resources to critical tasks that incur higher penalty costs when the node’s capacity is insufficient. This article proposes a reinforcement learning based hybrid CPU scaling approach that allocates quota and cores jointly, aiming to minimize penalty costs for timeouts. Based on the embedding generated from a fine-grained CPU demand series, we allocate CPU quotas and determine a dynamic workload-aware core sharing scheme using an attention mechanism that combines respective demands and global criticality regarding penalty costs. Additionally, we integrate the resource gap, CPU time contention, and penalty costs into the reward function to update our model online. The experimental results show the proposed approach achieves state-of-the-art performance.
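A hedged sketch of a reward combining the three signals named above, with illustrative weights and a simple linear form; the article's actual reward formulation is not reproduced here.

```python
# Reward for the scaling agent: penalize the gap between allocated quota and
# obtainable CPU time, CPU-time contention on shared cores, and timeout
# penalty costs. Weights and units are illustrative assumptions.

def reward(resource_gap, contention, penalty_cost, w=(1.0, 0.5, 2.0)):
    """Higher is better: less wasted quota, less core contention, fewer timeouts."""
    return -(w[0] * resource_gap + w[1] * contention + w[2] * penalty_cost)

# e.g., 0.3 cores of quota the task could not actually obtain, 0.2 cores of
# contended CPU time on shared cores, and a deadline-penalty cost of 1.5.
print(reward(resource_gap=0.3, contention=0.2, penalty_cost=1.5))
```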
Citations: 0
Mariana: Exploring Native SkipList Index Design for Disaggregated Memory
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-07 | DOI: 10.1109/TPDS.2025.3596988
Xing Wei;Ke Wang;Yinjun Han;Hao Jin;Yaofeng Tu;Huiqi Hu;Xuan Zhou;Minghao Zhao
Memory disaggregation has emerged as a promising architecture for improving resource efficiency by decoupling computing and memory resources. But building efficient range indices in such an architecture faces three critical challenges: (1) coarse-grained concurrency control schemes for coordinating concurrent read/write operations with node splitting incur high contention under skewed and write-intensive workloads; (2) existing data layouts fail to balance consistency verification and hardware acceleration via SIMD (Single Instruction Multiple Data); and (3) naive caching schemes struggle to adapt to rapidly changing access patterns. To address these challenges, we propose Mariana, a memory-disaggregated skiplist index that integrates three key innovations. First, it uses a fine-grained (i.e., entry-level) latch mechanism combined with dynamic node resizing to minimize contention and splitting frequency. Second, it employs a tailored leaf-node data layout that separates keys and values to enable SIMD acceleration while maintaining consistency checks with minimal write overhead. Third, it implements an adaptive caching strategy that tracks node popularity in real time to optimize network bandwidth utilization during index traversal. Experimental results show that Mariana achieves 1.7× higher throughput under write-intensive workloads and reduces P90 latency by 23% under read-intensive workloads, compared to state-of-the-art indices on disaggregated memory.
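A minimal sketch of a key/value-separated leaf with per-entry versions, using binary search as a stand-in for the SIMD key scan; the field names and versioning rule are assumptions, not Mariana's on-node format.

```python
# Keys live in a dense, sorted array that can be scanned without touching the
# values; values are fetched only on a hit; a per-entry version word supports
# fine-grained consistency checks instead of a node-level one.

import bisect

class Leaf:
    def __init__(self):
        self.keys = []        # contiguous keys: the part a SIMD scan would read
        self.values = []      # values stored separately
        self.versions = []    # per-entry version, bumped on every in-place update

    def put(self, key, value):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            self.values[i] = value
            self.versions[i] += 1        # entry-level change only
        else:
            self.keys.insert(i, key)
            self.values.insert(i, value)
            self.versions.insert(i, 0)

    def get(self, key):
        i = bisect.bisect_left(self.keys, key)
        if i < len(self.keys) and self.keys[i] == key:
            return self.values[i], self.versions[i]
        return None

leaf = Leaf()
for k in (10, 3, 7):
    leaf.put(k, f"v{k}")
print(leaf.get(7))      # ('v7', 0)
```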
Citations: 0
Online Container Caching for IoT Data Processing in Serverless Edge Computing
IF 6.0 | CAS Tier 2 (Computer Science) | JCR Q1 (Computer Science, Theory & Methods) | Pub Date: 2025-08-05 | DOI: 10.1109/TPDS.2025.3595965
Guopeng Li;Haisheng Tan;Chi Zhang;Xuan Zhang;Zhenhua Han;Guoliang Chen
Serverless edge computing is an efficient way to execute event-driven, short-duration, and bursty IoT data processing tasks on resource-limited edge servers, using on-demand resource allocation and dynamic auto-scaling. In this paradigm, function requests are handled in virtualized environments, e.g., containers. When a function request arrives online and no container is in memory to execute it, the serverless platform initializes such a container with non-negligible latency, known as a cold start. Otherwise, the request results in a warm start, which previous studies treat as incurring no latency. However, based on our experiments, we find a notable third case, called Late-Warm: when a request arrives while the container is still initializing, its latency is less than a cold start but not zero. In this paper, we study online container caching in serverless edge computing to minimize the total latency while taking Late-Warm and other practical issues into account. We propose OnCoLa, a novel $O(T_{c}K)$-competitive algorithm supporting request relaying across multiple edge servers, where $T_{c}$ and $K$ are the maximum container cold start latency and the memory size, respectively. Extensive simulations on two real-world traces demonstrate that OnCoLa consistently outperforms state-of-the-art container caching algorithms and reduces latency by 23.33%. Experiments on Raspberry Pi and Jetson Nano show that OnCoLa reduces latency by up to 21.38% compared with a representative lightweight policy.
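The cold / Late-Warm / warm distinction can be captured in a few lines, assuming an illustrative initialization time for $T_{c}$; this is a sketch of the latency model, not the OnCoLa algorithm itself.

```python
# A cold start pays the full container initialization time T_c; a Late-Warm
# request arrives while initialization is in flight and waits only for the
# remainder; a warm request pays nothing. Numbers are illustrative assumptions.

COLD_START = 4.0   # seconds to initialize a container (T_c)

def startup_latency(arrival, init_started_at=None):
    """init_started_at: when this function's container began initializing, or None."""
    if init_started_at is None:
        return COLD_START                                   # cold start
    remaining = COLD_START - (arrival - init_started_at)
    return max(0.0, remaining)                              # late-warm or warm

print(startup_latency(arrival=10.0))                        # 4.0  (cold)
print(startup_latency(arrival=10.0, init_started_at=8.5))   # 2.5  (late-warm)
print(startup_latency(arrival=10.0, init_started_at=3.0))   # 0.0  (warm)
```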
Citations: 0