Pub Date : 2024-11-11 DOI: 10.1109/TCC.2024.3488275
COCSN: A Multi-Tiered Cascaded Optical Circuit Switching Network for Data Center
Shuo Li;Huaxi Gu;Xiaoshan Yu;Hua Huang;Songyan Wang;Zeshan Chang
IEEE Transactions on Cloud Computing, vol. 12, no. 4, pp. 1463-1475
A cascaded network represents a classic scaling-out model in traditional electrical switching networks. Recent proposals have integrated optical circuit switching at specific tiers of these networks to reduce power consumption and enhance topological flexibility. Extending optical circuit switching across multiple cascaded tiers is expected to amplify these advantages. The main challenges fall into two categories. First, an architecture with sufficient connectivity is required to support varying workloads. Second, network reconfiguration becomes more complex and necessitates a low-complexity scheduling algorithm. In this work, we propose COCSN, a multi-tiered cascaded optical circuit switching network architecture for data centers. COCSN employs wavelength-selective switches (WSSs) that integrate multiple wavelengths to enhance network connectivity. We formulate a mathematical model covering lightpath establishment, network reconfiguration, and reconfiguration goals, and propose theorems to optimize the model. Based on these theorems, we introduce an over-subscription-supported wavelength-by-wavelength scheduling algorithm, facilitating agile establishment of lightpaths in COCSN tailored to communication demand. This algorithm effectively addresses scheduling complexity and mitigates the issue of lengthy WSS configuration times. Simulation studies investigate the impact of flow length, WSS reconfiguration time, and communication domain on COCSN, verifying its significantly lower complexity and superior performance over classical cascaded networks.
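The abstract does not give the scheduling algorithm itself, but the wavelength-by-wavelength idea can be sketched as follows. This is an illustrative assumption, not COCSN's actual algorithm: on each wavelength, a source or destination port can carry at most one lightpath, so scheduling one wavelength at a time reduces to greedily picking a conflict-free subset of the remaining demands.

```python
def schedule(demands, num_wavelengths):
    """Greedy wavelength-by-wavelength lightpath assignment (illustrative).

    demands: list of (src, dst) port pairs.
    Returns (assignment dict mapping (src, dst) -> wavelength, blocked demands).
    """
    assignment = {}
    remaining = list(demands)
    for w in range(num_wavelengths):
        used_src, used_dst = set(), set()
        still_waiting = []
        for (src, dst) in remaining:
            # On wavelength w, each source and each destination appears at most once.
            if src not in used_src and dst not in used_dst:
                assignment[(src, dst)] = w
                used_src.add(src)
                used_dst.add(dst)
            else:
                still_waiting.append((src, dst))
        remaining = still_waiting
    return assignment, remaining  # leftover demands are blocked

demands = [(0, 1), (0, 2), (1, 2), (2, 1)]
assigned, blocked = schedule(demands, num_wavelengths=2)
```

Scheduling per wavelength rather than per flow is what keeps the per-round decision cheap; a real scheduler would also account for WSS reconfiguration time and over-subscription, which this sketch omits.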
Pub Date : 2024-10-17 DOI: 10.1109/TCC.2024.3482574
Aggregate Monitoring for Geo-Distributed Kubernetes Cluster Federations
Chih-Kai Huang;Guillaume Pierre
IEEE Transactions on Cloud Computing, vol. 12, no. 4, pp. 1449-1462
Distributed monitoring is an essential functionality that allows large cluster federations to efficiently schedule applications on a set of available geo-distributed resources. However, periodically reporting the precise status of each available server is both unnecessary for accurate scheduling and unscalable as the number of servers grows. This paper proposes Acala, an aggregate monitoring framework for geo-distributed Kubernetes cluster federations that provides the management cluster with aggregated information about each entire member cluster instead of individual servers. Based on actual deployments under a controlled environment in the geo-distributed Grid’5000 testbed, our evaluations show that Acala reduces cross-cluster network traffic by up to 97% and scrape duration by up to 55% in the single-member-cluster experiment. Our solution also decreases cross-cluster network traffic by 95% and memory resource consumption by 83% in multiple-member-cluster scenarios. A comparison of scheduling efficiency with and without data aggregation shows that aggregation has minimal effects on the system's scheduling function. These results indicate that our approach is superior to the existing solution and is suitable for large-scale geo-distributed Kubernetes cluster federation environments.
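A minimal sketch of the aggregation idea: a member cluster collapses per-server samples into one cluster-level record before reporting to the management cluster. The field names and the choice of aggregates below are assumptions for illustration, not Acala's actual data model.

```python
def aggregate(node_metrics):
    """Collapse per-node samples into one cluster-level summary (illustrative).

    node_metrics: list of dicts with 'cpu_free' (cores) and 'mem_free' (GiB).
    """
    return {
        "nodes": len(node_metrics),
        "cpu_free_total": sum(m["cpu_free"] for m in node_metrics),
        "mem_free_total": sum(m["mem_free"] for m in node_metrics),
        # Max free capacity on any single node, so the federation scheduler
        # can still check whether the largest pending pod fits somewhere.
        "cpu_free_max": max(m["cpu_free"] for m in node_metrics),
    }

metrics = [{"cpu_free": 2.0, "mem_free": 4.0},
           {"cpu_free": 6.0, "mem_free": 8.0}]
summary = aggregate(metrics)
```

The traffic saving follows directly: one record crosses the WAN per scrape instead of one per server, which is why the reported reduction grows with cluster size.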
Pub Date : 2024-10-17 DOI: 10.1109/TCC.2024.3482865
Group Formation and Sampling in Group-Based Hierarchical Federated Learning
Jiyao Liu;Xuanzhang Liu;Xinliang Wei;Hongchang Gao;Yu Wang
IEEE Transactions on Cloud Computing, vol. 12, no. 4, pp. 1433-1448
Hierarchical federated learning has emerged as a pragmatic approach to addressing scalability, robustness, and privacy concerns within distributed machine learning, particularly in the context of edge computing. This hierarchical method involves grouping clients at the edge, where the constitution of client groups significantly impacts overall learning performance, influenced by both the benefits obtained and costs incurred during group operations (such as group formation and group training). This is especially true for edge and mobile devices, which are more sensitive to computation and communication overheads. The formation of groups is critical for group-based hierarchical federated learning but often neglected by researchers, especially in the realm of edge systems. In this paper, we present a comprehensive exploration of a group-based federated edge learning framework utilizing the hierarchical cloud-edge-client architecture and employing probabilistic group sampling. Our theoretical analysis of its convergence rate, considering the characteristics of client groups, reveals the pivotal role played by group heterogeneity in achieving convergence. Building on this insight, we introduce new methods for group formation and group sampling, aiming to mitigate data heterogeneity within groups and enhance the convergence and overall performance of federated learning. Our proposed methods are validated through extensive experiments, demonstrating their superiority over current algorithms in terms of prediction accuracy and training cost.
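One way to picture heterogeneity-aware group formation (a made-up illustration, not the paper's method): place each client into the group whose current label distribution overlaps least with the client's own labels, which tends to spread classes evenly across groups.

```python
from collections import Counter

def form_groups(client_labels, num_groups):
    """Greedy heterogeneity-reducing group assignment (illustrative).

    client_labels: one list of class labels per client.
    Returns the group index chosen for each client, in order.
    """
    group_counts = [Counter() for _ in range(num_groups)]
    group_sizes = [0] * num_groups
    membership = []
    for labels in client_labels:
        c = Counter(labels)
        # Prefer the group that holds the fewest of this client's labels;
        # break ties toward the smaller group.
        def score(g):
            overlap = sum(min(c[k], group_counts[g][k]) for k in c)
            return (overlap, group_sizes[g])
        g = min(range(num_groups), key=score)
        group_counts[g].update(c)
        group_sizes[g] += len(labels)
        membership.append(g)
    return membership

# Two clients holding only class 0 and two holding only class 1
# end up split across the two groups, so each group sees both classes.
membership = form_groups([[0, 0], [0, 0], [1, 1], [1, 1]], num_groups=2)
```

Group sampling would then draw groups per round (e.g., with probabilities proportional to data size); the abstract's convergence analysis motivates making each group individually as close to the global distribution as possible.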
Pub Date : 2024-10-16 DOI: 10.1109/TCC.2024.3482178
HEXO: Offloading Long-Running Compute- and Memory-Intensive Workloads on Low-Cost, Low-Power Embedded Systems
Pierre Olivier;A K M Fazla Mehrab;Sandeep Errabelly;Stefan Lankes;Mohamed Lamine Karaoui;Robert Lyerly;Sang-Hoon Kim;Antonio Barbalace;Binoy Ravindran
IEEE Transactions on Cloud Computing, vol. 12, no. 4, pp. 1415-1432
OS-capable embedded systems with very low power consumption are available at an extremely low price point, which makes them highly compelling in a datacenter context. We show that sharing long-running, compute-intensive datacenter workloads between a server machine and one or a few connected embedded boards of negligible cost and power consumption can yield significant performance and energy benefits. Our approach, named Heterogeneous EXecution Offloading (HEXO), selectively offloads Virtual Machines (VMs) from server-class machines to embedded boards. Our design tackles several challenges. We address the Instruction Set Architecture (ISA) difference between typical servers (x86) and embedded systems (ARM) through hypervisor- and guest-OS-level support for heterogeneous-ISA runtime VM migration. We cope with the scarce resources of embedded systems by using lightweight VMs – unikernels – and by using the server's free RAM as remote memory for the embedded boards through Netswap, a transparent lightweight memory disaggregation mechanism for heterogeneous server-embedded clusters. VMs are offloaded based on an estimate of the slowdown expected from running on a given board. We build a prototype of HEXO and demonstrate significant increases in throughput (up to 67%) and energy efficiency (up to 56%) using benchmarks representative of compute-intensive long-running workloads.
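The slowdown-gated offloading decision can be sketched as below. The estimator itself (a CPU-speed ratio inflated by a remote-memory paging penalty) and every threshold are assumptions for illustration; HEXO's actual model is not given in the abstract.

```python
def should_offload(server_ips, board_ips, remote_mem_fraction,
                   paging_penalty=3.0, max_slowdown=4.0):
    """Offload a VM only if the estimated slowdown is acceptable (illustrative).

    server_ips / board_ips: instructions-per-second of server and board.
    remote_mem_fraction: share of the VM's working set served over a
    Netswap-like remote-memory path rather than board-local RAM.
    """
    cpu_slowdown = server_ips / board_ips
    # Remote-memory accesses cost extra; model that as a linear inflation.
    mem_factor = 1.0 + remote_mem_fraction * paging_penalty
    return cpu_slowdown * mem_factor <= max_slowdown

# A board 3x slower than the server, with 10% of the working set remote:
ok = should_offload(3e9, 1e9, remote_mem_fraction=0.1)
```

The point of the gate is that long-running, compute-bound VMs tolerate a bounded slowdown in exchange for the board's negligible power draw, while memory-thrashing VMs stay on the server.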
Pub Date : 2024-10-15 DOI: 10.1109/TCC.2024.3481039
Zihan Gao;Peixiao Zheng;Wanming Hao;Shouyi Yang
Collaborative cloud computing (CCC) has emerged as a promising paradigm to support computation-intensive and delay-sensitive applications by leveraging mobile edge computing (MEC) and mobile cloud computing (MCC) technologies. However, the coupling between multiple variables and subtask dependencies within an application poses significant challenges to the computation offloading mechanism. To address this, we investigate the computation offloading problem for CCC by jointly optimizing offloading decisions, resource allocation, and subtask scheduling across a multi-core edge server. First, we exploit latency to design a subtask dependency model within the application. Next, we formulate a System Energy-Time Cost ( $SETC$