Pub Date : 2024-08-28 DOI: 10.1109/TNSM.2024.3450964
Qianqian Wu;Qiang Liu;Wenliang Zhu;Zefan Wu
With the advancement of technologies such as 5G, Unmanned Aerial Vehicles (UAVs) have exhibited their potential in various application scenarios, including wireless coverage, search operations, and disaster response. In this paper, we consider the utilization of a group of UAVs as aerial base stations (BSs) to collect data from IoT sensor devices. The objective is to maximize the volume of collected data while simultaneously enhancing geographical fairness among the points of interest, all within the constraints of limited energy resources. To this end, we propose a deep reinforcement learning (DRL) method based on Graph Attention Networks (GAT), referred to as “GADRL”. GADRL utilizes graph convolutional neural networks to extract spatial correlations among multiple UAVs and makes decisions in a distributed manner under the guidance of DRL. Furthermore, we employ Long Short-Term Memory to establish memory units for storing and utilizing historical information. Numerical results demonstrate that GADRL consistently outperforms four baseline methods, validating its computational efficiency.
{"title":"Energy Efficient UAV-Assisted IoT Data Collection: A Graph-Based Deep Reinforcement Learning Approach","authors":"Qianqian Wu;Qiang Liu;Wenliang Zhu;Zefan Wu","doi":"10.1109/TNSM.2024.3450964","DOIUrl":"10.1109/TNSM.2024.3450964","url":null,"abstract":"With the advancements in technologies such as 5G, Unmanned Aerial Vehicles (UAVs) have exhibited their potential in various application scenarios, including wireless coverage, search operations, and disaster response. In this paper, we consider the utilization of a group of UAVs as aerial base stations (BS) to collect data from IoT sensor devices. The objective is to maximize the volume of collected data while simultaneously enhancing the geographical fairness among these points of interest, all within the constraints of limited energy resources. Therefore, we propose a deep reinforcement learning (DRL) method based on Graph Attention Networks (GAT), referred to as “GADRL”. GADRL utilizes graph convolutional neural networks to extract spatial correlations among multiple UAVs and makes decisions in a distributed manner under the guidance of DRL. Furthermore, we employ Long Short-Term Memory to establish memory units for storing and utilizing historical information. Numerical results demonstrate that GADRL consistently outperforms four baseline methods, validating its computational efficiency.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6082-6094"},"PeriodicalIF":4.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187274","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-28 DOI: 10.1109/TNSM.2024.3451295
Madyan Alsenwi;Eva Lagunas;Symeon Chatzinotas
Next-generation (NextG) cellular networks are expected to evolve towards virtualization and openness, incorporating reprogrammable components that facilitate intelligence and real-time analytics. This paper builds on these innovations to address the network slicing problem in multi-cell open radio access wireless networks, focusing on two key services: enhanced Mobile BroadBand (eMBB) and Ultra-Reliable Low Latency Communications (URLLC). A stochastic resource allocation problem is formulated with the goal of balancing the average eMBB data rate and its variance while ensuring URLLC constraints. A distributed learning framework based on Deep Reinforcement Learning (DRL) is developed following the Open Radio Access Network (O-RAN) architecture to solve the formulated optimization problem. The proposed approach trains a global machine learning model at a central cloud server and shares it with edge servers for execution. Specifically, deep learning agents are distributed at network edge servers and embedded within the Near-Real-Time RAN Intelligent Controller (Near-RT RIC) to collect network information and perform online execution. The global model is trained by a central training engine embedded within the Non-Real-Time RIC (Non-RT RIC) at the central server using data received from the edge servers. Simulation results validate the efficacy of the proposed algorithm in meeting URLLC constraints while maintaining the eMBB Quality of Service (QoS).
{"title":"Distributed Learning Framework for eMBB-URLLC Multiplexing in Open Radio Access Networks","authors":"Madyan Alsenwi;Eva Lagunas;Symeon Chatzinotas","doi":"10.1109/TNSM.2024.3451295","DOIUrl":"10.1109/TNSM.2024.3451295","url":null,"abstract":"Next-generation (NextG) cellular networks are expected to evolve towards virtualization and openness, incorporating reprogrammable components that facilitate intelligence and real-time analytics. This paper builds on these innovations to address the network slicing problem in multi-cell open radio access wireless networks, focusing on two key services: enhanced Mobile BroadBand (eMBB) and Ultra-Reliable Low Latency Communications (URLLC). A stochastic resource allocation problem is formulated with the goal of balancing the average eMBB data rate and its variance, while ensuring URLLC constraints. A distributed learning framework based on the Deep Reinforcement Learning (DRL) technique is developed following the Open Radio Access Networks (O-RAN) architectures to solve the formulated optimization problem. The proposed learning approach enables training a global machine learning model at a central cloud server and sharing it with edge servers for executions. Specifically, deep learning agents are distributed at network edge servers and embedded within the Near-Real-Time Radio access network Intelligent Controller (Near-RT RIC) to collect network information and perform online executions. A global deep learning model is trained by a central training engine embedded within the Non-Real-Time RIC (Non-RT RIC) at the central server using received data from edge servers. The performed simulation results validate the efficacy of the proposed algorithm in achieving URLLC constraints while maintaining the eMBB Quality of Service (QoS).","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 5","pages":"5718-5732"},"PeriodicalIF":4.7,"publicationDate":"2024-08-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-27 DOI: 10.1109/TNSM.2024.3450670
Xiaoyang Zhao;Chuan Wu;Xia Zhu
Distributed deep learning (DL) training constitutes a significant portion of workloads in modern data centers that are equipped with high computational capacities, such as GPU servers. However, frequent tensor exchanges among workers during distributed deep neural network (DNN) training can result in heavy traffic in the data center network, leading to congestion at server NICs and in the switching network. Unfortunately, none of the existing DL communication libraries support active flow control to optimize tensor transmission performance; instead, they rely on passive adjustments to the congestion window or sending rate based on packet loss or delay. To address this issue, we propose a per-host flow scheduler that dynamically tunes the sending rates of outgoing tensor flows from each server, maximizing network bandwidth utilization and expediting job training progress. Our scheduler comprises two main components: a monitoring module that interacts with state-of-the-art communication libraries supporting the parameter server and all-reduce paradigms to track the training progress of DNN jobs, and a congestion control protocol that receives in-network feedback from traversing switches and computes optimized flow sending rates. For data centers where switches are not programmable, we provide a software solution that emulates switch behavior and interacts with the scheduler on servers. Experiments on a real-world GPU testbed and trace-driven simulations demonstrate that our scheduler outperforms common rate control protocols and representative learning-based schemes in various settings.
{"title":"Dynamic Flow Scheduling for DNN Training Workloads in Data Centers","authors":"Xiaoyang Zhao;Chuan Wu;Xia Zhu","doi":"10.1109/TNSM.2024.3450670","DOIUrl":"10.1109/TNSM.2024.3450670","url":null,"abstract":"Distributed deep learning (DL) training constitutes a significant portion of workloads in modern data centers that are equipped with high computational capacities, such as GPU servers. However, frequent tensor exchanges among workers during distributed deep neural network (DNN) training can result in heavy traffic in the data center network, leading to congestion at server NICs and in the switching network. Unfortunately, none of the existing DL communication libraries support active flow control to optimize tensor transmission performance, instead relying on passive adjustments to the congestion window or sending rate based on packet loss or delay. To address this issue, we propose a flow scheduler per host that dynamically tunes the sending rates of outgoing tensor flows from each server, maximizing network bandwidth utilization and expediting job training progress. Our scheduler comprises two main components: a monitoring module that interacts with state-of-the-art communication libraries supporting parameter server and all-reduce paradigms to track the training progress of DNN jobs, and a congestion control protocol that receives in-network feedback from traversing switches and computes optimized flow sending rates. For data centers where switches are not programmable, we provide a software solution that emulates switch behavior and interacts with the scheduler on servers. Experiments with real-world GPU testbed and trace-driven simulation demonstrate that our scheduler outperforms common rate control protocols and representative learning-based schemes in various settings.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6643-6657"},"PeriodicalIF":4.7,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187211","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-27 DOI: 10.1109/TNSM.2024.3450597
Jin Ye;Tiantian Yu;Zhaoyi Li;Jiawei Huang
In recent years, motivated by new datacenter applications and the well-known shortcomings of TCP in data centers, many receiver-driven transport protocols have been proposed to provide ultra-low latency and zero packet loss through proactive congestion control. However, in scenarios with mixed short and long flows, short flows with an ON/OFF pattern generate micro-burst traffic, which significantly degrades the performance of existing receiver-driven transport protocols. First, when short flows turn into ON mode, long flows cannot immediately concede bandwidth to the short ones, resulting in queue buildup and even packet loss. Second, when short flows change from ON to OFF mode, the released bandwidth cannot be fully utilized by the long flows, leading to serious bandwidth waste. To address these issues, we propose a new receiver-driven transport protocol, called SAR, which predicts the micro-bursts generated by short flows and adjusts the sending rate of long flows accordingly. With the aid of the micro-burst prediction mechanism, SAR mitigates bandwidth competition upon the arrival of short flows and alleviates bandwidth waste when the short flows leave. Testbed and NS2 simulation experiments demonstrate that SAR reduces the average flow completion time (AFCT) by up to 66% compared to typical receiver-driven transport protocols.
{"title":"SAR: Receiver-Driven Transport Protocol With Micro-Burst Prediction in Data Center Networks","authors":"Jin Ye;Tiantian Yu;Zhaoyi Li;Jiawei Huang","doi":"10.1109/TNSM.2024.3450597","DOIUrl":"10.1109/TNSM.2024.3450597","url":null,"abstract":"In recent years, motivated by new datacenter applications and the well-known shortcomings of TCP in data center, many receiver-driven transport protocols have been proposed to provide ultra-low latency and zero packet loss by using the proactive congestion control. However, in the scenario of mixed short and long flows, the short flows with ON/OFF pattern generate micro-burst traffic, which significantly deteriorates the performance of existing receiver-driven transport protocols. Firstly, when the short flows turn into ON mode, the long flows cannot immediately concede bandwidth to the short ones, resulting in queue buildup and even packet loss. Secondly, when the short flows change from ON to OFF mode, the released bandwidth cannot be fully utilized by the long flows, leading to serious bandwidth waste. To address these issues, we propose a new receiver-driven transport protocol, called SAR, which predicts the micro burst generated by short flows and adjusts the sending rate of long flows accordingly. With the aid of micro-burst prediction mechanism, SAR mitigates the bandwidth competition due to the arrival of short flows, and alleviates the bandwidth waste when the short flows leave. The testbed and NS2 simulation experiments demonstrate that SAR reduces the average flow completion time (AFCT) by up to 66% compared to typical receiver-driven transport protocols.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6409-6422"},"PeriodicalIF":4.7,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-27 DOI: 10.1109/TNSM.2024.3450596
Jinbin Hu;Zikai Zhou;Jin Zhang
In modern datacenter networks (DCNs), mainstream congestion control (CC) mechanisms essentially rely on Explicit Congestion Notification (ECN) to reflect congestion. The traditional static ECN threshold performs poorly under dynamic scenarios, and setting a proper ECN threshold under various traffic patterns is challenging and time-consuming. The recently proposed reinforcement learning (RL) based ECN tuning algorithm (ACC) consumes a large amount of computational resources, making it difficult to deploy on switches. In this paper, we present a lightweight and hierarchical automated ECN tuning algorithm called LAECN, which can fully exploit the performance benefits of deep reinforcement learning with ultra-low overhead. Simulation results show that LAECN improves performance significantly by reducing latency and increasing throughput under stable network conditions, and it also delivers consistently high performance in network environments dominated by small flows. For example, LAECN improves throughput by up to 47%, 34%, 32%, and 24% over DCQCN, TIMELY, HPCC, and ACC, respectively.
{"title":"Lightweight Automatic ECN Tuning Based on Deep Reinforcement Learning With Ultra-Low Overhead in Datacenter Networks","authors":"Jinbin Hu;Zikai Zhou;Jin Zhang","doi":"10.1109/TNSM.2024.3450596","DOIUrl":"10.1109/TNSM.2024.3450596","url":null,"abstract":"In modern datacenter networks (DCNs), mainstream congestion control (CC) mechanisms essentially rely on Explicit Congestion Notification (ECN) to reflect congestion. The traditional static ECN threshold performs poorly under dynamic scenarios, and setting a proper ECN threshold under various traffic patterns is challenging and time-consuming. The recently proposed reinforcement learning (RL) based ECN Tuning algorithm (ACC) consumes a large number of computational resources, making it difficult to deploy on switches. In this paper, we present a lightweight and hierarchical automated ECN tuning algorithm called LAECN, which can fully exploit the performance benefits of deep reinforcement learning with ultra-low overhead. The simulation results show that LAECN improves performance significantly by reducing latency and increasing throughput in stable network conditions, and also shows consistent high performance in small flows network environments. For example, LAECN effectively improves throughput by up to 47%, 34%, 32% and 24% over DCQCN, TIMELY, HPCC and ACC, respectively.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6398-6408"},"PeriodicalIF":4.7,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-26 DOI: 10.1109/TNSM.2024.3449699
Xiwen Jie;Jiangping Han;Guanglei Chen;Hang Wang;Peilin Hong;Kaiping Xue
Nowadays, Remote Direct Memory Access (RDMA) is gaining popularity in data centers for its low CPU overhead, high throughput, and ultra-low latency. As one of the state-of-the-art RDMA Congestion Control (CC) mechanisms, HPCC leverages In-band Network Telemetry (INT) to achieve accurate control and significantly shortens the Flow Completion Time (FCT) of short flows. However, redundant INT information increases processing latency at switches and reduces flows’ throughput. Besides, its end-to-end feedback mechanism is not timely enough to help senders cope well with bursty traffic, and there remains a high probability of triggering Priority-based Flow Control (PFC) pauses under large-scale incast. In this paper, we propose a Congestion-Aware (CA) control mechanism called CACC, which attempts to push CC toward theoretically minimal INT overhead and PFC pause delay. CACC introduces two CA algorithms that quantize switch buffer and egress port congestion separately, along with a fine-grained window size adjustment algorithm at the sender. Specifically, the buffer CA algorithm perceives large-scale congestion that may trigger PFC pauses and provides early feedback, significantly reducing the PFC pause delay. The egress port CA algorithm perceives the link state and selectively inserts useful INT data, achieving lower queue sizes and reducing the average per-packet overhead from 42 bytes to 2 bits. In our evaluation, compared with HPCC, PINT, and Bolt, CACC shortens the average and tail FCT by up to 27% and 60.1%, respectively.
{"title":"CACC: A Congestion-Aware Control Mechanism to Reduce INT Overhead and PFC Pause Delay","authors":"Xiwen Jie;Jiangping Han;Guanglei Chen;Hang Wang;Peilin Hong;Kaiping Xue","doi":"10.1109/TNSM.2024.3449699","DOIUrl":"10.1109/TNSM.2024.3449699","url":null,"abstract":"Nowadays, Remote Direct Memory Access (RDMA) is gaining popularity in data centers for low CPU overhead, high throughput, and ultra-low latency. As one of the state-of-the-art RDMA Congestion Control (CC) mechanisms, HPCC leverages the In-band Network Telemetry (INT) features to achieve accurate control and significantly shortens the Flow Completion Time (FCT) for short flows. However, there exists redundant INT information increasing the processing latency at switches and affecting flows’ throughput. Besides, its end-to-end feedback mechanism is not timely enough to help senders cope well with bursty traffic, and there still exists a high probability of triggering Priority-based Flow Control (PFC) pauses under large-scale incast. In this paper, we propose a Congestion-Aware (CA) control mechanism called CACC, which attempts to push CC to the theoretical low INT overhead and PFC pause delay. CACC introduces two CA algorithms to quantize switch buffer and egress port congestion, separately, along with a fine-grained window size adjustment algorithm at the sender. Specifically, the buffer CA algorithm perceives large-scale congestion that may trigger PFC pauses and provides early feedback, significantly reducing the PFC pause delay. The egress port CA algorithm perceives the link state and selectively inserts useful INT data, achieving lower queue sizes and reducing the average overhead per packet from 42 bytes to 2 bits. In our evaluation, compared with HPCC, PINT, and Bolt, CACC shortens the average and tail FCT by up to 27% and 60.1%, respectively.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6382-6397"},"PeriodicalIF":4.7,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-08-26 DOI: 10.1109/TNSM.2024.3449575
Hui Wang;Zhenyu Yang;Ming Li;Xiaowei Zhang;Yanlan Hu;Donghui Hu
As the origin of blockchains, the Nakamoto Consensus protocol remains the primary protocol for many public blockchains used in cryptocurrencies (e.g., Bitcoin). Decentralization is a core feature of blockchains, yet it is difficult to strike a balance between scalability and security: many approaches to improving blockchain scalability diminish security or compromise the decentralized nature of the system. Inspired by network science, and in particular by epidemic models, we address this problem by modeling the propagation of transactions and blocks as two interacting epidemics, which we call the CoSIS model. We extend the transaction propagation process to increase the efficiency of block propagation, which reduces the number of unknown transactions; reducing the block propagation latency ultimately increases blockchain throughput. The theory of complex networks is employed to derive an optimal boundary condition. Finally, node scores are stored in the chain, so that the scheme also provides a new incentive approach. Our experiments show that CoSIS accelerates block propagation and raises TPS by 20% to 33%