Pub Date: 2024-08-27 | DOI: 10.1109/TNSM.2024.3450597
Jin Ye;Tiantian Yu;Zhaoyi Li;Jiawei Huang
In recent years, motivated by new datacenter applications and the well-known shortcomings of TCP in data centers, many receiver-driven transport protocols have been proposed to provide ultra-low latency and zero packet loss through proactive congestion control. However, in scenarios that mix short and long flows, short flows with an ON/OFF pattern generate micro-burst traffic, which significantly degrades the performance of existing receiver-driven transport protocols. First, when short flows switch to ON mode, long flows cannot immediately concede bandwidth to them, resulting in queue buildup and even packet loss. Second, when short flows switch from ON to OFF mode, the released bandwidth cannot be fully utilized by long flows, leading to serious bandwidth waste. To address these issues, we propose a new receiver-driven transport protocol, called SAR, which predicts the micro-bursts generated by short flows and adjusts the sending rate of long flows accordingly. With the aid of its micro-burst prediction mechanism, SAR mitigates bandwidth competition when short flows arrive and alleviates bandwidth waste when they leave. Testbed and NS2 simulation experiments demonstrate that SAR reduces the average flow completion time (AFCT) by up to 66% compared to typical receiver-driven transport protocols.
{"title":"SAR: Receiver-Driven Transport Protocol With Micro-Burst Prediction in Data Center Networks","authors":"Jin Ye;Tiantian Yu;Zhaoyi Li;Jiawei Huang","doi":"10.1109/TNSM.2024.3450597","DOIUrl":"10.1109/TNSM.2024.3450597","url":null,"abstract":"In recent years, motivated by new datacenter applications and the well-known shortcomings of TCP in data center, many receiver-driven transport protocols have been proposed to provide ultra-low latency and zero packet loss by using the proactive congestion control. However, in the scenario of mixed short and long flows, the short flows with ON/OFF pattern generate micro-burst traffic, which significantly deteriorates the performance of existing receiver-driven transport protocols. Firstly, when the short flows turn into ON mode, the long flows cannot immediately concede bandwidth to the short ones, resulting in queue buildup and even packet loss. Secondly, when the short flows change from ON to OFF mode, the released bandwidth cannot be fully utilized by the long flows, leading to serious bandwidth waste. To address these issues, we propose a new receiver-driven transport protocol, called SAR, which predicts the micro burst generated by short flows and adjusts the sending rate of long flows accordingly. With the aid of micro-burst prediction mechanism, SAR mitigates the bandwidth competition due to the arrival of short flows, and alleviates the bandwidth waste when the short flows leave. The testbed and NS2 simulation experiments demonstrate that SAR reduces the average flow completion time (AFCT) by up to 66% compared to typical receiver-driven transport protocols.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6409-6422"},"PeriodicalIF":4.7,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187208","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-27 | DOI: 10.1109/TNSM.2024.3450596
Jinbin Hu;Zikai Zhou;Jin Zhang
In modern datacenter networks (DCNs), mainstream congestion control (CC) mechanisms essentially rely on Explicit Congestion Notification (ECN) to reflect congestion. A traditional static ECN threshold performs poorly in dynamic scenarios, and setting a proper ECN threshold for various traffic patterns is challenging and time-consuming. The recently proposed reinforcement learning (RL) based ECN tuning algorithm (ACC) consumes substantial computational resources, making it difficult to deploy on switches. In this paper, we present LAECN, a lightweight, hierarchical, automated ECN tuning algorithm that fully exploits the performance benefits of deep reinforcement learning with ultra-low overhead. Simulation results show that LAECN significantly improves performance by reducing latency and increasing throughput under stable network conditions, and it maintains consistently high performance in environments dominated by small flows. For example, LAECN improves throughput by up to 47%, 34%, 32%, and 24% over DCQCN, TIMELY, HPCC, and ACC, respectively.
{"title":"Lightweight Automatic ECN Tuning Based on Deep Reinforcement Learning With Ultra-Low Overhead in Datacenter Networks","authors":"Jinbin Hu;Zikai Zhou;Jin Zhang","doi":"10.1109/TNSM.2024.3450596","DOIUrl":"10.1109/TNSM.2024.3450596","url":null,"abstract":"In modern datacenter networks (DCNs), mainstream congestion control (CC) mechanisms essentially rely on Explicit Congestion Notification (ECN) to reflect congestion. The traditional static ECN threshold performs poorly under dynamic scenarios, and setting a proper ECN threshold under various traffic patterns is challenging and time-consuming. The recently proposed reinforcement learning (RL) based ECN Tuning algorithm (ACC) consumes a large number of computational resources, making it difficult to deploy on switches. In this paper, we present a lightweight and hierarchical automated ECN tuning algorithm called LAECN, which can fully exploit the performance benefits of deep reinforcement learning with ultra-low overhead. The simulation results show that LAECN improves performance significantly by reducing latency and increasing throughput in stable network conditions, and also shows consistent high performance in small flows network environments. For example, LAECN effectively improves throughput by up to 47%, 34%, 32% and 24% over DCQCN, TIMELY, HPCC and ACC, respectively.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6398-6408"},"PeriodicalIF":4.7,"publicationDate":"2024-08-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187210","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-26 | DOI: 10.1109/TNSM.2024.3449699
Xiwen Jie;Jiangping Han;Guanglei Chen;Hang Wang;Peilin Hong;Kaiping Xue
Nowadays, Remote Direct Memory Access (RDMA) is gaining popularity in data centers for its low CPU overhead, high throughput, and ultra-low latency. As one of the state-of-the-art RDMA Congestion Control (CC) mechanisms, HPCC leverages In-band Network Telemetry (INT) to achieve accurate control and significantly shortens the Flow Completion Time (FCT) of short flows. However, redundant INT information increases processing latency at switches and reduces flow throughput. Besides, its end-to-end feedback mechanism is not timely enough for senders to cope well with bursty traffic, and a high probability of triggering Priority-based Flow Control (PFC) pauses remains under large-scale incast. In this paper, we propose a Congestion-Aware (CA) control mechanism called CACC, which attempts to push CC toward theoretically minimal INT overhead and PFC pause delay. CACC introduces two CA algorithms, which quantize switch buffer congestion and egress port congestion separately, along with a fine-grained window size adjustment algorithm at the sender. Specifically, the buffer CA algorithm perceives large-scale congestion that may trigger PFC pauses and provides early feedback, significantly reducing the PFC pause delay. The egress port CA algorithm perceives the link state and selectively inserts useful INT data, achieving lower queue sizes and reducing the average overhead per packet from 42 bytes to 2 bits. In our evaluation, compared with HPCC, PINT, and Bolt, CACC shortens the average and tail FCT by up to 27% and 60.1%, respectively.
{"title":"CACC: A Congestion-Aware Control Mechanism to Reduce INT Overhead and PFC Pause Delay","authors":"Xiwen Jie;Jiangping Han;Guanglei Chen;Hang Wang;Peilin Hong;Kaiping Xue","doi":"10.1109/TNSM.2024.3449699","DOIUrl":"10.1109/TNSM.2024.3449699","url":null,"abstract":"Nowadays, Remote Direct Memory Access (RDMA) is gaining popularity in data centers for low CPU overhead, high throughput, and ultra-low latency. As one of the state-of-the-art RDMA Congestion Control (CC) mechanisms, HPCC leverages the In-band Network Telemetry (INT) features to achieve accurate control and significantly shortens the Flow Completion Time (FCT) for short flows. However, there exists redundant INT information increasing the processing latency at switches and affecting flows’ throughput. Besides, its end-to-end feedback mechanism is not timely enough to help senders cope well with bursty traffic, and there still exists a high probability of triggering Priority-based Flow Control (PFC) pauses under large-scale incast. In this paper, we propose a Congestion-Aware (CA) control mechanism called CACC, which attempts to push CC to the theoretical low INT overhead and PFC pause delay. CACC introduces two CA algorithms to quantize switch buffer and egress port congestion, separately, along with a fine-grained window size adjustment algorithm at the sender. Specifically, the buffer CA algorithm perceives large-scale congestion that may trigger PFC pauses and provides early feedback, significantly reducing the PFC pause delay. The egress port CA algorithm perceives the link state and selectively inserts useful INT data, achieving lower queue sizes and reducing the average overhead per packet from 42 bytes to 2 bits. In our evaluation, compared with HPCC, PINT, and Bolt, CACC shortens the average and tail FCT by up to 27% and 60.1%, respectively.","PeriodicalId":13423,"journal":{"name":"IEEE Transactions on Network and Service Management","volume":"21 6","pages":"6382-6397"},"PeriodicalIF":4.7,"publicationDate":"2024-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142187216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-08-26 | DOI: 10.1109/TNSM.2024.3449575
Hui Wang;Zhenyu Yang;Ming Li;Xiaowei Zhang;Yanlan Hu;Donghui Hu
As the origin of blockchains, the Nakamoto consensus protocol is the primary protocol for many public blockchains (e.g., Bitcoin) used in cryptocurrencies. Decentralization is a core feature of blockchains, yet it is difficult to strike a balance between scalability and security: many approaches to improving blockchain scalability diminish security or compromise the decentralized nature of the system. Inspired by network science, and the epidemic model in particular, we address this problem by modeling the propagation of transactions and blocks as two interacting epidemics, which we call the CoSIS model. We extend the transaction propagation process to increase the efficiency of block propagation, reducing the number of unknown transactions; the resulting reduction in block propagation latency ultimately increases blockchain throughput. The theory of complex networks is employed to provide an optimal boundary condition. Finally, node scores are stored on the chain, which also provides a new incentive mechanism. Our experiments show that CoSIS accelerates block propagation and raises TPS by 20% ~ 33%.
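A toy mean-field version of two coupled SIS epidemics shows the interaction the abstract describes: transaction coverage raises the effective spreading rate of blocks, since nodes that already hold a block's transactions can relay and validate it faster. The equations and parameters below are illustrative assumptions, not the CoSIS model itself.

```python
# Toy coupled-SIS sketch (assumed dynamics, not the paper's model):
# transaction propagation is one epidemic whose coverage accelerates the
# second epidemic, block propagation.

def simulate(beta_tx=0.4, beta_blk=0.3, coupling=0.5,
             delta=0.05, steps=50, dt=1.0):
    x_tx, x_blk = 0.01, 0.01     # infected fractions: tx-aware, block-aware
    for _ in range(steps):
        # Classic mean-field SIS drift: infection minus recovery.
        dx_tx = beta_tx * x_tx * (1 - x_tx) - delta * x_tx
        # Block spreading rate grows with transaction coverage x_tx:
        # fewer unknown transactions means faster block relay.
        eff_beta = beta_blk * (1 + coupling * x_tx)
        dx_blk = eff_beta * x_blk * (1 - x_blk) - delta * x_blk
        x_tx += dt * dx_tx
        x_blk += dt * dx_blk
    return x_tx, x_blk

print(simulate())                # with coupling: faster block coverage
print(simulate(coupling=0.0))    # baseline without the interaction term
```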