AdaptChain: Adaptive Data Sharing and Synchronization for NFV Systems on Heterogeneous Architectures
Pub Date : 2024-03-13 | DOI: 10.1109/TPDS.2024.3400594
Kai Zhang;Jiahui Hong;Zhengying He;Yinan Jing;X. Sean Wang
In a Network Function Virtualization (NFV) system, network functions (NFs) are implemented on general-purpose hardware, including CPUs, GPUs, and FPGAs. Studies have shown that there is no one-size-fits-all processor: each processor offers performance advantages for certain types of NFs. With more general-purpose processors such as GPUs being deployed in data center servers, the best practice for building a high-performance NFV service chain is to employ the available heterogeneous processors. However, current NFV systems fail to utilize these processors for acceleration, because their separate memory spaces demand data synchronization to guarantee correctness, which incurs non-trivial overhead and results in low performance. This paper proposes AdaptChain, a data management facility that enables adaptive data sharing and synchronization for hybrid NFV systems on heterogeneous architectures. AdaptChain shares the host and device memory among NFs in a service chain. With adaptive synchronization plan generation and NF code adaptation, AdaptChain exploits three classes of opportunities to reduce the amount of synchronized data while guaranteeing correctness. Experimental results show that AdaptChain improves overall throughput by up to 3.2× and reduces latency by up to 52%.
{"title":"AdaptChain: Adaptive Data Sharing and Synchronization for NFV Systems on Heterogeneous Architectures","authors":"Kai Zhang;Jiahui Hong;Zhengying He;Yinan Jing;X. Sean Wang","doi":"10.1109/TPDS.2024.3400594","DOIUrl":"10.1109/TPDS.2024.3400594","url":null,"abstract":"In a Network Function Virtualization (NFV) system, network functions (NFs) are implemented on general-purpose hardware, including CPU, GPU, and FPGA. Studies have shown that there is no one-size-fits-all processor, as each processor demonstrates performance advantages to implement certain types of NFs. With more general-purpose processors such as GPUs being deployed in data center servers, the best practice to build a high-performance NFV service chain should employ available heterogeneous processors. However, current NFV systems fail to utilize these processors for acceleration. This is because, due to separate memory spaces, data synchronization is demanded to guarantee correctness, which can incur non-trivial overhead and result in low performance. This paper proposes AdaptChain, a data management facility that enables adaptive data sharing and synchronization for hybrid NFV systems on heterogeneous architectures. AdaptChain shares the host and device memory among NFs in a service chain. With adaptive synchronization plan generation and NF code adaptation, AdaptChain exploits three classes of opportunities to reduce the amount of synchronized data while guaranteeing correctness. Experimental results show that AdaptChain improves the overall throughput by up to 3.2× and reduces the latency by up to 52%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":null,"pages":null},"PeriodicalIF":5.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140934474","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bayesian-Driven Automated Scaling in Stream Computing With Multiple QoS Targets
Pub Date : 2024-03-13 | DOI: 10.1109/TPDS.2024.3399834
Liang Zhang;Wenli Zheng;Kuangyu Zheng;Hongzi Zhu;Chao Li;Minyi Guo
Stream processing systems commonly rely on auto-scaling to ensure resource efficiency and quality of service (QoS). Existing auto-scaling solutions allocate resources inaccurately because they rely on static QoS-resource models that fail to account for high workload variability, and on indirect metrics that carry much distracting information. Moreover, different types of QoS metrics exhibit different characteristics and thus require individual auto-scaling methods. In this paper, we propose a versatile auto-scaling solution for operator-level parallelism configuration, called AuTraScale+, to meet throughput, processing-time latency, and event-time latency targets. AuTraScale+ follows the Bayesian optimization framework to make scaling decisions. First, it uses a Gaussian process model to eliminate the negative influence of uncertain factors on the accuracy of the performance model. Second, it leverages an expected-improvement-based (EI-based) acquisition function to search for and recommend the optimal configuration quickly. In addition, to make accurate scaling decisions before a new model is ready, AuTraScale+ employs a transfer learning algorithm that estimates the benefits of all configurations at a new input rate from existing models and recommends the optimal one. We implement and evaluate AuTraScale+ on the Flink platform. Experimental results on three representative workloads demonstrate that, compared with state-of-the-art methods, AuTraScale+ reduces resource consumption by 66.6% and 36.7% in the scale-down and scale-up scenarios, respectively, while achieving the throughput and processing-time latency targets. Compared with other methods of optimizing event-time latency, AuTraScale+ saves 26.9% of resources on average.
{"title":"Bayesian-Driven Automated Scaling in Stream Computing With Multiple QoS Targets","authors":"Liang Zhang;Wenli Zheng;Kuangyu Zheng;Hongzi Zhu;Chao Li;Minyi Guo","doi":"10.1109/TPDS.2024.3399834","DOIUrl":"10.1109/TPDS.2024.3399834","url":null,"abstract":"Stream processing systems commonly work with auto-scaling to ensure resource efficiency and quality of service (QoS). Existing auto-scaling solutions lack accuracy in resource allocation because they rely on static QoS-resource models that fail to account for high workload variability and use indirect metrics with much distractive information. Moreover, different types of QoS metrics present different characteristics and thus need individual auto-scaling methods. In this paper, we propose a versatile auto-scaling solution for operator-level parallelism configuration, called AuTraScale+, to meet the throughput, processing-time latency, and event-time latency targets. AuTraScale+ follows the Bayesian optimization framework to make scaling decisions. First, it uses the Gaussian process model to eliminate the negative influence of uncertain factors on the performance model accuracy. Second, it leverages the expected improvement-based (EI-based) acquisition function to search and recommend the optimal configuration quickly. Besides, to make a more accurate scaling decision when the new model is not ready, AuTraScale+ proposes a transfer learning algorithm to estimate the benefits of all configurations at a new rate based on existing models and then recommend the optimal one. We implement and evaluate AuTraScale+ on the Flink platform. The experimental results on three representative workloads demonstrate that compared with the state-of-the-art methods, AuTraScale+ can reduce 66.6% and 36.7% resource consumption, respectively, in the scale-down and scale-up scenarios while achieving their throughput and processing-time latency targets. Compared with other methods of optimizing event-time latency, AuTraScale+ saves 26.9% of resources on average.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":null,"pages":null},"PeriodicalIF":5.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140934475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint
Pub Date : 2024-03-13 | DOI: 10.1109/TPDS.2024.3400365
Daniela Loreti;Marcello Artioli;Anna Ciampolini
The scale of today's High Performance Computing (HPC) systems is the key element behind their impressive performance, as well as the reason for their relatively limited reliability. Over the last decade, specific areas of HPC research have addressed the issue at different levels, by enriching the infrastructure, the platforms, or the algorithms with fault-tolerance features. In this work, we focus on the pervasive task of computing the solution of a dense, unstructured linear system, and we propose an algorithm-based technique that tolerates multiple faults, located anywhere, during the parallel computation. In particular, we study ways to boost the performance of the rollback-free recovery, and we provide an extensive evaluation of our technique with respect to other state-of-the-art algorithm-based methods.
{"title":"Rollback-Free Recovery for a High Performance Dense Linear Solver With Reduced Memory Footprint","authors":"Daniela Loreti;Marcello Artioli;Anna Ciampolini","doi":"10.1109/TPDS.2024.3400365","DOIUrl":"10.1109/TPDS.2024.3400365","url":null,"abstract":"The scale of nowadays High Performance Computing (HPC) systems is the key element that determines the achievement of impressive performance, as well as the reason for their relatively limited reliability. Over the last decade, specific areas of the High Performance Computing (HPC) research field have addressed the issue at different levels, by enriching the infrastructure, the platforms, or the algorithms with fault tolerance features. In this work, we focus on the rather-pervasive task of computing the solution of a dense, unstructured linear system and we propose an algorithm-based technique to obtain fault tolerance to multiple anywhere-located faults during the parallel computation. We particularly study the ways to boost the performance of the rollback-free recovery, and we provide an extensive evaluation of our technique w.r.t. to other state-of-the-art algorithm-based methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":null,"pages":null},"PeriodicalIF":5.3,"publicationDate":"2024-03-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=10530061","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140934538","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Analytical Modeling and Throughput Computation of Blockchain Sharding
Pub Date : 2024-03-12 | DOI: 10.1109/TPDS.2024.3376452
Pourya Soltani;Farid Ashtiani
Sharding has shown great potential to scale out blockchains. It divides nodes into smaller groups, which allows for partial transaction processing, relaying, and storage. Hence, instead of running one blockchain, multiple blockchains run in parallel, each called a shard. Sharding can be applied to address shortcomings caused by the compulsory duplication of three resources in blockchains: computation, communication, and storage. The most pressing issue in blockchains today is throughput. In this paper, we propose new queueing-theoretic models to derive the maximum throughput of sharded blockchains. We consider two cases, a fully sharded blockchain and computation sharding. We model each with a queueing network that exploits signals to account for block production as well as multi-destination cross-shard transactions. We ensure that every queue in our models satisfies quasi-reversibility, so that the models fall into the category of product-form queueing networks. We then obtain a closed-form solution for the maximum stable throughput of these systems with respect to block size, block rate, the number of destinations in transactions, and the number of shards. Comparing the results obtained from the two sharding systems, we conclude that the extent of sharding in different domains plays a significant role in scalability.
{"title":"Analytical Modeling and Throughput Computation of Blockchain Sharding","authors":"Pourya Soltani;Farid Ashtiani","doi":"10.1109/TPDS.2024.3376452","DOIUrl":"10.1109/TPDS.2024.3376452","url":null,"abstract":"Sharding has shown great potential to scale out blockchains. It divides nodes into smaller groups which allow for partial transaction processing, relaying and storage. Hence, instead of running one blockchain, we will run multiple blockchains in parallel, and call each one a shard. Sharding can be applied to address shortcomings due to compulsory duplication of three resources in blockchains, i.e., computation, communication and storage. The most pressing issue in blockchains today is throughput. In this paper, we propose new queueing-theoretic models to derive the maximum throughput of sharded blockchains. We consider two cases, a fully sharded blockchain and a computation sharding. We model each with a queueing network that exploits signals to account for block production as well as multi-destination cross-shard transactions. We make sure quasi-reversibility for every queue in our models is satisfied so that they fall into the category of product-form queueing networks. We then obtain a closed-form solution for the maximum stable throughput of these systems with respect to block size, block rate, number of destinations in transactions and the number of shards. Comparing the results obtained from the two introduced sharding systems, we conclude that the extent of sharding in different domains plays a significant role in scalability.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":null,"pages":null},"PeriodicalIF":5.3,"publicationDate":"2024-03-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140126411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent advances in deep learning are driven by the growing scale of computation, data, and models. However, efficiently training large-scale models on distributed systems requires an intricate combination of data, operator, and pipeline parallelism, which places a heavy burden on machine learning practitioners. To this end, we propose AutoDDL, a distributed training framework that automatically explores and exploits new parallelization schemes with near-optimal bandwidth cost. AutoDDL facilitates the description and implementation of different schemes by utilizing OneFlow's Split, Broadcast, and Partial-value (SBP) abstraction.
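As a rough illustration of why such automatic exploration pays off, the toy search below enumerates a data-parallel versus tensor-parallel choice per layer and ranks schemes by a simple ring-collective bandwidth model. The layer sizes, byte counts, and two-way choice are invented for the example and bear no relation to AutoDDL's actual search space or cost model.

```python
# Toy per-layer parallelization search ranked by a 2(p-1)/p ring-collective
# bandwidth model. Layer shapes and the cost model are illustrative only.
from itertools import product

P = 16                 # total devices
BATCH = 1024           # samples per step
# (name, weight elements, activation elements per sample), hypothetical MLP
layers = [("mlp1", 4096 * 4096, 4096), ("mlp2", 4096 * 4096, 4096)]

def ring_volume(nbytes, p):
    """Bytes moved per device by a ring allreduce/allgather of nbytes."""
    return 2 * (p - 1) / p * nbytes if p > 1 else 0.0

def cost(scheme):
    total = 0.0
    for (_, weights, acts), choice in zip(layers, scheme):
        if choice == "data":    # gradients allreduced across all P devices
            total += ring_volume(4 * weights, P)
        else:                   # "tensor": activations gathered every step
            total += ring_volume(4 * acts * BATCH, P)
    return total

best = min(product(["data", "tensor"], repeat=len(layers)), key=cost)
print("cheapest scheme:", best, f"({cost(best) / 1e6:.1f} MB per step)")
```

Exhaustive enumeration is only feasible for toy cases like this; the value of a framework in this space lies in pruning and navigating the combinatorial space of per-layer choices automatically.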