Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks
Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637268
Shijie Lv;Debin Liu;Laurence T. Yang;Xiaosong Peng;Ruonan Zhao;Zecan Yang;Jun Feng
Large language models in deep learning have enormous numbers of parameters, requiring significant storage space and computational resources. Compression techniques are highly effective in addressing these challenges. With the development of hardware such as the Graphics Processing Unit (GPU), Tensor Cores can accelerate low-precision matrix multiplication, but achieving acceleration for sparse matrices is challenging: due to the sparsity, Tensor Core utilization is relatively low. To address this, we propose the Tensor Core Compressed Sparse Row (TC-CSR) format, which facilitates data loading on GPUs and matrix operations on Tensor Cores. Based on this format, we design block Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) kernels, which are common operations in deep learning. With these designs, we achieve a $1.41\times$ speedup over Sputnik at moderate sparsity and a $1.38\times$ speedup on large-scale, highly sparse matrices. Benefiting from our design, we also achieve a $1.75\times$ speedup in end-to-end inference with sparse Transformers while saving memory.
{"title":"Based on Tensor Core Sparse Kernels Accelerating Deep Neural Networks","authors":"Shijie Lv;Debin Liu;Laurence T. Yang;Xiaosong Peng;Ruonan Zhao;Zecan Yang;Jun Feng","doi":"10.1109/TPDS.2025.3637268","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637268","url":null,"abstract":"Large language models in deep learning have numerous parameters, requiring significant storage space and computational resources. Compression techniques are highly effective in addressing these challenges. With the development of hardware like Graphics Processing Unit (GPU), Tensor Core can accelerate low-precision matrix multiplication but achieve acceleration for sparse matrices is challenging. Due to its sparsity, the utilization of Tensor Cores is relatively low. To address this, we propose the based on <b>T</b>ensor <b>C</b>ore <b>C</b>ompressed <b>S</b>parse <b>R</b>ow format (TC-CSR), which facilitates data loading on GPUs and matrix operations on Tensor Cores. Based on this format, we designed block Sparse Matrix-Matrix Multiplication (SpMM) and Sampled Dense-Dense Matrix Multiplication (SDDMM) kernels, which are common operations in deep learning. Utilizing these designs, we achieved a <inline-formula><tex-math>$mathbf {1.41times }$</tex-math></inline-formula> speedup on Sputnik in scenarios of moderate sparsity and a <inline-formula><tex-math>$mathbf {1.38times }$</tex-math></inline-formula> speedup with large-scale highly sparse matrices. Benefit from our design, we achieved a <inline-formula><tex-math>$mathbf {1.75times }$</tex-math></inline-formula> speedup in end-to-end inference with sparse Transformers and save memory.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"353-364"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778306","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HashTAG With CALM: Low-Overhead Hardware Support for Inter-Task Eviction Monitoring
Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637171
Pablo Andreu;Pedro López;Carles Hernández
Multicore processors have emerged as the preferred architecture for safety-critical systems due to their significant performance advantages. However, concurrent access by multiple cores to a shared cache induces inter-core evictions that generate non-deterministic interference and compromise timing predictability. Static partitioning of the cache among cores is a well-established countermeasure that effectively eliminates such evictions but reduces flexibility and system throughput. To accurately estimate inter-core cache contention, Auxiliary Tag Directories (ATDs) are widely adopted. However, ATDs incur substantial hardware area costs, which often motivates the use of heuristic-based reductions. These reduced ATD designs, while more compact, compromise accuracy and are therefore not suitable for safety-critical domains. This paper extends the proposal of HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG introduces a safe and lightweight Auxiliary Tag Directory mechanism that tracks which cores are responsible for evicting cache lines used by others, thus measuring contention. We further refine the HashTAG approach by creating CALM, a custom-made memory allocator that significantly improves HashTAG performance in multicore systems. Our results show that no inter-task interference underprediction is possible with HashTAG, making it suitable for the safety domain. HashTAG provides a 47% reduction in the Auxiliary Tag Directory area, delivering perfect measurements in 80% of cases and only a 1% error on the maximum inter-core eviction measurements for a HashTAG tag size of ten bits.
{"title":"HashTAG With CALM: Low-Overhead Hardware Support for Inter-Task Eviction Monitoring","authors":"Pablo Andreu;Pedro López;Carles Hernández","doi":"10.1109/TPDS.2025.3637171","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637171","url":null,"abstract":"Multicore processors have emerged as the preferred architecture for safetycritical systems due to their significant performance advantages. However, concurrent access by multiple cores to a shared cache induces intercore evictions that generate nondeterministic interference and compromise timing predictability. Static partitioning of the cache among cores is a wellestablished countermeasure that effectively eliminates such evictions but reduces flexibility and system throughput. To accurately estimate inter-core cache contention, Auxiliary Tag Directories (ATDs) are widely adopted. However, ATDs incur substantial hardware area costs, which often motivates the use of heuristic-based reductions. These reduced ATD designs, while more compact, compromise accuracy and therefore are not suitable for safety-critical domains. This paper extends the proposal of HashTAG, a novel approach to accurately upper-bound inter-core eviction interference. HashTAG introduces a safe and lightweight Auxiliary Tag Directory mechanism that tracks which cores are responsible for evicting cache lines used by others, thus measuring contention. We further refine the proposed HashTAG approach by creating CALM, a custom-made memory allocator that significantly improves HashTAG performance in multicore systems. Our results show that no inter-task interference underprediction is possible with HashTAG, making it suitable for the safety domain. HashTAG provides a 47% reduction in the Auxiliary Tag Directory area, presenting perfect measurements on 80% of cases and only a 1% error on maximum inter-core eviction measurements for a HashTAG tag size of ten bits.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"340-352"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11269742","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NVMe-oF-R: Fast Recovery Design on Disaggregated Distributed Storage System
Pub Date: 2025-11-25 | DOI: 10.1109/TPDS.2025.3637057
Myoungwon Oh;Cheolho Kang;Sungmin Lee;Woojoong Kim;Yangwoo Roh;Jeong-Uk Kang;Silwan Chang
Failures in a large distributed storage system are often critical, leading to unexpected I/Os that are required to restore the system's health and ensure availability. With the advent of NVMe-oF, the disaggregation of compute and storage resources presents an opportunity to minimize the negative impact of compute failures by reattaching the storage resources. However, despite advances in hardware, modern distributed storage systems have not yet fully adapted to the disaggregated architecture, for four main reasons: (1) lack of awareness of recoverable failure events in the disaggregated architecture, (2) incorrect availability management with respect to the NVMe-oF fault domains, (3) unnecessary data-rebalance I/Os for uniform distribution triggered even after the failure is recovered, and (4) load imbalance caused by asymmetric deployment of compute resources after blind relocation for recovery. To address these challenges, we introduce NVMe-oF-R, a resilient disaggregated distributed storage architecture for fast recovery. NVMe-oF-R comprises three techniques: (1) the NVMe-oF adapter, which detects recoverable failure events and orchestrates relocation; (2) DCRUSH, a data placement strategy that considers the NVMe-oF-based disaggregation architecture; and (3) the Relocater, which efficiently relocates failed compute resources and fixes stragglers that arise after recovery. We implement NVMe-oF-R atop the storage orchestration layer in a CRUSH-based distributed storage system, Ceph. Our experimental results demonstrate that NVMe-oF-R can eliminate unnecessary recovery traffic and reduce recovery time by more than 50%.
{"title":"NVMe-oF-R: Fast Recovery Design on Disaggregated Distributed Storage System","authors":"Myoungwon Oh;Cheolho Kang;Sungmin Lee;Woojoong Kim;Yangwoo Roh;Jeong-Uk Kang;Silwan Chang","doi":"10.1109/TPDS.2025.3637057","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637057","url":null,"abstract":"Failures in a large distributed storage system are often critical, leading to unexpected I/Os that are required to restore the system’s health and ensure availability. With the advent of NVMe-oF, the disaggregation of compute and storage resources presents an opportunity to minimize the negative impact of the compute failure by reattaching the storage resources. However, despite advances in hardware, modern distributed storage systems have not yet fully adapted to the disaggregated architecture. There are four main reasons: (1) lack of awareness of recoverable failure events in the disaggregated architecture, (2) incorrect availability management with respect to the NVMe-oF fault domains, (3) unnecessary data rebalance I/Os for uniform distribution triggered even after the failure is recovered, (4) load imbalance caused by asymmetric deployment of compute resources after blind relocation for recovery. To address these challenges, we introduce <italic>NVMe-oF-R</i>, a resilient disaggregated distributed storage architecture for fast recovery. <italic>NVMe-oF-R</i> comprises three techniques: (1) <italic>NVMe-oF adapter</i>, which detects recoverable failure events and orchestrates relocation; (2) <italic>DCRUSH</i>, a data placement strategy that considers the NVMe-oF based disaggregation architecture; and (3) <italic>Relocater</i>, which efficiently relocates failed compute resources and fixes stragglers that arise after recovery. We implement <italic>NVMe-oF-R</i> atop the storage orchestration layer in a CRUSH-based distributed storage system, Ceph. Our experimental results demonstrate that <italic>NVMe-oF-R</i> can eliminate unnecessary recovery traffic and reduce recovery time by more than 50% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"380-394"},"PeriodicalIF":6.0,"publicationDate":"2025-11-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145729299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
cuFastTuckerPlusTC: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores
Pub Date: 2025-11-24 | DOI: 10.1109/TPDS.2025.3636547
Zixuan Li;Mingxing Duan;Huizhang Luo;Wangdong Yang;Kenli Li;Keqin Li
Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this article, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlusTC, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that the proposed method achieves a $2\times$ to $8\times$ improvement in convergence speed and a $3\times$ to $5\times$ improvement in per-iteration execution speed compared with state-of-the-art algorithms.
{"title":"cuFastTuckerPlusTC: A Stochastic Parallel Sparse FastTucker Decomposition Using GPU Tensor Cores","authors":"Zixuan Li;Mingxing Duan;Huizhang Luo;Wangdong Yang;Kenli Li;Keqin Li","doi":"10.1109/TPDS.2025.3636547","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3636547","url":null,"abstract":"Sparse tensors are prevalent in real-world applications, often characterized by their large-scale, high-order, and high-dimensional nature. Directly handling raw tensors is impractical due to the significant memory and computational overhead involved. The current mainstream approach involves compressing or decomposing the original tensor. One popular tensor decomposition algorithm is the Tucker decomposition. However, existing state-of-the-art algorithms for large-scale Tucker decomposition typically relax the original optimization problem into multiple convex optimization problems to ensure polynomial convergence. Unfortunately, these algorithms tend to converge slowly. In contrast, tensor decomposition exhibits a simple optimization landscape, making local search algorithms capable of converging to a global (approximate) optimum much faster. In this article, we propose the FastTuckerPlus algorithm, which decomposes the original optimization problem into two non-convex optimization problems and solves them alternately using the Stochastic Gradient Descent method. Furthermore, we introduce cuFastTuckerPlusTC, a fine-grained parallel algorithm designed for GPU platforms, leveraging the performance of tensor cores. This algorithm minimizes memory access overhead and computational costs, surpassing the state-of-the-art algorithms. Our experimental results demonstrate that the proposed method achieves a <inline-formula><tex-math>$2times$</tex-math></inline-formula> to <inline-formula><tex-math>$8times$</tex-math></inline-formula> improvement in convergence speed and a <inline-formula><tex-math>$3times$</tex-math></inline-formula> to <inline-formula><tex-math>$5times$</tex-math></inline-formula> improvement in per-iteration execution speed compared with state-of-the-art algorithms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"443-458"},"PeriodicalIF":6.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830811","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
GLPilot: Efficient Distributed GNN Training With Learnable Embeddings
Pub Date: 2025-11-24 | DOI: 10.1109/TPDS.2025.3636057
Chengru Yang;Chaoyi Ruan;Chengjie Tang;Ping Gong;Shiyi Wang;Xiang Song;Cheng Li
Graph Neural Networks (GNNs) with learnable vertex embeddings enable models to infer rich, task-specific representations even when vertex features are sparse, noisy, or missing. In large-scale multi-GPU training, dynamically updated embeddings, often orders of magnitude larger than the model parameters, severely degrade training efficiency. Specifically, loading remote embeddings and synchronizing their gradients collectively account for over 90% of per-iteration time. Traditional caching and parallelism approaches, designed for static embeddings or model parameters alone, are ineffective at mitigating this “data wall” of embedding-related transfers. To address this, we begin with a detailed analysis of vertex access patterns over training iterations and find that infrequently sampled vertices, despite incurring the majority of embedding-loading latency, undergo very few updates, making their embeddings ideal candidates for staleness reuse. Driven by this, we propose GLPilot, a novel system that mitigates embedding-related bottlenecks. GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches and a local gradient aggregation technique to minimize redundant communication during synchronization. Additionally, GLPilot utilizes an on-GPU cache for keeping mostly updated embeddings to alleviate CPU-GPU data transfer bottlenecks. Our evaluations on a 32-GPU cluster using two popular GNN models, three datasets, and two optimizers demonstrate that GLPilot consistently achieves 1.28–1.93× per-epoch training speedups compared with two strong baselines, DGL and P3, while maintaining comparable model accuracy.
{"title":"GLPilot: Efficient Distributed GNN Training With Learnable Embeddings","authors":"Chengru Yang;Chaoyi Ruan;Chengjie Tang;Ping Gong;Shiyi Wang;Xiang Song;Cheng Li","doi":"10.1109/TPDS.2025.3636057","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3636057","url":null,"abstract":"Graph Neural Networks (GNNs) with learnable vertex embeddings enable models to infer rich, task-specific representations even when vertex features are sparse, noisy, or missing. In large-scale multi-GPU training, dynamically updated embeddings, often orders of magnitude larger than model parameters, severely degrade training efficiency. Specifically, loading remote embeddings and synchronizing their gradients collectively account for over 90% of per-iteration time. Traditional caching and parallelism approaches, designed for static embeddings or model parameters alone, are ineffective at mitigating this “data wall” of embedding-related transfers. To address this, we begin with a detailed analysis of vertex access patterns over training iterations and find that infrequently sampled vertices, despite incurring the majority of embedding-loading latency, undergo very few updates, making their embeddings ideal candidates for staleness reuse. Driven by this, we propose GLPilot, a novel system that mitigates embedding-related bottlenecks. GLPilot introduces a staleness-bounded embedding buffering mechanism to reduce remote fetches and a local gradient aggregation technique to minimize redundant communications during synchronization. Additionally, GLPilot utilizes an on-GPU cache for keeping mostly updated embeddings to alleviate CPU-GPU data transfer bottlenecks. Our evaluations on a 32-GPU cluster using two popular GNN models, three datasets and two optimizers demonstrate that GLPilot consistently achieves 1.28–1.93× per-epoch training speedups, in comparison with two strong baselines such as DGL and P3, while maintaining comparable model accuracy.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"489-503"},"PeriodicalIF":6.0,"publicationDate":"2025-11-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830806","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully Decentralized Data Distribution for Large-Scale HPC Systems
Pub Date: 2025-11-17 | DOI: 10.1109/TPDS.2025.3633298
Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu
For many years in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have had to increase the number of data providers to improve I/O parallelism and match the data demanders. In large-scale, especially exascale, HPC systems, this mode of decoupling demanders and providers presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and offers the best scalability; we call this the all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on the computing networks of HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and relieve the tracker load with a neighborhood local-generation algorithm. Experimental results show that FD3 scales smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, this work can further stimulate the exploration of future distributed parallel file systems and provides a foundation and inspiration for the design of data access patterns for exascale HPC systems.
{"title":"Fully Decentralized Data Distribution for Large-Scale HPC Systems","authors":"Ruibo Wang;Mingtian Shao;Wenzhe Zhang;Huijun Wu;Jiaxin Li;Lihua Yang;Di Ma;Yiqin Dai;Kai Lu","doi":"10.1109/TPDS.2025.3633298","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3633298","url":null,"abstract":"For many years, in the HPC data distribution scenario, as the scale of the HPC system continues to increase, manufacturers have to increase the number of data providers to improve the IO parallelism to match the data demanders. In large-scale, especially exascale HPC systems, this mode of decoupling the demander and provider presents significant scalability limitations and incurs substantial costs. In our view, only a distribution model in which the demander also acts as the provider can fundamentally cope with changes in scale and have the best scalability, which is called all-to-all data distribution mode in this paper. We design and implement the BitTorrent protocol on computing networks in HPC systems and propose FD3, a fully decentralized data distribution method. We design the Requested-to-Validated Table (RVT) and the Highest ranking and Longest consecutive piece segment First (HLF) policy based on the features of the HPC networking environment to improve the performance of FD3. In addition, we design a torrent-tree to accelerate the distribution of seed file data and the aggregation of distribution state, and release the tracker load with neighborhood local-generation algorithm. Experimental results show that FD3 can scale smoothly to 11k+ computing nodes, and its performance is much better than that of the parallel file system. Compared with the original BitTorrent, the performance is improved by 8-15 times. FD3 highlights the considerable potential of the all-to-all model in HPC data distribution scenarios. Furthermore, the work of this paper can further stimulate the exploration of future distributed parallel file systems and provide a foundation and inspiration for the design of data access patterns for Exscale HPC systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"304-321"},"PeriodicalIF":6.0,"publicationDate":"2025-11-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674842","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM
Pub Date: 2025-11-12 | DOI: 10.1109/TPDS.2025.3632073
Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li
Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, among others. The Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, make accelerating GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multiple dataflows and high bandwidth memory (HBM), named DAHBM-GCN. First, we design a computing engine that supports multiple dataflows: aggregation-first and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on a decision tree is proposed to select the optimal dataflow computing engine. Second, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Third, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model's computational complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that, across various datasets, DAHBM-GCN achieves average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the FPGA-based AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56×, respectively, on various datasets. Additionally, DAHBM-GCN offers high flexibility and low energy consumption.
{"title":"DAHBM-GCN: A Flexible Graph Convolution Network Accelerator With Multiple Dataflows and HBM","authors":"Xian Zhang;Guoqing Xiao;Jiapeng Zhang;Mingxing Duan;Kenli Li","doi":"10.1109/TPDS.2025.3632073","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632073","url":null,"abstract":"Graph-structured data has been widely applied in transportation, molecular, and e-commerce networks, etc. Graph Convolutional Network (GCN) has emerged as an efficient approach to processing non-Euclidean graph data. However, the varying sizes and sparsity of graph datasets, coupled with the dependency of the dataflow patterns in GCN computation on the graph data, have rendered the acceleration of GCN inference increasingly challenging. This paper proposes a GCN inference accelerator based on multi-dataflow and high bandwidth memory (HBM), named DAHBM-GCN. Firstly, we designed a computing engine that supports multiple dataflows, aggregation-first, and combination-first orders. Furthermore, an adaptive selector for the multi-dataflow computing engine based on the decision tree is proposed to select the optimal dataflow computing engine. Secondly, an efficient mapping of pseudo channels (PCs) for multi-channel HBM is devised to enhance bandwidth, effectively alleviating memory latency and bandwidth bottlenecks. Thirdly, a hybrid fixed-point quantization strategy for GCN is introduced, which reduces the GCN model’s computation complexity and parameter count with almost no loss of accuracy. Finally, extensive performance evaluation experiments demonstrate that across various datasets, DAHBM-GCN achieved average speedups of 52.5–129.3× and 4.9–7.9× compared to PyG-GCN and DGL-GCN on CPU, respectively. Compared to the AWB-GCN, HyGCN, HLS-GCN, and GCNAX accelerators FPGA-based, DAHBM-GCN also exhibits average speedups of 1.21-2.21×, 1.25-1.98×, 1.65-2.68×, and 1.18-1.56× respectively, on various datasets. Additionally, DAHBM-GCN possesses the advantages of high flexibility and low energy consumption.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"213-229"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674829","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity
Pub Date: 2025-11-12 | DOI: 10.1109/TPDS.2025.3632089
Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers
Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for computation-communication-separated orchestration to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.
{"title":"HyFaaS: Accelerating Serverless Workflows by Unleashing Hybrid Resource Elasticity","authors":"Xiaofei Yue;Song Yang;Fan Li;Liehuang Zhu;Xu Wang;Zhen Feng;Fernando A. Kuipers","doi":"10.1109/TPDS.2025.3632089","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3632089","url":null,"abstract":"Serverless computing promises fine-grained resource elasticity and billing, making it an attractive way to build complex applications as multi-stage workflows. Nonetheless, existing workflow orchestration ignores the heterogeneous demands of the computation and communication parts within a stage, potentially resulting in resource inefficiency on either side. In this paper, we advocate for <italic>computation-communication-separated orchestration</i> to unleash hybrid resource (i.e., compute and network) elasticity. We present HyFaaS, a serverless workflow orchestrator that improves performance while ensuring cost efficiency. It seamlessly decouples computation and communication as a series of hybrid stages re-expressed within HyDAG, a novel workflow abstraction. HyFaaS uses a gray-box profiling model to identify their Pareto-optimal saturated configurations, and then deploys the saturated workflow to juggle communication and scaling overheads through two-level HyDAG partitioning. Along with event-driven runtime fine-tuning, HyFaaS further scales down the non-critical stages to reduce cost via branch-aware coordination. Experimental results show that HyFaaS surpasses existing solutions by 32.7%–50.4% on end-to-end latency, while lowering cost by up to 1.37×.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"272-286"},"PeriodicalIF":6.0,"publicationDate":"2025-11-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers
Pub Date: 2025-11-11 | DOI: 10.1109/TPDS.2025.3631654
Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen
The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.
{"title":"D3T: Dual-Timescale Optimization of Task Scheduling and Thermal Management for Energy Efficient Geo-Distributed Data Centers","authors":"Yongyi Ran;Hui Yin;Tongyao Sun;Xin Zhou;Jiangtao Luo;Shuangwu Chen","doi":"10.1109/TPDS.2025.3631654","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631654","url":null,"abstract":"The surge of artificial intelligence (AI) has intensified compute-intensive tasks, sharply increasing the need for energy-efficient management in geo-distributed data centers. Existing approaches struggle to coordinate task scheduling and cooling control due to mismatched time constants, stochastic Information Technology (IT) workloads, variable renewable energy, and fluctuating electricity prices. To address these challenges, we propose D3T, a dual-timescale deep reinforcement learning (DRL) framework that jointly optimizes task scheduling and thermal management for energy-efficient geo-distributed data centers. At the fast timescale, D3T employs Deep Q-Network (DQN) to schedule tasks, reducing operational expenditure (OPEX) and task sojourn time. At the slow timescale, a QMIX-based multi-agent DRL method regulates cooling across distributed data centers by dynamically adjusting airflow rates, thereby preventing hotspots and reducing energy waste. Extensive experiments were conducted using TRNSYS with real-world traces, and the results demonstrate that, compared to baseline algorithms, D3T reduces OPEX by 13% in IT subsystems and 29% in cooling subsystems, improves power usage effectiveness (PUE) by 7%, and maintains more stable thermal safety across geo-distributed data centers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"230-246"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674746","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
How to Evaluate Distributed Coordination Systems?–A Survey and Analysis
Pub Date: 2025-11-11 | DOI: 10.1109/TPDS.2025.3631614
Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas
Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers and researchers either use a standard NoSQL benchmark and omit evaluating consistency, distribution, and fault tolerance, or create their own ad hoc microbenchmarks and forgo comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for evaluating the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.
{"title":"How to Evaluate Distributed Coordination Systems?–A Survey and Analysis","authors":"Bekir Turkkan;Elvis Rodrigues;Tevfik Kosar;Aleksey Charapko;Ailidani Ailijiang;Murat Demirbas","doi":"10.1109/TPDS.2025.3631614","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3631614","url":null,"abstract":"Coordination services and protocols are critical components of distributed systems and are essential for providing consistency, fault tolerance, and scalability. However, due to the lack of standard benchmarking and evaluation tools for distributed coordination services, coordination service developers/researchers either use a NoSQL standard benchmark and omit evaluating consistency, distribution, and fault tolerance; or create their own ad-hoc microbenchmarks and skip comparability with other services. In this study, we analyze and compare the evaluation mechanisms for known and widely used consensus algorithms, distributed coordination services, and distributed applications built on top of these services. We identify the most important requirements of distributed coordination service benchmarking, such as the metrics and parameters for the evaluation of the performance, scalability, availability, and consistency of these systems. Finally, we discuss why the existing benchmarks fail to address the complex requirements of distributed coordination system evaluation.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 1","pages":"198-212"},"PeriodicalIF":6.0,"publicationDate":"2025-11-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145674791","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}