
Latest Publications: IEEE Transactions on Parallel and Distributed Systems

Performance Portability Assessment in Gaia
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-22 | DOI: 10.1109/TPDS.2025.3591452
Giulio Malenza;Valentina Cesare;Marco Edoardo Santimaria;Robert Birke;Alberto Vecchiato;Ugo Becciani;Marco Aldinucci
Modern scientific experiments produce ever-increasing amounts of data, soon requiring ExaFLOPs computing capacities for analysis. Reaching such performance requires purpose-built supercomputers with $O(10^{3})$ nodes, each hosting multicore CPUs and multiple GPUs, and applications designed to exploit this hardware optimally. Given that each supercomputer is generally a one-off project, the need for computing frameworks portable across diverse CPU and GPU architectures without performance losses is increasingly compelling. We investigate the performance portability of a real-world application: the solver module of the AVU–GSR pipeline for the ESA Gaia mission. This code finds the astrometric parameters of $\sim 10^{8}$ stars in the Milky Way using the LSQR iterative algorithm. LSQR is widely used to solve linear systems of equations across a wide range of high-performance computing applications, elevating the study beyond its astrophysical relevance. The code is memory-bound, with six main compute kernels implementing sparse matrix-by-vector products. We optimize the previous CUDA implementation and port the code to six further GPU-acceleration frameworks: C++ PSTL, SYCL, OpenMP, HIP, KOKKOS, and OpenACC. We evaluate each framework’s performance portability across multiple GPUs (NVIDIA and AMD) and problem sizes in terms of application and architectural efficiency. Architectural efficiency is estimated through the roofline model of the six most computationally expensive GPU kernels. Our results show that C++ library-based (C++ PSTL and KOKKOS), pragma-based (OpenMP and OpenACC), and language-specific (CUDA, HIP, and SYCL) frameworks achieve increasingly better performance portability across the supported platforms, with larger problem sizes providing better performance-portability scores due to higher GPU occupancies.
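The LSQR algorithm at the heart of this solver touches its matrix only through sparse matrix-vector products, which is why the kernels are memory-bound. A minimal sketch of the idea (not the paper's implementation): LSQR is algebraically equivalent to conjugate gradients on the normal equations (CGNR), shown here in plain NumPy on a toy sparse least-squares system.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy over-determined sparse system A x = b (~5% nonzeros), standing in
# for the much larger astrometric system solved in the AVU-GSR pipeline.
m, n = 200, 50
A = rng.standard_normal((m, n))
A[rng.random((m, n)) > 0.05] = 0.0
x_true = rng.standard_normal(n)
b = A @ x_true

def cgnr(A, b, iters=500, tol=1e-10):
    """CG on the normal equations A^T A x = A^T b.

    LSQR is algebraically equivalent but numerically more stable; both
    access A only via the products A @ v and A.T @ u -- the sparse
    mat-vec kernels that dominate the solver's runtime."""
    x = np.zeros(A.shape[1])
    r = b - A @ x          # residual of the least-squares system
    z = A.T @ r            # gradient of 0.5 * ||b - A x||^2
    p = z.copy()
    zz = z @ z
    for _ in range(iters):
        Ap = A @ p
        alpha = zz / (Ap @ Ap)
        x += alpha * p
        r -= alpha * Ap
        z = A.T @ r
        zz_new = z @ z
        if np.sqrt(zz_new) < tol:
            break
        p = z + (zz_new / zz) * p
        zz = zz_new
    return x

x = cgnr(A, b)
print(np.linalg.norm(A @ x - b))
```

A production solver would store A in a compressed sparse format (CSR) and run the two mat-vecs on the GPU; the frameworks compared in the paper differ mainly in how those kernels are expressed.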
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 10, pp. 2045–2057. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11090032
Citations: 0
Task Scheduling in Geo-Distributed Computing: A Survey
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-21 | DOI: 10.1109/TPDS.2025.3591010
Yujian Wu;Shanjiang Tang;Ce Yu;Bin Yang;Chao Sun;Jian Xiao;Hutong Wu;Jinghua Feng
Geo-distributed computing, a paradigm that assigns computational tasks to globally distributed nodes, has emerged as a promising approach in cloud computing, edge computing, cloud-edge computing, and supercomputer computing (SC). It enables low-latency services, ensures data locality, and handles large-scale applications. As global computing capacity and task demands increase rapidly, scheduling tasks for efficient execution in geo-distributed computing systems has become an increasingly critical research challenge. It arises from the inherent characteristics of geographic distribution, including heterogeneous network conditions, region-specific resource pricing, and varying computational capabilities across locations. Researchers have developed diverse task scheduling methods tailored to geo-distributed scenarios, aiming to achieve objectives such as performance enhancement, fairness assurance, and fault-tolerance improvement. This survey provides a comprehensive and systematic review of task scheduling techniques across four major distributed computing environments, with an in-depth analysis of these approaches based on their core scheduling objectives. Through our analysis, we identify key research challenges and outline promising directions for advancing task scheduling in geo-distributed computing.
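The scheduling objectives the survey names (latency, region-specific pricing, load) can be made concrete with a deliberately tiny greedy placer. This is an illustrative toy, not any surveyed algorithm: each task goes to the region minimizing a weighted sum of network latency, resource price, and a load-balancing penalty; the region names and weights are invented for the example.

```python
# Hypothetical region catalog: latency from the user and per-unit price.
regions = {
    "us-east":  {"latency_ms": 20, "price": 1.0},
    "eu-west":  {"latency_ms": 45, "price": 0.7},
    "ap-south": {"latency_ms": 90, "price": 0.5},
}

def schedule(tasks, regions, alpha=1.0, beta=50.0):
    """Greedily place each task on the currently cheapest region.

    alpha weights latency, beta weights price; the load term crudely
    stands in for varying computational capability across locations.
    Real geo-distributed schedulers also target fairness and
    fault tolerance under non-stationary conditions."""
    load = {r: 0 for r in regions}
    plan = {}
    for task in tasks:
        def cost(r):
            c = regions[r]
            return alpha * c["latency_ms"] + beta * c["price"] + 10 * load[r]
        best = min(regions, key=cost)
        load[best] += 1
        plan[task] = best
    return plan

plan = schedule([f"t{i}" for i in range(6)], regions)
print(plan)
```

Even this toy shows the core tension the survey analyzes: the cheapest region is not the closest one, so the optimal placement shifts as weights and load change.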
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 10, pp. 2073–2088.
Citations: 0
Doing More With Less: Balancing Probing Costs and Task Offloading Efficiency At the Network Edge
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-18 | DOI: 10.1109/TPDS.2025.3590368
Xishuo Li;Shan Zhang;Tie Ma;Zhiyuan Wang;Hongbin Luo
In decentralized edge computing environments, user devices need to perceive the status of neighboring devices, including computational availability and communication delays, to optimize task offloading decisions. However, probing the real-time status of all devices introduces significant overhead, and probing only a few devices can lead to suboptimal decision-making, considering the massive connectivity and non-stationarity of edge networks. Aiming to balance the status probing cost and task offloading performance, we study the joint transmission and computation status probing problem, where the status and offloading delay on edge devices are characterized by general, bounded, and non-stationary distributions. The problem is proved to be NP-hard, even with known offloading delay distributions. To handle this case, we design an efficient offline method that guarantees a $(1-1/e)$ approximation ratio by leveraging the submodularity of the expected offloading delay function. Furthermore, for scenarios with unknown and non-stationary offloading delay distributions, we reformulate the problem using the piecewise-stationary combinatorial multi-armed bandit framework and develop a change-point detection-based online status probing (CD-OSP) algorithm. CD-OSP can detect environmental changes in a timely manner and update probing strategies using the proposed offline method and estimated offloading delay distributions. We prove that CD-OSP achieves a regret of $\mathcal{O}(NV\sqrt{T\ln T})$, with $N$, $V$, and $T$ denoting the numbers of stationary periods, edge devices, and time slots, respectively. Extensive simulations and testbed experiments demonstrate that CD-OSP significantly outperforms state-of-the-art baselines, reducing the probing cost by up to 16.18× with a 2.14× increase in the offloading delay.
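The greedy-with-submodularity idea behind the offline method can be sketched in a few lines. This is an assumed form of the objective, not the paper's exact formulation: choose k devices to probe so that the expected delay of offloading to the best probed device is minimized; because that expected-minimum objective is submodular, greedy selection retains a (1 - 1/e)-style guarantee.

```python
import random

random.seed(0)

N_DEVICES, N_SAMPLES, K = 8, 2000, 3

# Monte Carlo samples of each device's offloading delay (ms); device i
# is drawn uniformly from [5(i+1), 5(i+1)+40], so lower indices are
# stochastically faster. All numbers here are invented for illustration.
samples = [[random.uniform(5 * (i + 1), 5 * (i + 1) + 40)
            for _ in range(N_SAMPLES)]
           for i in range(N_DEVICES)]

def expected_best_delay(probe_set):
    """Expected delay when offloading to the fastest probed device."""
    if not probe_set:
        return float("inf")
    total = 0.0
    for s in range(N_SAMPLES):
        total += min(samples[i][s] for i in probe_set)
    return total / N_SAMPLES

# Greedy: repeatedly add the device whose inclusion most reduces the
# expected best-case delay, until the probing budget K is spent.
chosen = set()
for _ in range(K):
    best = min((i for i in range(N_DEVICES) if i not in chosen),
               key=lambda i: expected_best_delay(chosen | {i}))
    chosen.add(best)

print(sorted(chosen))
```

The online CD-OSP algorithm then wraps a selection step like this in a bandit loop, re-estimating the delay distributions and restarting estimates when a change point is detected.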
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 11, pp. 2247–2263.
Citations: 0
Cannikin: No Lagger of SLO in Concurrent Multiple LoRA LLM Serving
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-17 | DOI: 10.1109/TPDS.2025.3590014
Ruidong Zhu;Ziyue Jiang;Zhi Zhang;Xin Liu;Xuanzhe Liu;Xin Jin
Low-rank adaptation (LoRA) is widely used to efficiently fine-tune large language models (LLMs), leading to multiple models fine-tuned from the same pre-trained LLM. State-of-the-art LLM serving systems colocate these LoRA models on the same GPU instances for concurrent serving, which decreases memory usage and boosts efficiency. However, unawareness of the SLO requirements of each LoRA service, together with interference between requests from different LoRA services, can cause significant SLO violations. This paper presents Cannikin, a multi-LoRA inference serving system that optimizes the minimum of the SLO attainments of all LoRA services in the serving system, denoted as lagger-SLO attainment. We obtain insights from the characterization of a real-world multi-LoRA serving trace, which reveals the stable input/output lengths of the most popular LoRA services. This motivates Cannikin to propose an SLO-aware scheduling algorithm that prioritizes requests based on efficient deadline estimation. Cannikin further detects the influence of interference between different LoRA services on SLO violations and eliminates the bias between these services. The evaluation using real-world traces demonstrates that, compared to state-of-the-art multi-LoRA serving systems, Cannikin can handle up to 3.6× higher request rates or 2.8× more burstiness while maintaining the SLO attainment of each LoRA service above 90%.
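Deadline-based prioritization of the kind the abstract describes can be sketched with a heap. This is an assumed form, not Cannikin's actual algorithm: each request's effective deadline is its arrival time plus its service's SLO minus a length-based latency estimate (the trace's stable output lengths are what make such estimates usable), and requests are served earliest-deadline-first. The per-token cost and request fields are invented for the example.

```python
import heapq

PER_TOKEN_MS = 2.0  # assumed decode cost per output token

def deadline(req):
    """Latest time we can start this request and still meet its SLO."""
    est_latency = PER_TOKEN_MS * req["est_output_tokens"]
    return req["arrival_ms"] + req["slo_ms"] - est_latency

# Hypothetical requests from three different LoRA services.
requests = [
    {"id": "chat-1", "arrival_ms": 0,  "slo_ms": 500, "est_output_tokens": 200},
    {"id": "code-1", "arrival_ms": 5,  "slo_ms": 300, "est_output_tokens": 50},
    {"id": "summ-1", "arrival_ms": 10, "slo_ms": 800, "est_output_tokens": 300},
]

# Earliest-deadline-first: the tightest request is popped first, so no
# single LoRA service becomes the "lagger" dragging down the minimum
# SLO attainment.
heap = [(deadline(r), r["id"]) for r in requests]
heapq.heapify(heap)
order = [heapq.heappop(heap)[1] for _ in range(len(heap))]
print(order)
```

Note that chat-1 jumps ahead of code-1 despite a looser SLO, because its longer estimated output leaves it less slack.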
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1972–1984.
Citations: 0
Accelerating Half-Precision Seismic Simulation on Neural Processing Unit
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-15 | DOI: 10.1109/TPDS.2025.3584773
Yinuo Wang;Zeyu Song;Wubing Wan;Xinpeng Zhao;Lin Gan;Ping Gao;Wenqiang Wang;Zhenguo Zhang;Haohuan Fu;Wei Xue;Guangwen Yang
Due to its superiority in handling irregular regions of interest, the curvilinear grid finite difference method (CGFDM) has become widely used in seismic simulation for earthquake hazard evaluation and for understanding earthquake physics. This paper proposes a novel approach that optimizes a CGFDM solver on the Ascend, a cutting-edge Neural Processing Unit (NPU), using half-precision storage and mixed-precision arithmetic. The approach increases data throughput and computing efficiency, enabling more effective seismic modeling. Furthermore, we propose an efficient matrix-unit-enabled 3D difference algorithm that employs the NPU's matrix units to accelerate the computation. By fully exploiting the capability of the matrix units and wide SIMD lanes, our solver on the Ascend achieves a speedup of 4.19× over a parallel solver on two AMD CPUs and has successfully simulated the real-world Wenchuan earthquake. To the best of our knowledge, we are the first to conduct seismic simulations on an NPU.
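The half-precision-storage, mixed-precision-compute idea can be shown on a toy 1D wave equation (the paper's solver is a 3D curvilinear-grid scheme on the Ascend; this NumPy sketch only illustrates the precision split): wavefields live in float16 to halve memory traffic, while each finite-difference update is accumulated in float32 to limit rounding error.

```python
import numpy as np

n, steps, c = 256, 50, 0.4   # grid points, time steps, CFL number < 1
u_prev = np.zeros(n, dtype=np.float16)  # wavefields stored in fp16
u = np.zeros(n, dtype=np.float16)
u[n // 2] = 1.0                         # point source in the middle

for _ in range(steps):
    # Promote to fp32 for the arithmetic (mixed precision).
    u32 = u.astype(np.float32)
    lap = np.zeros_like(u32)
    lap[1:-1] = u32[2:] - 2.0 * u32[1:-1] + u32[:-2]  # second difference
    # Standard leapfrog update of the 1D wave equation.
    u_next32 = 2.0 * u32 - u_prev.astype(np.float32) + (c ** 2) * lap
    # Demote back to fp16 for storage, halving memory traffic.
    u_prev, u = u, u_next32.astype(np.float16)

print(float(np.abs(u.astype(np.float32)).max()))
```

On an accelerator, the fp16 arrays also double the effective bandwidth of every load and store, which is where most of the speedup in a memory-bound stencil comes from.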
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 10, pp. 1998–2013.
Citations: 0
High Performance OpenCL-Based GEMM Kernel Auto-Tuned by Bayesian Optimization
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-10 | DOI: 10.1109/TPDS.2025.3587673
Shengle Lin;Guoqing Xiao;Haotian Wang;Wangdong Yang;Kenli Li;Keqin Li
OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability. However, OpenCL-based math libraries still face challenges in fully leveraging device performance. When deploying high-performance arithmetic applications on these devices, the most important hot function is General Matrix-matrix Multiplication (GEMM). This study presents a meticulously optimized OpenCL GEMM kernel. Our enhanced GEMM kernel emphasizes two key improvements: 1) a three-level double buffer pipeline that efficiently overlaps data fetching with floating-point computations; 2) a fine-grained prefetching strategy of private memory to increase device occupancy by optimizing register unit utilization. Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning. Experimental results demonstrate considerable optimization improvement and performance advantages achieved on diverse OpenCL devices. Additionally, the BO tuner demonstrates superior efficiency and robustness, outperforming contemporary tuning methods.
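The tile sizes a GEMM auto-tuner searches over can be made tangible with a blocked matrix multiply. This pure-Python/NumPy version is for illustration only (a real OpenCL kernel stages tiles into local memory, and the paper's double-buffer pipeline overlaps the load of the next k-tile with the FLOPs on the current one); the tile sizes (tm, tn, tk) are exactly the kind of parameters a Bayesian-optimization tuner would explore.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)

def gemm_tiled(A, B, tm, tn, tk):
    """Blocked GEMM: C is accumulated tile by tile.

    In an OpenCL kernel, the next k-tile of A and B would be prefetched
    into local memory while the current tile is being multiplied -- the
    double-buffer pipeline the paper describes."""
    M, K = A.shape
    K2, N = B.shape
    assert K == K2
    C = np.zeros((M, N), dtype=A.dtype)
    for i0, j0, k0 in product(range(0, M, tm),
                              range(0, N, tn),
                              range(0, K, tk)):
        C[i0:i0 + tm, j0:j0 + tn] += (
            A[i0:i0 + tm, k0:k0 + tk] @ B[k0:k0 + tk, j0:j0 + tn]
        )
    return C

A = rng.standard_normal((64, 48)).astype(np.float32)
B = rng.standard_normal((48, 32)).astype(np.float32)
C = gemm_tiled(A, B, tm=16, tn=16, tk=16)
print(np.allclose(C, A @ B, atol=1e-4))
```

Any (tm, tn, tk) gives the same result; what differs on real hardware is occupancy and memory traffic, which is why the tuner's search space is worth exploring with a sample-efficient method like Bayesian optimization.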
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 9, pp. 1985–1997.
Citations: 0
Elastic Relaxation of Concurrent Data Structures
IF 6.0 | CAS Tier 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-07-10 | DOI: 10.1109/TPDS.2025.3587888
Kåre von Geijer;Philippas Tsigas
The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and consequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral, we design novel elastically relaxed, lock-free queues, stacks, a counter, and a deque, capable of reconfiguring relaxation during run-time. We establish linearizability and define worst-case bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs match the performance of state-of-the-art statically relaxed structures when no elastic changes are utilized. We develop a lightweight, contention-aware controller for adjusting relaxation in real time, and demonstrate its benefits both in a dynamic producer-consumer micro-benchmark and in a parallel BFS traversal, where it improves throughput and work-efficiency compared to static designs.
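A minimal sketch of the semantic-relaxation idea (illustrative only; the paper's Lateral structure and its lock-free designs are far more sophisticated): split a counter into `width` sub-counters so concurrent threads rarely contend on the same one, trading an exactly ordered view for bounded inaccuracy of intermediate reads. An "elastic" variant would adjust `width` at run time as contention changes.

```python
import threading

class RelaxedCounter:
    """Counter relaxed into per-slot sub-counters to reduce contention."""

    def __init__(self, width):
        self.width = width
        self.subs = [0] * width
        self.locks = [threading.Lock() for _ in range(width)]

    def increment(self, tid):
        i = tid % self.width  # threads spread across sub-counters
        with self.locks[i]:
            self.subs[i] += 1

    def read(self):
        # A racy sum: it may miss in-flight increments (the relaxation),
        # but once all threads quiesce the total is exact.
        return sum(self.subs)

counter = RelaxedCounter(width=4)

def worker(tid, n):
    for _ in range(n):
        counter.increment(tid)

threads = [threading.Thread(target=worker, args=(t, 1000)) for t in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter.read())
```

Python's GIL hides most of the benefit here; on real hardware the win is that eight threads hit four cache lines instead of one, which is precisely the contention the abstract says sequential semantics force.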
IEEE Transactions on Parallel and Distributed Systems, vol. 36, no. 12, pp. 2578–2595. Open-access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11077833
Citations: 0
m$^{2}$LLM: A Multi-Dimensional Optimization Framework for LLM Inference on Mobile Devices
IF 6 2区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-07-10 DOI: 10.1109/TPDS.2025.3587445
Kaiyuan Liu;Xiaobo Zhou;Li Li
Large Language Models (LLMs) are reshaping mobile AI. Directly deploying LLMs on mobile devices is an emerging paradigm that can widely support different mobile applications while preserving data privacy. However, intensive memory footprint, long inference latency and high energy consumption severely bottlenecks on-device inference of LLM in real-world scenarios. In response to these challenges, this work introduces m$^{2}$LLM, an innovative framework that performs joint optimization from multiple dimensions for on-device LLM inference in order to strike a balance among performance, realtimeliness and energy efficiency. Specifically, m$^{2}$LLM features the following four core components including : 1) Hardware-aware Model Customization, 2) Elastic Chunk-wise Pipeline, 3) Latency-guided Prompt Compression and 4) Layer-wise Resource Scheduling. These four components interact with each other in order to guide the inference process from the following three dimensions. At the model level, m$^{2}$LLM designs an elastic chunk-wise pipeline to expand device memory and customize the model according to the hardware configuration, maximizing performance within the memory budget. At the prompt level, facing the stochastic input, m$^{2}$LLM judiciously compresses the prompts in order to guarantee the first token can be generated in time while maintaining the semantic information. Additionally, at the system level, the layer-wise resource scheduler is employed in order to complete the token generation process with minimized energy consumption while guaranteeing the realtimeness in the highly dynamic mobile environment. m$^{2}$LLM is evaluated on off-the-shelf smartphone with represented models and datasets. Compared to baseline methods, m$^{2}$LLM delivers 2.99–13.5× TTFT acceleration and 2.28–24.3× energy savings, with only a minimal model performance loss of 2% –7% .
大型语言模型(llm)正在重塑移动人工智能。直接在移动设备上部署llm是一种新兴的范例,它可以广泛支持不同的移动应用程序,同时保护数据隐私。然而,在现实场景中,大量的内存占用、较长的推理延迟和较高的能耗严重制约了LLM在设备上的推理。为了应对这些挑战,本工作引入了m$^{2}$LLM,这是一个创新的框架,可以从多个维度对设备上的LLM推理进行联合优化,以便在性能、实时性和能源效率之间取得平衡。具体来说,m$^{2}$LLM具有以下四个核心组件,包括:1)硬件感知模型定制,2)弹性块智能管道,3)延迟引导提示压缩和4)分层资源调度。这四个组件相互作用,以便从以下三个维度指导推理过程。在模型级别,m$^{2}$LLM设计了一个弹性块管道来扩展设备内存并根据硬件配置定制模型,在内存预算内最大化性能。在提示级别,面对随机输入,m$^{2}$LLM明智地压缩提示,以保证在保持语义信息的同时及时生成第一个令牌。此外,在系统级,为了在保证高动态移动环境中的实时性的同时,以最小的能耗完成令牌生成过程,采用了分层资源调度程序。m$^{2}$LLM在现成的智能手机上用表示的模型和数据集进行评估。与基线方法相比,m$^{2}$LLM提供2.99 - 13.5倍的TTFT加速和2.28 - 24.3倍的节能,而模型性能损失仅为2% -7%。
{"title":"m$^{2}$2LLM: A Multi-Dimensional Optimization Framework for LLM Inference on Mobile Devices","authors":"Kaiyuan Liu;Xiaobo Zhou;Li Li","doi":"10.1109/TPDS.2025.3587445","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3587445","url":null,"abstract":"Large Language Models (LLMs) are reshaping mobile AI. Directly deploying LLMs on mobile devices is an emerging paradigm that can widely support different mobile applications while preserving data privacy. However, intensive memory footprint, long inference latency and high energy consumption severely bottlenecks on-device inference of LLM in real-world scenarios. In response to these challenges, this work introduces m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM, an innovative framework that performs joint optimization from multiple dimensions for on-device LLM inference in order to strike a balance among performance, realtimeliness and energy efficiency. Specifically, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM features the following four core components including : 1) Hardware-aware Model Customization, 2) Elastic Chunk-wise Pipeline, 3) Latency-guided Prompt Compression and 4) Layer-wise Resource Scheduling. These four components interact with each other in order to guide the inference process from the following three dimensions. At the model level, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM designs an elastic chunk-wise pipeline to expand device memory and customize the model according to the hardware configuration, maximizing performance within the memory budget. At the prompt level, facing the stochastic input, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM judiciously compresses the prompts in order to guarantee the first token can be generated in time while maintaining the semantic information. 
Additionally, at the system level, the layer-wise resource scheduler is employed in order to complete the token generation process with minimized energy consumption while guaranteeing the realtimeness in the highly dynamic mobile environment. m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM is evaluated on off-the-shelf smartphone with represented models and datasets. Compared to baseline methods, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM delivers 2.99–13.5× TTFT acceleration and 2.28–24.3× energy savings, with only a minimal model performance loss of 2% –7% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2014-2029"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
SEMSO: A Secure and Efficient Multi-Data Source Blockchain Oracle SEMSO:一个安全高效的多数据源区块链Oracle
IF 6 2区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-07-10 DOI: 10.1109/TPDS.2025.3586450
Youquan Xian;Xueying Zeng;Chunpei Li;Peng Wang;Dongcheng Li;Peng Liu;Xianxian Li
In recent years, blockchain oracle, as the key link between blockchain and real-world data interaction, has greatly expanded the application scope of blockchain. In particular, the emergence of the Multi-Data Source (MDS) oracle has greatly improved the reliability of the oracle in the case of untrustworthy data sources. However, the current MDS oracle scheme requires nodes to obtain data redundantly from multiple data sources to guarantee data reliability, which greatly increases the resource overhead and response time of the system. Therefore, in this paper, we propose a Secure and Efficient Multi-data Source Oracle framework (SEMSO), where nodes only need to access one data source to ensure the reliability of final data. First, we design a new off-chain data aggregation protocol TBLS, to guarantee data source diversity and reliability at low cost. Second, according to the rational man assumption, the data source selection task of nodes is modeled and solved based on the Bayesian game under incomplete information to maximize the node’s revenue while improving the success rate of TBLS aggregation and system response speed. Security analysis verifies the reliability of the proposed scheme, and experiments show that under the same environmental assumptions, SEMSO takes into account data diversity while reducing the response time by 23.5%.
近年来,区块链oracle作为区块链与现实世界数据交互的关键纽带,极大地扩展了区块链的应用范围。特别是多数据源(Multi-Data Source, MDS) oracle的出现,极大地提高了oracle在不可信数据源情况下的可靠性。但是,目前的MDS oracle方案需要节点从多个数据源中冗余获取数据以保证数据的可靠性,这大大增加了系统的资源开销和响应时间。因此,在本文中,我们提出了一个安全高效的多数据源Oracle框架(SEMSO),其中节点只需要访问一个数据源,以确保最终数据的可靠性。首先,我们设计了一种新的脱链数据聚合协议TBLS,以低成本保证数据源的多样性和可靠性。其次,根据理性人假设,基于不完全信息下的贝叶斯博弈对节点的数据源选择任务进行建模和求解,在提高TBLS聚合成功率和系统响应速度的同时,实现节点收益最大化。安全性分析验证了所提方案的可靠性,实验表明,在相同的环境假设下,SEMSO在考虑数据多样性的同时将响应时间缩短了23.5%。
{"title":"SEMSO: A Secure and Efficient Multi-Data Source Blockchain Oracle","authors":"Youquan Xian;Xueying Zeng;Chunpei Li;Peng Wang;Dongcheng Li;Peng Liu;Xianxian Li","doi":"10.1109/TPDS.2025.3586450","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3586450","url":null,"abstract":"In recent years, blockchain oracle, as the key link between blockchain and real-world data interaction, has greatly expanded the application scope of blockchain. In particular, the emergence of the Multi-Data Source (MDS) oracle has greatly improved the reliability of the oracle in the case of untrustworthy data sources. However, the current MDS oracle scheme requires nodes to obtain data redundantly from multiple data sources to guarantee data reliability, which greatly increases the resource overhead and response time of the system. Therefore, in this paper, we propose a Secure and Efficient Multi-data Source Oracle framework (SEMSO), where nodes only need to access one data source to ensure the reliability of final data. First, we design a new off-chain data aggregation protocol TBLS, to guarantee data source diversity and reliability at low cost. Second, according to the rational man assumption, the data source selection task of nodes is modeled and solved based on the Bayesian game under incomplete information to maximize the node’s revenue while improving the success rate of TBLS aggregation and system response speed. 
Security analysis verifies the reliability of the proposed scheme, and experiments show that under the same environmental assumptions, SEMSO takes into account data diversity while reducing the response time by 23.5%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2512-2523"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
$mathsf{streamline}$: Accelerating Deployment and Assessment of Real-Time Big Data Systems $mathsf{streamlined}$:加速实时大数据系统的部署和评估
IF 6 2区 计算机科学 Q1 COMPUTER SCIENCE, THEORY & METHODS Pub Date : 2025-07-10 DOI: 10.1109/TPDS.2025.3587641
Md. Monzurul Amin Ifath;Tommaso Melodia;Israat Haque
Real-time stream processing applications (e.g., IoT data analytics and fraud detection) are becoming integral to everyday life. A robust and efficient Big Data system, especially a streaming pipeline composed of producers, brokers, and consumers, is at the heart of the successful deployment of these applications. However, their deployment and assessment can be complex and costly due to the intricate interactions between pipeline components and the reliance on expensive hardware or cloud environments. Thus, we propose $mathsf{streamline}$streamline, an agile, efficient, and dependable framework as an alternative to assess streaming applications without requiring a hardware testbed or cloud setup. To simplify the deployment, prototyping, and benchmarking of end-to-end stream processing applications involving distributed platforms (e.g., Apache Kafka, Spark, Flink), the framework provides a lightweight environment with a developer-friendly, high-level API for dynamically selecting and configuring pipeline components. Moreover, the modular architecture of $mathsf{streamline}$streamline enables developers to integrate any required platform into their systems. The performance and robustness of a deployed pipeline can be assessed with varying network conditions and injected faults. Furthermore, it facilitates benchmarking event streaming platforms like Apache Kafka and RabbitMQ. Extensive evaluations of various streaming applications confirm the effectiveness and dependability of $mathsf{streamline}$streamline.
实时流处理应用程序(例如,物联网数据分析和欺诈检测)正在成为日常生活中不可或缺的一部分。一个强大而高效的大数据系统,特别是由生产者、代理和消费者组成的流管道,是成功部署这些应用程序的核心。然而,由于管道组件之间复杂的交互以及对昂贵的硬件或云环境的依赖,它们的部署和评估可能是复杂和昂贵的。因此,我们提出$mathsf{streamlined}$ streamlined,这是一个敏捷、高效和可靠的框架,可以作为评估流应用程序的替代方案,而无需硬件测试平台或云设置。为了简化涉及分布式平台(如Apache Kafka、Spark、Flink)的端到端流处理应用程序的部署、原型设计和基准测试,该框架提供了一个轻量级环境,其中包含一个对开发人员友好的高级API,用于动态选择和配置管道组件。此外,$mathsf{streamlined}$ streamlined的模块化架构使开发人员能够将任何所需的平台集成到他们的系统中。部署管道的性能和鲁棒性可以通过不同的网络条件和注入的故障来评估。此外,它还有助于对Apache Kafka和RabbitMQ等事件流平台进行基准测试。对各种流应用程序的广泛评估证实了$mathsf{streamlined}$ streamlined的有效性和可靠性。
{"title":"$mathsf{streamline}$: Accelerating Deployment and Assessment of Real-Time Big Data Systems","authors":"Md. Monzurul Amin Ifath;Tommaso Melodia;Israat Haque","doi":"10.1109/TPDS.2025.3587641","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3587641","url":null,"abstract":"Real-time stream processing applications (e.g., IoT data analytics and fraud detection) are becoming integral to everyday life. A robust and efficient Big Data system, especially a streaming pipeline composed of producers, brokers, and consumers, is at the heart of the successful deployment of these applications. However, their deployment and assessment can be complex and costly due to the intricate interactions between pipeline components and the reliance on expensive hardware or cloud environments. Thus, we propose <italic><inline-formula><tex-math>$mathsf{streamline}$</tex-math><alternatives><mml:math><mml:mi>streamline</mml:mi></mml:math><inline-graphic></alternatives></inline-formula></i>, an agile, efficient, and dependable framework as an alternative to assess streaming applications without requiring a hardware testbed or cloud setup. To simplify the deployment, prototyping, and benchmarking of end-to-end stream processing applications involving distributed platforms (e.g., Apache Kafka, Spark, Flink), the framework provides a lightweight environment with a developer-friendly, high-level API for dynamically selecting and configuring pipeline components. Moreover, the modular architecture of <italic><inline-formula><tex-math>$mathsf{streamline}$</tex-math><alternatives><mml:math><mml:mi>streamline</mml:mi></mml:math><inline-graphic></alternatives></inline-formula></i> enables developers to integrate any required platform into their systems. The performance and robustness of a deployed pipeline can be assessed with varying network conditions and injected faults. Furthermore, it facilitates benchmarking event streaming platforms like Apache Kafka and RabbitMQ. 
Extensive evaluations of various streaming applications confirm the effectiveness and dependability of <italic><inline-formula><tex-math>$mathsf{streamline}$</tex-math><alternatives><mml:math><mml:mi>streamline</mml:mi></mml:math><inline-graphic></alternatives></inline-formula></i>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2455-2468"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 0
期刊
IEEE Transactions on Parallel and Distributed Systems
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1