
Latest Publications in IEEE Transactions on Parallel and Distributed Systems

Autonomous Model Aggregation for Decentralized Learning on Edge Devices
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-14 | DOI: 10.1109/TPDS.2025.3621058
Jinru Chen;Jingke Tu;Lei Yang;Jiannong Cao
Edge AI applications enable edge devices to collaboratively learn a model via repeated model aggregations, aiming to utilize the distributed data on the devices for achieving high model accuracy. Existing methods either leverage a centralized server to directly aggregate the model updates from edge devices or need a central coordinator to group the edge devices for localized model aggregations. The centralized server (or coordinator) has a performance bottleneck and a high cost of collecting the global state needed for making the grouping decision in large-scale networks. In this paper, we propose an Autonomous Model Aggregation (AMA) method for large-scale decentralized learning on edge devices. Instead of needing a central coordinator to group the edge devices, AMA allows the edge devices to autonomously form groups using a highly efficient protocol, according to model functional similarity and historical grouping information. Moreover, AMA adopts a reinforcement learning approach to optimize the size of each group. Evaluation results on our self-developed edge computing testbed demonstrate that AMA outperforms the benchmark approaches by up to 20.71% in accuracy and reduces the convergence time by 75.58%.
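As one way to make "model functional similarity" and autonomous grouping concrete, the sketch below (a hypothetical construction, not the AMA protocol) compares models by the cosine similarity of their outputs on a shared probe batch and groups devices greedily; the probe batch, the 0.9 threshold, the function names, and the greedy policy are assumptions, and the reinforcement-learning group-size control is not shown.

```python
# Illustrative grouping by model functional similarity (not the AMA protocol).
import numpy as np

def functional_similarity(outputs_a: np.ndarray, outputs_b: np.ndarray) -> float:
    """Cosine similarity between two models' flattened outputs on the same probe batch."""
    a, b = outputs_a.ravel(), outputs_b.ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def greedy_group(outputs: list[np.ndarray], threshold: float = 0.9) -> list[list[int]]:
    """Greedily place each device into the first group whose representative is similar enough."""
    groups: list[list[int]] = []
    for i, out in enumerate(outputs):
        for g in groups:
            if functional_similarity(out, outputs[g[0]]) >= threshold:
                g.append(i)
                break
        else:
            groups.append([i])
    return groups

# Example: 4 devices evaluated on a probe batch of 8 samples x 3 classes.
rng = np.random.default_rng(0)
base = rng.normal(size=(8, 3))
outs = [base + 0.01 * rng.normal(size=base.shape) for _ in range(2)] + \
       [-base + 0.01 * rng.normal(size=base.shape) for _ in range(2)]
print(greedy_group(outs))  # two groups of functionally similar devices, e.g. [[0, 1], [2, 3]]
```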
Citations: 0
FEditor: Consecutive Task Placement With Adjustable Shapes Using FPGA State Frames
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-13 | DOI: 10.1109/TPDS.2025.3620384
Yanyan Li;Yu Chen;Zhiqian Xu;Yawen Wang;Hai Jiang;Keqin Li
Field Programmable Gate Arrays (FPGAs) are widely adopted in datacenters, where each FPGA is exclusively assigned to a task. This strategy results in significant resource waste and increased task rejections. To address this issue, placement algorithms adjust the locations and shapes of tasks based on Dynamic Partial Reconfiguration, which partitions an FPGA into multiple rectangular areas for sharing. However, existing schemes are designed for static task sets without adjustable shapes, incapable of optimizing the placement problem in datacenters. In this paper, FEditor is proposed as the first consecutive task placement scheme with adjustable shapes. It expands the planar FPGA models into three-dimensional ones with timestamps to accommodate consecutive tasks. To reduce the complexity of three-dimensional resource management, State Frames (SFs) are designed to compress the models losslessly. Three metrics and a nested heuristic algorithm are used for task placement. Experimental results demonstrate that FEditor has improved resource utilization by at least 19.8% and acceptance rate by at least 10% compared to the referenced algorithms. SFs and the nested algorithm accelerate the task placement by up to 10.26×. The suitability of FEditor in datacenter environments is verified by its time efficiency trends.
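To illustrate the three-dimensional (space plus time) resource model the abstract describes, the sketch below keeps a boolean occupancy cube indexed by time slot and rectangular region and checks candidate placements against it; the region granularity, class name, and example sizes are assumptions, and FEditor's State Frames, metrics, and nested heuristic are not shown.

```python
# A minimal space-time occupancy model for task placement (assumed design, not FEditor's).
import numpy as np

class FpgaTimeModel:
    def __init__(self, cols: int, rows: int, slots: int):
        # occupancy[t, r, c] == True means region (r, c) is busy during time slot t
        self.occupancy = np.zeros((slots, rows, cols), dtype=bool)

    def can_place(self, t0: int, r0: int, c0: int, dur: int, h: int, w: int) -> bool:
        block = self.occupancy[t0:t0 + dur, r0:r0 + h, c0:c0 + w]
        return block.shape == (dur, h, w) and not block.any()

    def place(self, t0: int, r0: int, c0: int, dur: int, h: int, w: int) -> bool:
        if self.can_place(t0, r0, c0, dur, h, w):
            self.occupancy[t0:t0 + dur, r0:r0 + h, c0:c0 + w] = True
            return True
        return False

fpga = FpgaTimeModel(cols=8, rows=4, slots=10)
print(fpga.place(t0=0, r0=0, c0=0, dur=5, h=2, w=3))  # True: the 3-D block is free
print(fpga.place(t0=2, r0=1, c0=2, dur=3, h=2, w=2))  # False: overlaps the first task in space and time
```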
Citations: 0
Chorus: Robust Multitasking Local Client-Server Collaborative Inference With Wi-Fi 6 for AIoT Against Stochastic Congestion Delay
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-09 | DOI: 10.1109/TPDS.2025.3619775
Yuzhe Luo;Ji Qi;Ling Li;Ruizhi Chen;Xiaoyu Wu;Limin Cheng;Chen Zhao
The rapid growth of AIoT devices brings huge demands for DNNs deployed on resource-constrained devices. However, the intensive computation and high memory footprint of DNN inference make it difficult for the AIoT devices to execute the inference tasks efficiently. In many widely deployed AIoT use cases, multiple local AIoT devices launch DNN inference tasks randomly. Although local collaborative inference has been proposed to accelerate DNN inference on local devices with limited resources, multitasking local collaborative inference, which is common in AIoT scenarios, has not been fully studied in previous works. We consider multitasking local client-server collaborative inference (MLCCI), which achieves efficient DNN inference by offloading the inference tasks from multiple AIoT devices to a more powerful local server with parallel pipelined execution streams through Wi-Fi 6. Our optimization goal is to minimize the mean end-to-end latency of MLCCI. Based on the experiment results, we identify three key challenges: high communication costs, high model initialization latency, and congestion delay brought by task interference. We analyze congestion delay in MLCCI and its stochastic fluctuations with queuing theory and propose Chorus, a high-performance adaptive MLCCI framework for AIoT devices, to minimize the mean end-to-end latency of MLCCI against stochastic congestion delay. Chorus generates communication-efficient model partitions with heuristic search, uses a prefetch-enabled two-level LRU cache to accelerate model initialization on the server, reduces congestion delay and its short-term fluctuations with execution stream allocation based on the cross-entropy method, and finally achieves efficient computation offloading with reinforcement learning. We built a system prototype of Chorus on real devices, which statistically simulates many virtual clients with a limited number of physical client devices to conduct performance evaluations. The evaluation results for various workload levels show that Chorus achieves an average of 1.4×, 1.3×, and 2× speedup over client-only inference, and server-only inference with LRU and MLSH, respectively.
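The abstract models congestion delay and its stochastic fluctuations with queuing theory but does not state the model. As a worked illustration only, an M/M/1 queue (an assumption, not necessarily the paper's choice) gives closed forms for the mean queueing delay and its spread:

```latex
% Illustrative M/M/1 view of congestion delay on a single execution stream
% (Poisson task arrivals at rate \lambda, exponential service at rate \mu):
\[
  W_q \;=\; \frac{\rho}{\mu - \lambda},
  \qquad \rho = \frac{\lambda}{\mu} < 1,
  \qquad
  \operatorname{Var}(W) \;=\; \frac{1}{(\mu - \lambda)^{2}},
\]
% so both the mean delay and its stochastic fluctuation grow without bound as the
% utilization \rho approaches 1, which is the regime that stream allocation tries to avoid.
```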
Citations: 0
Cache Partition Management for Improving Fairness and I/O Responsiveness in NVMe SSDs
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-09 | DOI: 10.1109/TPDS.2025.3619866
Jiaojiao Wu;Fan Yang;Zhibing Sha;Li Cai;Zhigang Cai;Balazs Gerofi;Yuanquan Shi;Jianwei Liao
NVMe SSDs have become mainstream storage devices thanks to their compact size and ultra-low latency. It has been observed that the impact of interference among all concurrently running streams (i.e., I/O workloads) on their overall responsiveness differs significantly, thus leading to unfairness. The intensity and access locality of streams are the primary factors contributing to interference. A small-sized data cache is commonly equipped in the front-end of SSDs to improve I/O performance and extend the device’s lifetime. The degree of parallelism at this level, however, is limited compared to that of the SSD back end, which consists of multiple channels, chips, and planes. Therefore, the impact of interference can be more significant at the data cache level. In this paper, we propose a cache division management scheme that not only contributes to fairness but also boosts I/O responsiveness across all workloads in NVMe SSDs. Specifically, our proposal supports long-term data cache partitioning and short-term cache adjustment with global sharing, ensuring better fairness and further enhancing cache utilization efficiency in multi-stream scenarios. Trace-driven simulation experiments show that our proposal improves fairness by an average of 66.0% and reduces overall I/O response time by between 3.8% and 18.0%, compared to existing cache management schemes for NVMe SSDs.
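A minimal sketch, under stated assumptions, of the two-level idea described in the abstract: a long-term per-stream partition of the device cache plus a short-term globally shared pool. Sizing partitions proportionally to request intensity, the 25% shared fraction, and the function name are illustrative choices, not the paper's algorithm.

```python
# Hypothetical cache split: per-stream partitions plus a shared pool (not the paper's scheme).
def partition_cache(total_slots: int, shared_fraction: float,
                    stream_intensity: dict[str, float]) -> dict[str, int]:
    """Split the cache into per-stream partitions proportional to intensity, plus a shared pool."""
    shared = int(total_slots * shared_fraction)
    dedicated = total_slots - shared
    total_intensity = sum(stream_intensity.values())
    sizes = {s: int(dedicated * w / total_intensity) for s, w in stream_intensity.items()}
    sizes["__shared__"] = total_slots - sum(sizes.values())  # rounding remainder joins the shared pool
    return sizes

# Example: three streams with different request rates competing for 1024 cache slots.
print(partition_cache(1024, shared_fraction=0.25,
                      stream_intensity={"stream_a": 300.0, "stream_b": 100.0, "stream_c": 50.0}))
```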
Citations: 0
Popularity-Aware Data Placement in Erasure Coding-Based Edge Storage Systems
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-08 | DOI: 10.1109/TPDS.2025.3619273
Ruikun Luo;Jiadong Zhao;Qiang He;Feifei Chen;Song Wu;Hai Jin;Yun Yang
Edge computing enables low-latency data access by caching popular content on edge servers. However, server unavailability at runtime can increase retrieval latency when requests are redirected to the cloud. To enhance availability, erasure coding (EC) has been employed to ensure full data access for all users in an edge storage system (ESS). Existing approaches for edge data placement place coded blocks across the entire system without considering data popularity. As a result, they often suffer from high data retrieval latency. In addition, they are designed to process data items individually. Data placed earlier will limit the placement options for subsequent files because edge servers with the most neighbors in the system can be easily exhausted. Some files cannot be placed properly to accommodate user demands. This increases users’ data retrieval latency further. This paper investigates the edge data placement (EDP) problem with popularity awareness. We formulate EDP as a mixed-integer programming problem and prove its NP-hardness. We then design an exact algorithm (EDP-O) that decomposes the problem into three convex subproblems and solves it iteratively, and an approximation algorithm (EDP-A) with a guaranteed ln N approximation ratio for large-scale systems. Experiments on real-world datasets show that EDP-O and EDP-A reduce average retrieval latency by 18.4% and 15.6% in small-scale settings, while EDP-A achieves 54.7% latency reduction and 34.9% lower discard rate in large-scale scenarios compared to four baselines.
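The abstract formulates EDP as a mixed-integer program without stating it here. The block below is a hedged, generic sketch of how such an erasure-coded placement problem is often written; all symbols (x, y, d, p, n, k) are introduced for illustration, the objective is simplified, and the paper's actual formulation may differ.

```latex
% Generic placement MIP sketch for one data item coded into n blocks with an (n, k) code:
% x_{ij} = 1 if block j is placed on edge server i, y_{ui} = 1 if user u fetches a block
% from server i, d_{ui} the latency between user u and server i, p_u the request popularity.
\[
  \min_{x,\,y}\; \sum_{u} p_u \sum_{i} d_{ui}\, y_{ui}
  \quad \text{s.t.} \quad
  \sum_{i} x_{ij} = 1 \;\; \forall j, \qquad
  \sum_{j} x_{ij} \le 1 \;\; \forall i, \qquad
  \sum_{i} y_{ui} = k \;\; \forall u, \qquad
  y_{ui} \le \sum_{j} x_{ij} \;\; \forall u, i,
\]
% with x_{ij}, y_{ui} \in \{0, 1\}. Summing the k fetch latencies is a simplifying proxy
% for the parallel-retrieval latency; popularity enters only through the weights p_u.
```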
Citations: 0
SSpMM: Efficiently Scalable SpMM Kernels Across Multiple Generations of Tensor Cores
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-01 | DOI: 10.1109/TPDS.2025.3616981
Zeyu Xue;Mei Wen;Jianchao Yang;Minjin Tang;Zhongdi Luo;Jing Feng;Yang Shi;Zhaoyun Chen;Junzhong Shen;Johannes Langguth
Sparse-Dense Matrix-Matrix Multiplication (SpMM) has emerged as a foundational primitive in HPC and AI. Recent advancements have aimed to accelerate SpMM by harnessing the powerful Tensor Cores found in modern GPUs. However, despite these efforts, existing methods frequently encounter performance degradation when ported across different Tensor Core architectures. Recognizing that scalable SpMM across multiple generations of Tensor Cores relies on the effective use of general-purpose instructions, we have meticulously developed a SpMM library named SSpMM. However, a significant conflict exists between granularity and performance in current Tensor Core instructions. To resolve this, we introduce the innovative Transpose Mapping Scheme, which elegantly implements fine-grained kernels using coarse-grained instructions. Additionally, we propose the Register Shuffle Method to further enhance performance. Finally, we introduce Sparse Vector Compression, a technique that ensures our kernels are scalable with both structured and unstructured sparsity. Our experimental results, conducted on four generations of Tensor Core GPUs using over 3,000 sparse matrices from well-established matrix collections, demonstrate that SSpMM achieves an average speedup of 2.04 ×, 2.81 ×, 2.07 ×, and 1.87 ×, respectively, over the state-of-the-art SpMM solution. Furthermore, we have integrated SSpMM into PyTorch, achieving a 1.81 × speedup in end-to-end Transformer inference compared to cuDNN.
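For readers unfamiliar with the primitive, the sketch below is a plain CSR reference implementation of SpMM (C = A · B with sparse A and dense B). It only illustrates the computation being accelerated; it is unrelated to SSpMM's Tensor Core kernels, and the matrix values are made up for the example.

```python
# Reference (unoptimized) SpMM over a CSR-format sparse matrix.
import numpy as np

def spmm_csr(indptr, indices, data, B):
    """Compute C = A @ B where sparse A is given in CSR form (indptr, indices, data)."""
    n_rows = len(indptr) - 1
    C = np.zeros((n_rows, B.shape[1]), dtype=B.dtype)
    for i in range(n_rows):
        for k in range(indptr[i], indptr[i + 1]):   # iterate over row i's non-zeros
            C[i] += data[k] * B[indices[k]]          # accumulate the scaled row of B
    return C

# A 3x3 sparse matrix [[1,0,2],[0,3,0],[0,0,4]] times a 3x2 dense matrix.
indptr = np.array([0, 2, 3, 4])
indices = np.array([0, 2, 1, 2])
data = np.array([1.0, 2.0, 3.0, 4.0])
B = np.arange(6, dtype=float).reshape(3, 2)
print(spmm_csr(indptr, indices, data, B))            # [[ 8. 11.] [ 6.  9.] [16. 20.]]
```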
Citations: 0
GAP-DCCS: A Generic Acceleration Paradigm for Data-Intensive Applications With Efficient Data Compression and Caching Strategy Over CPU-GPU Clusters
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-09-29 | DOI: 10.1109/TPDS.2025.3615283
Jiangwei Xiao;Yingzhe Bai;Hanfei Diao;Guofeng Liu;Yuzhu Wang
Seismic exploration is a geophysical method used for imaging subsurface structures, capable of providing high-resolution images of the underground. In seismic data processing, Kirchhoff Pre-Stack Depth Migration (KPSDM) serves as one of the key techniques, playing a critical role in significantly enhancing the lateral resolution of imaging and providing accurate characterization of subsurface media. However, with the continuous growth in high-density seismic data volumes, the computational efficiency of KPSDM is primarily constrained by substantial computational loads, end-to-end I/O bottlenecks, and data storage pressures. To address the performance optimization challenges of computation-intensive applications that require frequent large-scale data transfers between the host and accelerator devices, this paper proposes GAP-DCCS, a GPU-based Generic Acceleration Paradigm with efficient Data Compression and Caching Strategy, which includes the following core strategies: (1) For compute-intensive modules, a GPU-based three-dimensional parallel acceleration is implemented, combined with memory access optimization techniques and overlapping strategies for data transfer and computation, to improve GPU resource utilization; (2) To alleviate the storage pressure of large-scale datasets, the BitComp compression algorithm is introduced to efficiently compress task data while maintaining output stability, significantly reducing storage requirements and end-to-end data transfer volume; (3) To tackle the I/O bottleneck caused by frequent large-scale data transfers between the host and devices, an adaptive dynamic caching data management mechanism is designed to effectively increase data reuse rates and markedly reduce end-to-end transfer frequency. Experimental results demonstrate that the proposed optimization method significantly enhances the computational performance of KPSDM, achieving a speedup of 123.51× on a single NVIDIA Tesla A800 GPU compared to a 16-core CPU. This optimization paradigm has not only been effectively validated in KPSDM but also offers a referable high-performance computing solution for other large-scale data processing tasks.
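As a rough host-side sketch of two ideas in the abstract, compressing task data to cut transfer volume and caching it for reuse, the snippet below combines zlib compression with an LRU map. zlib stands in for BitComp and the fixed-capacity LRU for the paper's adaptive caching mechanism, so treat every name and policy here as an assumption rather than GAP-DCCS itself.

```python
# Hypothetical compressed block cache (zlib + LRU), not the GAP-DCCS implementation.
import zlib
from collections import OrderedDict

class CompressedBlockCache:
    def __init__(self, capacity: int):
        self.capacity = capacity
        self.blocks: OrderedDict[str, bytes] = OrderedDict()   # key -> compressed payload

    def put(self, key: str, raw: bytes) -> None:
        self.blocks[key] = zlib.compress(raw)       # store compressed to reduce memory and transfer volume
        self.blocks.move_to_end(key)
        if len(self.blocks) > self.capacity:        # evict the least recently used block
            self.blocks.popitem(last=False)

    def get(self, key: str) -> bytes | None:
        if key not in self.blocks:
            return None                             # miss: the caller must reload and transfer the block
        self.blocks.move_to_end(key)
        return zlib.decompress(self.blocks[key])

cache = CompressedBlockCache(capacity=2)
cache.put("trace_0", b"\x00" * 4096)                # highly compressible, seismic-trace-like block
print(cache.get("trace_0") == b"\x00" * 4096)        # True: a hit avoids a second host-device transfer
```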
Citations: 0
The Art of the Fugue: Minimizing Interleaving in Collaborative Text Editing
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-09-26 | DOI: 10.1109/TPDS.2025.3611880
Matthew Weidner;Martin Kleppmann
Most existing algorithms for replicated lists, which are widely used in collaborative text editors, suffer from a problem: when two users concurrently insert text at the same position in the document, the merged outcome may interleave the inserted text passages, resulting in corrupted and potentially unreadable text. The problem has gone unnoticed for decades, and it affects both CRDTs and Operational Transformation. This paper defines maximal non-interleaving, our new correctness property for replicated lists. We introduce two related CRDT algorithms, Fugue and FugueMax, and prove that FugueMax satisfies maximal non-interleaving. We also implement our algorithms and demonstrate that Fugue offers performance comparable to state-of-the-art CRDT libraries for text editing.
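The snippet below is a small illustration of the interleaving anomaly the abstract describes, not Fugue's algorithm: two users concurrently type different words at the same index, a naive merge keys each character by its locally computed index with ties broken by user id, and the two words end up interleaved character by character. The function name and key scheme are illustrative.

```python
# Naive index-based merge that exhibits the interleaving problem (illustration only).
def naive_positional_merge(base: str, index: int, word_a: str, word_b: str) -> str:
    keyed = [((i, 0), ch) for i, ch in enumerate(base)]
    keyed += [((index + k, 1), ch) for k, ch in enumerate(word_a)]   # user 1's characters
    keyed += [((index + k, 2), ch) for k, ch in enumerate(word_b)]   # user 2's characters
    return "".join(ch for _, ch in sorted(keyed, key=lambda kv: kv[0]))

# Both users insert at position 0 of an empty document: user 1 types "Alice", user 2 types "Bob".
print(naive_positional_merge("", 0, "Alice", "Bob"))
# -> 'ABloibce': the two words interleave character by character, exactly the corruption
#    that a maximally non-interleaving list CRDT is meant to rule out.
```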
Citations: 0
Atomic Smart Contract Interoperability With High Efficiency via Cross-Chain Integrated Execution
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-09-25 | DOI: 10.1109/TPDS.2025.3614374
Chaoyue Yin;Mingzhe Li;Jin Zhang;You Lin;Qingsong Wei;Siow Mong Rick Goh
With the development of Ethereum, numerous blockchains compatible with Ethereum’s execution environment (i.e., Ethereum Virtual Machine, EVM) have emerged. Developers can leverage smart contracts to run various complex decentralized applications on top of blockchains. However, the increasing number of EVM-compatible blockchains has introduced significant challenges in cross-chain interoperability, particularly in ensuring efficiency and atomicity for the whole cross-chain application. Existing solutions are either limited in guaranteeing overall atomicity for the cross-chain application, or inefficient due to the need for multiple rounds of cross-chain smart contract execution. To address this gap, we propose IntegrateX, an efficient cross-chain interoperability system that ensures the overall atomicity of cross-chain smart contract invocations. The core idea is to deploy the logic required for cross-chain execution onto a single blockchain, where it can be executed in an integrated manner. This allows cross-chain applications to perform all cross-chain logic efficiently within the same blockchain. IntegrateX consists of a cross-chain smart contract deployment protocol and a cross-chain smart contract integrated execution protocol. The former achieves efficient and secure cross-chain deployment by decoupling smart contract logic from state, and employing an off-chain cross-chain deployment mechanism combined with on-chain cross-chain verification. The latter ensures atomicity of cross-chain invocations through a 2PC-based mechanism, and enhances performance through transaction aggregation and fine-grained state lock. We implement a prototype of IntegrateX. Extensive experiments demonstrate that it reduces up to 61.2% latency compared to the state-of-the-art baseline while maintaining low gas consumption.
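The abstract attributes atomicity to a 2PC-based mechanism. The sketch below is a generic two-phase commit in plain Python, not IntegrateX's on-chain protocol; the Participant class, the will_succeed flag, and the chain names are illustrative assumptions that only show where the all-or-nothing guarantee comes from.

```python
# Generic two-phase commit: all participants must vote yes before any of them commits.
class Participant:
    def __init__(self, name: str, will_succeed: bool = True):
        self.name, self.will_succeed, self.committed = name, will_succeed, False

    def prepare(self) -> bool:
        return self.will_succeed            # e.g. lock state and validate the cross-chain call

    def commit(self) -> None:
        self.committed = True

    def abort(self) -> None:
        self.committed = False

def two_phase_commit(participants: list[Participant]) -> bool:
    if all(p.prepare() for p in participants):     # phase 1: collect prepare votes
        for p in participants:                     # phase 2: commit everywhere
            p.commit()
        return True
    for p in participants:                         # any negative vote aborts everywhere
        p.abort()
    return False

chains = [Participant("chain_A"), Participant("chain_B", will_succeed=False)]
print(two_phase_commit(chains), [c.committed for c in chains])    # False [False, False]
```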
Citations: 0
The Megapixel Approach for Efficient Execution of Irregular Wavefront Algorithms on GPUs
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-09-23 | DOI: 10.1109/TPDS.2025.3612696
Mathias Oliveira;Willian Barreiros;Renato Ferreira;Alba C. M. A. Melo;George Teodoro
Morphological operations are critical in high-resolution biomedical image processing. Their efficient execution relies on an irregular flood-filling strategy consolidated in the Irregular Wavefront Propagation Pattern (IWPP). IWPP was designed for GPUs and achieved significant gains compared to previous work. Here, however, we have revisited IWPP to identify the key limitations of its GPU implementation and proposed a novel more efficient strategy. In particular, the IWPP most demanding phase consists of tracking active pixels, those contributing to the output, that are the ones processed during the execution. This computational strategy leads to irregular memory access, divergent execution, and high storage (queue) management costs. To address these aspects, we have proposed the novel execution strategy called Irregular Wavefront Megapixel Propagation Pattern (IWMPP). IWMPP introduces a coarse-grained execution approach based on fixed-size square regions (instead of pixels in IWPP), referred to as megapixels (MPs). This design reduces the number of elements tracked and enables a regular processing within MPs that, in turn, improves thread divergence and memory accesses. IWMPP introduces optimizations, such as Duplicate Megapixel Removal (DMR) to avoid MPs recomputation and Tiled-Ordered (TO) execution that enforces a semistructured MPs execution sequence to improve data propagation efficiency. Experimental results using large tissue cancer images demonstrated that the IWMPP GPU attains significant gains over the state-of-the-art (IWPP). For morphological reconstruction, fill holes, and h-maxima operations, on the RTX 4090, the IWMPP GPU is up to 17.9×, 45.6×, and 14.9× faster than IWPP GPU, respectively, while at the same time reducing memory demands. IWMPP is an important step to enable quick processing of large imaging datasets.
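To make the wavefront pattern concrete, here is a small sequential, pixel-granularity sketch of queue-based morphological reconstruction in the spirit of IWPP (4-connectivity, CPU only). Function and variable names are illustrative, and IWMPP's megapixel-granularity tracking, DMR, and TO optimizations are not reproduced here.

```python
# Queue-based morphological reconstruction: only changed ("active") pixels keep propagating.
from collections import deque
import numpy as np

def morph_reconstruct(marker: np.ndarray, mask: np.ndarray) -> np.ndarray:
    out = np.minimum(marker, mask).astype(float)
    q = deque((r, c) for r in range(out.shape[0]) for c in range(out.shape[1]))
    while q:                                   # irregular wavefront: pop a pixel, try to raise neighbours
        r, c = q.popleft()
        for dr, dc in ((1, 0), (-1, 0), (0, 1), (0, -1)):
            nr, nc = r + dr, c + dc
            if 0 <= nr < out.shape[0] and 0 <= nc < out.shape[1]:
                val = min(out[r, c], mask[nr, nc])
                if val > out[nr, nc]:          # neighbour changed: it becomes active and is re-enqueued
                    out[nr, nc] = val
                    q.append((nr, nc))
    return out

mask = np.array([[3, 3, 0, 4], [3, 0, 0, 4], [3, 3, 0, 4]], dtype=float)
marker = np.zeros_like(mask); marker[0, 0] = 3
print(morph_reconstruct(marker, mask))
# The marker value 3 floods only the mask region connected to it;
# the disconnected right-hand column stays at 0.
```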
Citations: 0