Modern scientific experiments produce ever-increasing amounts of data, soon requiring ExaFLOP-scale computing capacities for analysis. Reaching such performance requires purpose-built supercomputers with $O(10^{3})$ nodes, each hosting multicore CPUs and multiple GPUs, and applications designed to exploit this hardware optimally. Given that each supercomputer is generally a one-off project, the need for computing frameworks portable across diverse CPU and GPU architectures without performance losses is increasingly compelling. We investigate the performance portability (PP) of a real-world application: the solver module of the AVU–GSR pipeline for the ESA Gaia mission. This code finds the astrometric parameters of $\sim 10^{8}$ stars in the Milky Way using the LSQR iterative algorithm. LSQR is widely used to solve linear systems of equations across a wide range of high-performance computing applications, elevating the study beyond its astrophysical relevance. The code is memory-bound, with six main compute kernels implementing sparse matrix-by-vector products. We optimize the previous CUDA implementation and port the code to six further GPU-acceleration frameworks: C++ PSTL, SYCL, OpenMP, HIP, KOKKOS, and OpenACC. We evaluate each framework's performance portability across multiple GPUs (NVIDIA and AMD) and problem sizes in terms of application and architectural efficiency. Architectural efficiency is estimated through the roofline model of the six most computationally expensive GPU kernels. Our results show that C++ library-based (C++ PSTL and KOKKOS), pragma-based (OpenMP and OpenACC), and language-specific (CUDA, HIP, and SYCL) frameworks achieve increasingly better performance portability across the supported platforms, with larger problem sizes providing better PP scores due to higher GPU occupancies.
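For reference, the performance-portability score reported in such studies is typically the harmonic-mean metric of Pennycook, Sewall, and Lee; the following is a sketch of that standard definition, under the assumption (not confirmed by the abstract) that this paper follows it, where $e_i(a,p)$ is the application or architectural efficiency of application $a$ solving problem $p$ on platform $i$:

$$\mathrm{PP}(a,p,H) = \begin{cases} \dfrac{|H|}{\sum_{i \in H} \frac{1}{e_i(a,p)}} & \text{if } a \text{ is supported on every platform } i \in H,\\ 0 & \text{otherwise,} \end{cases}$$

so the score collapses to zero if any platform in the set $H$ is unsupported and otherwise rewards uniformly high efficiency across all of them.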
{"title":"Performance Portability Assessment in Gaia","authors":"Giulio Malenza;Valentina Cesare;Marco Edoardo Santimaria;Robert Birke;Alberto Vecchiato;Ugo Becciani;Marco Aldinucci","doi":"10.1109/TPDS.2025.3591452","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3591452","url":null,"abstract":"Modern scientific experiments produce ever-increasing amounts of data, soon requiring ExaFLOPs computing capacities for analysis. Reaching such performance requires purpose-built supercomputers with <inline-formula><tex-math>$O(10^{3})$</tex-math></inline-formula> nodes, each hosting multicore CPUs and multiple GPUs, and applications designed to exploit this hardware optimally. Given that each supercomputer is generally a one-off project, the need for computing frameworks portable across diverse CPU and GPU architectures without performance losses is increasingly compelling. We investigate the performance portability (<inline-graphic>) of a real-world application: the solver module of the AVU–GSR pipeline for the ESA Gaia mission. This code finds the astrometric parameters of <inline-formula><tex-math>${sim} 10^{8}$</tex-math></inline-formula> stars in the Milky Way using the LSQR iterative algorithm. LSQR is widely used to solve linear systems of equations across a wide range of high-performance computing applications, elevating the study beyond its astrophysical relevance. The code is memory-bound, with six main compute kernels implementing sparse matrix-by-vector products. We optimize the previous CUDA implementation and port the code to further six GPU-acceleration frameworks: C++ PSTL, SYCL, OpenMP, HIP, KOKKOS, and OpenACC. We evaluate each framework’s performance portability across multiple GPUs (NVIDIA and AMD) and problem sizes in terms of application and architectural efficiency. Architectural efficiency is estimated through the roofline model of the six most computationally expensive GPU kernels. Our results show that C++ library-based (C++ PSTL and KOKKOS), pragma-based (OpenMP and OpenACC), and language-specific (CUDA, HIP, and SYCL) frameworks achieve increasingly better performance portability across the supported platforms with larger problem sizes providing better <inline-graphic> scores due to higher GPU occupancies.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2045-2057"},"PeriodicalIF":6.0,"publicationDate":"2025-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11090032","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Geo-distributed computing, a paradigm that assigns computational tasks to globally distributed nodes, has emerged as a promising approach in cloud computing, edge computing, cloud-edge computing, and supercomputing (SC). It enables low-latency services, ensures data locality, and handles large-scale applications. As global computing capacity and task demands increase rapidly, scheduling tasks for efficient execution in geo-distributed computing systems has become an increasingly critical research challenge. This challenge arises from the inherent characteristics of geographic distribution, including heterogeneous network conditions, region-specific resource pricing, and varying computational capabilities across locations. Researchers have developed diverse task scheduling methods tailored to geo-distributed scenarios, aiming to achieve objectives such as performance enhancement, fairness assurance, and fault-tolerance improvement. This survey provides a comprehensive and systematic review of task scheduling techniques across four major distributed computing environments, with an in-depth analysis of these approaches based on their core scheduling objectives. Through our analysis, we identify key research challenges and outline promising directions for advancing task scheduling in geo-distributed computing.
{"title":"Task Scheduling in Geo-Distributed Computing: A Survey","authors":"Yujian Wu;Shanjiang Tang;Ce Yu;Bin Yang;Chao Sun;Jian Xiao;Hutong Wu;Jinghua Feng","doi":"10.1109/TPDS.2025.3591010","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3591010","url":null,"abstract":"Geo-distributed computing, a paradigm that assigns computational tasks to globally distributed nodes, has emerged as a promising approach in cloud computing, edge computing, cloud-edge computing, and supercomputer computing (SC). It enables low-latency services, ensures data locality, and handles large-scale applications. As global computing capacity and task demands increase rapidly, scheduling tasks for efficient execution in geo-distributed computing systems has become an increasingly critical research challenge. It arises from the inherent characteristics of geographic distribution, including heterogeneous network conditions, region-specific resource pricing, and varying computational capabilities across locations. Researchers have developed diverse task scheduling methods tailored to geo-distributed scenarios, aiming to achieve objectives such as performance enhancement, fairness assurance, and fault-tolerance improvement. This survey provides a comprehensive and systematic review of task scheduling techniques across four major distributed computing environments, with an in-depth analysis of these approaches based on their core scheduling objectives. Through our analysis, we identify key research challenges and outline promising directions for advancing task scheduling in geo-distributed computing.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2073-2088"},"PeriodicalIF":6.0,"publicationDate":"2025-07-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144867934","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-18. DOI: 10.1109/TPDS.2025.3590368
Xishuo Li;Shan Zhang;Tie Ma;Zhiyuan Wang;Hongbin Luo
In decentralized edge computing environments, user devices need to perceive the status of neighboring devices, including computational availability and communication delays, to optimize task offloading decisions. However, probing the real-time status of all devices introduces significant overhead, and probing only a few devices can lead to suboptimal decision-making, given the massive connectivity and non-stationarity of edge networks. Aiming to balance the status probing cost and task offloading performance, we study the joint transmission and computation status probing problem, where the status and offloading delay on edge devices are characterized by general, bounded, and non-stationary distributions. The problem is proved to be NP-hard, even with known offloading delay distributions. To handle this case, we design an efficient offline method that guarantees a $(1-1/e)$ approximation ratio by leveraging the submodularity of the expected offloading delay function. Furthermore, for scenarios with unknown and non-stationary offloading delay distributions, we reformulate the problem using the piecewise-stationary combinatorial multi-armed bandit framework and develop a change-point detection-based online status probing (CD-OSP) algorithm. CD-OSP can promptly detect environmental changes and update probing strategies using the proposed offline method and estimated offloading delay distributions. We prove that CD-OSP achieves a regret of $\mathcal{O}(NV\sqrt{T\ln T})$, with $N$, $V$, and $T$ denoting the numbers of stationary periods, edge devices, and time slots, respectively. Extensive simulations and testbed experiments demonstrate that CD-OSP significantly outperforms state-of-the-art baselines, reducing the probing cost by up to 16.18× with a 2.14× increase in the offloading delay.
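The $(1-1/e)$ guarantee cited above is the classic bound for greedy maximization of a monotone submodular objective under a cardinality constraint. A minimal, self-contained sketch of that greedy step follows; the toy coverage objective and all names are illustrative stand-ins, not the paper's offline method.

    // Greedy selection of at most k probe targets maximizing a monotone submodular
    // objective (here: weighted coverage), which classically yields a (1 - 1/e)
    // approximation. Purely illustrative of the technique named in the abstract.
    #include <iostream>
    #include <set>
    #include <vector>

    // Toy objective: each candidate device "covers" a set of regions with weights.
    double coverage(const std::vector<std::set<int>>& covers,
                    const std::vector<double>& weight,
                    const std::vector<int>& chosen) {
        std::set<int> covered;
        for (int c : chosen) covered.insert(covers[c].begin(), covers[c].end());
        double v = 0.0;
        for (int r : covered) v += weight[r];
        return v;
    }

    int main() {
        std::vector<std::set<int>> covers = {{0, 1}, {1, 2}, {2, 3}, {0, 3}};
        std::vector<double> weight = {1.0, 2.0, 1.5, 0.5};
        const std::size_t k = 2;  // probing budget

        std::vector<int> chosen;
        std::set<int> remaining = {0, 1, 2, 3};
        while (chosen.size() < k && !remaining.empty()) {
            int best = -1;
            double bestGain = -1.0;
            double base = coverage(covers, weight, chosen);
            for (int c : remaining) {               // pick the largest marginal gain
                std::vector<int> trial = chosen;
                trial.push_back(c);
                double gain = coverage(covers, weight, trial) - base;
                if (gain > bestGain) { bestGain = gain; best = c; }
            }
            chosen.push_back(best);
            remaining.erase(best);
        }
        std::cout << "objective: " << coverage(covers, weight, chosen) << "\n";
    }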
{"title":"Doing More With Less: Balancing Probing Costs and Task Offloading Efficiency At the Network Edge","authors":"Xishuo Li;Shan Zhang;Tie Ma;Zhiyuan Wang;Hongbin Luo","doi":"10.1109/TPDS.2025.3590368","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3590368","url":null,"abstract":"In decentralized edge computing environments, user devices need to perceive the status of neighboring devices, including computational availability and communication delays, to optimize task offloading decisions. However, probing the real-time status of all devices introduces significant overhead, and probing only a few devices can lead to suboptimal decision-making, considering the massive connectivity and non-stationarity of edge networks. Aiming to balance the status probing cost and task offloading performance, we study the joint transmission and computation status probing problem, where the status and offloading delay on edge devices are characterized by general, bounded, and non-stationary distributions. The problem is proved to be NP-hard, even with known offloading delay distributions. To handle this case, we design an efficient offline method that guarantees a <inline-formula><tex-math>$(1-1/e)$</tex-math></inline-formula> approximation ratio via leveraging the submodularity of the expected offloading delay function. Furthermore, for scenarios with unknown and non-stationary offloading delay distributions, we reformulate the problem using the piecewise-stationary combinatorial multi-armed bandit framework and develop a change-point detection-based online status probing (CD-OSP) algorithm. CD-OSP can timely detect environmental changes and update probing strategies via using the proposed offline method and estimating offloading delay distributions. We prove that CD-OSP achieves a regret of <inline-formula><tex-math>$mathcal {O}(NVsqrt{Tln T})$</tex-math></inline-formula>, with <inline-formula><tex-math>$N$</tex-math></inline-formula>, <inline-formula><tex-math>$V$</tex-math></inline-formula>, and <inline-formula><tex-math>$T$</tex-math></inline-formula> denoting the numbers of stationary periods, edge devices, and time slots, respectively. Extensive simulations and testbed experiments demonstrate that CD-OSP significantly outperforms state-of-the-art baselines, which can reduce the probing cost by up to 16.18X with a 2.14X increase in the offloading delay.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 11","pages":"2247-2263"},"PeriodicalIF":6.0,"publicationDate":"2025-07-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145141745","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-17. DOI: 10.1109/TPDS.2025.3590014
Ruidong Zhu;Ziyue Jiang;Zhi Zhang;Xin Liu;Xuanzhe Liu;Xin Jin
Low-rank adaptation (LoRA) is widely used to efficiently fine-tune large language models (LLMs), leading to multiple models fine-tuned from the same pre-trained LLM. State-of-the-art LLM serving systems colocate these LoRA models on the same GPU instances for concurrent serving, which decreases memory usage and boosts efficiency. However, unawareness of the SLO requirements of each LoRA service and the interference between requests from different LoRA services can cause significant SLO violations. This paper presents Cannikin, a multi-LoRA inference serving system that optimizes the minimum of the SLO attainments of all LoRA services in the serving system, denoted as the lagger-SLO attainment. We obtain insights from the characterization of a real-world multi-LoRA serving trace, which reveals the stable input/output lengths of the most popular LoRA services. This motivates Cannikin to propose an SLO-aware scheduling algorithm that prioritizes requests based on efficient deadline estimation. Cannikin further detects the influence of interference between different LoRA services on SLO violations and eliminates the bias between these services. The evaluation using real-world traces demonstrates that, compared to state-of-the-art multi-LoRA serving systems, Cannikin can handle up to 3.6× higher request rates or 2.8× more burstiness while maintaining the SLO attainment of each LoRA service above 90%.
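To make the deadline-estimation idea concrete, the following is a minimal sketch of earliest-estimated-deadline-first ordering, where a request's deadline is derived from its service SLO and the historically stable expected output length. The field names and per-token cost model are assumptions for illustration, not Cannikin's actual scheduler.

    // Earliest-estimated-deadline-first ordering of multi-LoRA requests: the deadline
    // is estimated from the per-service SLO and the expected output length taken from
    // the service's stable length profile. Illustrative only; the cost model is assumed.
    #include <iostream>
    #include <queue>
    #include <vector>

    struct Request {
        int id;
        double arrival_ms;
        double slo_ms;               // per-service latency target
        double expected_out_tokens;  // from the service's stable length profile
    };

    double estimatedDeadline(const Request& r, double per_token_ms) {
        // Slack left after subtracting the predicted decode time.
        return r.arrival_ms + r.slo_ms - r.expected_out_tokens * per_token_ms;
    }

    int main() {
        const double per_token_ms = 12.0;  // assumed decode cost per output token
        auto later = [&](const Request& a, const Request& b) {
            return estimatedDeadline(a, per_token_ms) > estimatedDeadline(b, per_token_ms);
        };
        std::priority_queue<Request, std::vector<Request>, decltype(later)> q(later);

        q.push({1, 0.0, 500.0, 20.0});
        q.push({2, 10.0, 200.0, 8.0});
        q.push({3, 5.0, 800.0, 60.0});

        while (!q.empty()) {               // serve the tightest estimated deadline first
            std::cout << "schedule request " << q.top().id << "\n";
            q.pop();
        }
    }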
{"title":"Cannikin: No Lagger of SLO in Concurrent Multiple LoRA LLM Serving","authors":"Ruidong Zhu;Ziyue Jiang;Zhi Zhang;Xin Liu;Xuanzhe Liu;Xin Jin","doi":"10.1109/TPDS.2025.3590014","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3590014","url":null,"abstract":"Low-rank adaptation (LoRA) is widely used to efficiently fine-tune large language models (LLMs), leading to multiple models fine-tuned from the same pre-trained LLM. State-of-the-art LLM serving systems colocate these LoRA models on the same GPU instances for concurrent serving, which decreases memory usage and boosts efficiency. However, the unawareness of the SLO requirements of each LoRA service and the interference between requests from different LoRA services can cause significant SLO violations. This paper presents Cannikin, a multi-LoRA inference serving system that optimizes the minimum of the SLO attainments of all LoRA services in the serving system, denoted as lagger-SLO attainment. We obtain insights from the characterization of a real-world multi-LoRA serving trace, which reveals the stable input/output lengths of the most popular LoRA services. This motivates Cannikin to propose an SLO-aware scheduling algorithm that prioritizes requests based on efficient deadline estimation. Cannikin further detects the influence of interference between different LoRA services on SLO violations and eliminates the bias between these services. The evaluation using real-world traces demonstrates that compared to the state-of-the-art multi-LoRA serving systems, Cannikin can handle up to 3.6× higher rates or 2.8× more burstiness while maintaining the SLO attainment of each LoRA service <inline-formula><tex-math>$> $</tex-math></inline-formula> 90% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 9","pages":"1972-1984"},"PeriodicalIF":6.0,"publicationDate":"2025-07-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144751089","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Due to its superiority in handling irregular regions of interest, the curvilinear grid finite difference method (CGFDM) has become widely used in seismic simulation for earthquake hazard evaluation and understanding of earthquake physics. This paper proposes a novel approach that optimizes a CGFDM solver on the Ascend, a cutting-edge Neural Processing Unit (NPU), using half-precision storage and mixed-precision arithmetic. The approach increases data throughput and computing efficiency, enabling more effective seismic modeling. Furthermore, we propose an efficient matrix-unit-enabled 3D difference algorithm that employs the matrix units on the NPU to accelerate the computation. By fully exploiting the capability of the matrix units and wide SIMD lanes, our solver on the Ascend achieves a speedup of 4.19× over a parallel solver on two AMD CPUs and has successfully simulated the real-world Wenchuan earthquake. To the best of our knowledge, we are the first to conduct seismic simulations on an NPU.
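The mixed-precision idea behind such solvers is to keep the field data in a reduced precision to cut memory traffic while accumulating the finite-difference stencil in a wider type. A minimal sketch follows; since standard C++ has no portable half type before C++23, float storage with double accumulation stands in for the half/float scheme used on the NPU, and the 1D stencil is only a toy analogue of the 3D CGFDM operators.

    // Sketch of mixed precision for finite differences: low-precision storage,
    // higher-precision accumulation. Illustrative stand-in, not the paper's kernel.
    #include <cstdio>
    #include <vector>

    // 4th-order central difference d/dx on a 1D field (interior points only).
    void diff4(const std::vector<float>& u, std::vector<float>& du, double inv_dx) {
        for (std::size_t i = 2; i + 2 < u.size(); ++i) {
            double acc = (-u[i + 2] + 8.0 * u[i + 1] - 8.0 * u[i - 1] + u[i - 2]) / 12.0;
            du[i] = static_cast<float>(acc * inv_dx);
        }
    }

    int main() {
        const std::size_t n = 16;
        const double dx = 0.1;
        std::vector<float> u(n), du(n, 0.0f);
        for (std::size_t i = 0; i < n; ++i) u[i] = static_cast<float>(i * dx);  // u = x
        diff4(u, du, 1.0 / dx);
        std::printf("du[8] = %f (exact derivative of u = x is 1)\n", du[8]);
    }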
{"title":"Accelerating Half-Precision Seismic Simulation on Neural Processing Unit","authors":"Yinuo Wang;Zeyu Song;Wubing Wan;Xinpeng Zhao;Lin Gan;Ping Gao;Wenqiang Wang;Zhenguo Zhang;Haohuan Fu;Wei Xue;Guangwen Yang","doi":"10.1109/TPDS.2025.3584773","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3584773","url":null,"abstract":"Due to the superiority of handling irregular regions of interest, the curvilinear grid finite difference method (CGFDM) has become wildely used in seismic simulation for earthquake hazard evaluation and understanding of earthquake physics. This paper proposes a novel approach that optimizes a CGFDM solver on the Ascend, a cutting-edge Neural Processing Unit (NPU) using half-precision storage and mixed-precision arithmetic. The approach increases the data throughput and computing efficiency, enabling more effective seismic modeling. Furthermore, we propose an efficient matrix unit enabled 3D difference algorithm that employs matrix unit on NPU to accelerate the computation. By fully exploiting the capability of matrix unit and wide SIMD lane, our solver on Ascend achieves a speedup of 4.19 × over the performance of parallel solver on two AMD CPUs and has successfully simulated real-world Wenchuan earthquake. To the best of our knowledge, we are the first to conduct seismic simulations on NPU.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"1998-2013"},"PeriodicalIF":6.0,"publicationDate":"2025-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-10. DOI: 10.1109/TPDS.2025.3587673
Shengle Lin;Guoqing Xiao;Haotian Wang;Wangdong Yang;Kenli Li;Keqin Li
OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability. However, OpenCL-based math libraries still face challenges in fully leveraging device performance. When deploying high-performance arithmetic applications on these devices, the most important hot function is General Matrix-Matrix Multiplication (GEMM). This study presents a meticulously optimized OpenCL GEMM kernel. Our enhanced GEMM kernel emphasizes two key improvements: 1) a three-level double-buffer pipeline that efficiently overlaps data fetching with floating-point computations; 2) a fine-grained prefetching strategy for private memory that increases device occupancy by optimizing register unit utilization. Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning. Experimental results demonstrate considerable performance improvements and advantages across diverse OpenCL devices. Additionally, the BO tuner demonstrates superior efficiency and robustness, outperforming contemporary tuning methods.
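For context, the structural idea underneath tiled GEMM kernels is register/cache blocking so that each loaded element is reused many times. A minimal CPU-side sketch of that blocking scheme follows; the tile size and layout are arbitrary, and the paper's actual kernel adds OpenCL work-groups, the double-buffer pipeline, and private-memory prefetching on top of this pattern.

    // Blocked C = A * B (row-major), the reuse pattern behind tiled GEMM kernels.
    #include <algorithm>
    #include <cstdio>
    #include <vector>

    void gemm_blocked(int n, const std::vector<float>& A,
                      const std::vector<float>& B, std::vector<float>& C) {
        const int T = 4;  // tile size (auto-tuned in a real kernel)
        for (int ii = 0; ii < n; ii += T)
            for (int kk = 0; kk < n; kk += T)
                for (int jj = 0; jj < n; jj += T)
                    for (int i = ii; i < std::min(ii + T, n); ++i)
                        for (int k = kk; k < std::min(kk + T, n); ++k) {
                            float a = A[i * n + k];             // reused across the j tile
                            for (int j = jj; j < std::min(jj + T, n); ++j)
                                C[i * n + j] += a * B[k * n + j];
                        }
    }

    int main() {
        const int n = 8;
        std::vector<float> A(n * n, 1.0f), B(n * n, 2.0f), C(n * n, 0.0f);
        gemm_blocked(n, A, B, C);
        std::printf("C[0] = %.1f (expected %.1f)\n", C[0], 2.0f * n);
    }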
{"title":"High Performance OpenCL-Based GEMM Kernel Auto-Tuned by Bayesian Optimization","authors":"Shengle Lin;Guoqing Xiao;Haotian Wang;Wangdong Yang;Kenli Li;Keqin Li","doi":"10.1109/TPDS.2025.3587673","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3587673","url":null,"abstract":"OpenCL has become the favored framework for emerging heterogeneous devices and FPGAs, owing to its versatility and portability. However, OpenCL-based math libraries still face challenges in fully leveraging device performance. When deploying high-performance arithmetic applications on these devices, the most important hot function is General Matrix-matrix Multiplication (GEMM). This study presents a meticulously optimized OpenCL GEMM kernel. Our enhanced GEMM kernel emphasizes two key improvements: 1) a three-level double buffer pipeline that efficiently overlaps data fetching with floating-point computations; 2) a fine-grained prefetching strategy of private memory to increase device occupancy by optimizing register unit utilization. Furthermore, this work presents a Bayesian Optimization (BO) tuner for kernel auto-tuning. Experimental results demonstrate considerable optimization improvement and performance advantages achieved on diverse OpenCL devices. Additionally, the BO tuner demonstrates superior efficiency and robustness, outperforming contemporary tuning methods.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 9","pages":"1985-1997"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144782079","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-10. DOI: 10.1109/TPDS.2025.3587888
Kåre von Geijer;Philippas Tsigas
The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of elastic relaxation and subsequently present the Lateral structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the Lateral, we design novel elastically relaxed, lock-free queues, stacks, a counter, and a deque, capable of reconfiguring relaxation during run-time. We establish linearizability and define worst-case bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs match the performance of state-of-the-art statically relaxed structures when no elastic changes are utilized. We develop a lightweight, contention-aware controller for adjusting relaxation in real time, and demonstrate its benefits both in a dynamic producer-consumer micro-benchmark and in a parallel BFS traversal, where it improves throughput and work-efficiency compared to static designs.
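To illustrate the relaxation idea itself, the following is a minimal single-threaded sketch of how spreading operations over k internal sub-queues bounds how far an element can be served out of FIFO order. It is only a conceptual illustration: the paper's designs are lock-free, built on the Lateral structure, and can change the relaxation degree elastically at run time.

    // Relaxation by sub-structures: enqueues/dequeues are distributed round-robin
    // over k internal FIFO queues, which bounds the possible reordering. In this
    // balanced single-threaded run the output happens to stay FIFO; under concurrency
    // the sub-queues drain unevenly, which is where the bounded relaxation appears.
    #include <deque>
    #include <iostream>
    #include <vector>

    class RelaxedQueue {
    public:
        explicit RelaxedQueue(std::size_t k) : subs_(k) {}

        void enqueue(int v) { subs_[enq_++ % subs_.size()].push_back(v); }

        bool dequeue(int& out) {
            for (std::size_t tried = 0; tried < subs_.size(); ++tried) {
                auto& q = subs_[deq_++ % subs_.size()];
                if (!q.empty()) { out = q.front(); q.pop_front(); return true; }
            }
            return false;  // all sub-queues empty
        }

    private:
        std::vector<std::deque<int>> subs_;
        std::size_t enq_ = 0, deq_ = 0;
    };

    int main() {
        RelaxedQueue q(3);                              // relaxation degree k = 3
        for (int i = 0; i < 7; ++i) q.enqueue(i);
        int v;
        while (q.dequeue(v)) std::cout << v << ' ';
        std::cout << '\n';
    }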
{"title":"Elastic Relaxation of Concurrent Data Structures","authors":"Kåre von Geijer;Philippas Tsigas","doi":"10.1109/TPDS.2025.3587888","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3587888","url":null,"abstract":"The sequential semantics of many concurrent data structures, such as stacks and queues, inevitably lead to memory contention in parallel environments, thus limiting scalability. Semantic relaxation has the potential to address this issue, increasing the parallelism at the expense of weakened semantics. Although prior research has shown that improved performance can be attained by relaxing concurrent data structure semantics, there is no one-size-fits-all relaxation that adequately addresses the varying needs of dynamic executions. In this paper, we first introduce the concept of <italic>elastic relaxation</i> and consequently present the <italic>Lateral</i> structure, which is an algorithmic component capable of supporting the design of elastically relaxed concurrent data structures. Using the <italic>Lateral</i>, we design novel elastically relaxed, lock-free queues, stacks, a counter, and a deque, capable of reconfiguring relaxation during run-time. We establish linearizability and define worst-case bounds for relaxation errors in our designs. Experimental evaluations show that our elastic designs match the performance of state-of-the-art statically relaxed structures when no elastic changes are utilized. We develop a lightweight, contention-aware controller for adjusting relaxation in real time, and demonstrate its benefits both in a dynamic producer-consumer micro-benchmark and in a parallel BFS traversal, where it improves throughput and work-efficiency compared to static designs.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2578-2595"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11077833","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-10. DOI: 10.1109/TPDS.2025.3587445
Kaiyuan Liu;Xiaobo Zhou;Li Li
Large Language Models (LLMs) are reshaping mobile AI. Directly deploying LLMs on mobile devices is an emerging paradigm that can broadly support different mobile applications while preserving data privacy. However, an intensive memory footprint, long inference latency, and high energy consumption severely bottleneck on-device LLM inference in real-world scenarios. In response to these challenges, this work introduces m$^{2}$LLM, an innovative framework that performs joint optimization across multiple dimensions for on-device LLM inference in order to strike a balance among performance, real-time responsiveness, and energy efficiency. Specifically, m$^{2}$LLM features the following four core components: 1) Hardware-aware Model Customization, 2) Elastic Chunk-wise Pipeline, 3) Latency-guided Prompt Compression, and 4) Layer-wise Resource Scheduling. These four components interact with each other to guide the inference process along the following three dimensions. At the model level, m$^{2}$LLM designs an elastic chunk-wise pipeline to expand usable device memory and customizes the model according to the hardware configuration, maximizing performance within the memory budget. At the prompt level, facing stochastic inputs, m$^{2}$LLM judiciously compresses the prompts in order to guarantee that the first token can be generated in time while maintaining the semantic information. Additionally, at the system level, a layer-wise resource scheduler is employed to complete the token generation process with minimized energy consumption while guaranteeing real-time performance in the highly dynamic mobile environment. m$^{2}$LLM is evaluated on an off-the-shelf smartphone with representative models and datasets. Compared to baseline methods, m$^{2}$LLM delivers 2.99–13.5× TTFT acceleration and 2.28–24.3× energy savings, with only a minimal model performance loss of 2%–7%.
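A minimal sketch of the latency-guided compression idea: translate the TTFT target into a prompt-token budget using an assumed per-token prefill cost, then keep only the highest-importance segments within that budget. The importance scores and cost model below are made-up placeholders; m$^{2}$LLM's actual compressor is more sophisticated and semantics-aware.

    // Latency-guided prompt compression, sketched: budget = TTFT target / prefill cost,
    // then greedily keep the most important segments that fit. Illustrative only.
    #include <algorithm>
    #include <iostream>
    #include <string>
    #include <vector>

    struct Segment { std::string text; int tokens; double importance; };

    std::vector<Segment> compress(std::vector<Segment> segs,
                                  double ttft_budget_ms, double ms_per_token) {
        const int max_tokens = static_cast<int>(ttft_budget_ms / ms_per_token);
        std::sort(segs.begin(), segs.end(),
                  [](const Segment& a, const Segment& b) { return a.importance > b.importance; });
        std::vector<Segment> kept;
        int used = 0;
        for (const auto& s : segs)
            if (used + s.tokens <= max_tokens) { kept.push_back(s); used += s.tokens; }
        return kept;
    }

    int main() {
        std::vector<Segment> prompt = {
            {"system instructions", 40, 0.9},
            {"long retrieved context", 300, 0.4},
            {"user question", 30, 1.0},
        };
        auto kept = compress(prompt, 1500.0, 10.0);  // 1.5 s TTFT target, 10 ms/token prefill
        for (const auto& s : kept) std::cout << "keep: " << s.text << "\n";
    }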
{"title":"m$^{2}$2LLM: A Multi-Dimensional Optimization Framework for LLM Inference on Mobile Devices","authors":"Kaiyuan Liu;Xiaobo Zhou;Li Li","doi":"10.1109/TPDS.2025.3587445","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3587445","url":null,"abstract":"Large Language Models (LLMs) are reshaping mobile AI. Directly deploying LLMs on mobile devices is an emerging paradigm that can widely support different mobile applications while preserving data privacy. However, intensive memory footprint, long inference latency and high energy consumption severely bottlenecks on-device inference of LLM in real-world scenarios. In response to these challenges, this work introduces m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM, an innovative framework that performs joint optimization from multiple dimensions for on-device LLM inference in order to strike a balance among performance, realtimeliness and energy efficiency. Specifically, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM features the following four core components including : 1) Hardware-aware Model Customization, 2) Elastic Chunk-wise Pipeline, 3) Latency-guided Prompt Compression and 4) Layer-wise Resource Scheduling. These four components interact with each other in order to guide the inference process from the following three dimensions. At the model level, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM designs an elastic chunk-wise pipeline to expand device memory and customize the model according to the hardware configuration, maximizing performance within the memory budget. At the prompt level, facing the stochastic input, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM judiciously compresses the prompts in order to guarantee the first token can be generated in time while maintaining the semantic information. Additionally, at the system level, the layer-wise resource scheduler is employed in order to complete the token generation process with minimized energy consumption while guaranteeing the realtimeness in the highly dynamic mobile environment. m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM is evaluated on off-the-shelf smartphone with represented models and datasets. Compared to baseline methods, m<inline-formula><tex-math>$^{2}$</tex-math></inline-formula>LLM delivers 2.99–13.5× TTFT acceleration and 2.28–24.3× energy savings, with only a minimal model performance loss of 2% –7% .","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 10","pages":"2014-2029"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144831796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-07-10. DOI: 10.1109/TPDS.2025.3586450
Youquan Xian;Xueying Zeng;Chunpei Li;Peng Wang;Dongcheng Li;Peng Liu;Xianxian Li
In recent years, the blockchain oracle, as the key link between the blockchain and real-world data, has greatly expanded the application scope of blockchain. In particular, the emergence of the Multi-Data Source (MDS) oracle has greatly improved oracle reliability in the presence of untrustworthy data sources. However, current MDS oracle schemes require nodes to obtain data redundantly from multiple data sources to guarantee data reliability, which greatly increases the resource overhead and response time of the system. Therefore, in this paper, we propose a Secure and Efficient Multi-data Source Oracle framework (SEMSO), in which each node only needs to access one data source to ensure the reliability of the final data. First, we design a new off-chain data aggregation protocol, TBLS, to guarantee data source diversity and reliability at low cost. Second, under the rational-agent assumption, the data source selection task of nodes is modeled as a Bayesian game with incomplete information and solved to maximize each node's revenue while improving the success rate of TBLS aggregation and the system response speed. Security analysis verifies the reliability of the proposed scheme, and experiments show that, under the same environmental assumptions, SEMSO accounts for data diversity while reducing the response time by 23.5%.
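As a purely conceptual illustration of the multi-data-source idea (each node reports the value it fetched from its single assigned source, and a result is accepted once enough reports are available), the sketch below accepts a threshold of reports and publishes their median. This is not SEMSO's protocol: TBLS aggregates threshold BLS signatures off-chain rather than counting plaintext reports.

    // Threshold-style aggregation over reports from nodes that each queried one
    // data source: accept once at least t reports arrive and publish the median.
    #include <algorithm>
    #include <iostream>
    #include <optional>
    #include <vector>

    std::optional<double> aggregate(std::vector<double> reports, std::size_t t) {
        if (reports.size() < t) return std::nullopt;   // not enough reports yet
        std::sort(reports.begin(), reports.end());
        return reports[reports.size() / 2];            // median damps outlier sources
    }

    int main() {
        std::vector<double> reports = {101.2, 100.9, 250.0, 101.0, 100.8};  // one bad source
        if (auto v = aggregate(reports, 4))
            std::cout << "final oracle value: " << *v << "\n";
    }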
{"title":"SEMSO: A Secure and Efficient Multi-Data Source Blockchain Oracle","authors":"Youquan Xian;Xueying Zeng;Chunpei Li;Peng Wang;Dongcheng Li;Peng Liu;Xianxian Li","doi":"10.1109/TPDS.2025.3586450","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3586450","url":null,"abstract":"In recent years, blockchain oracle, as the key link between blockchain and real-world data interaction, has greatly expanded the application scope of blockchain. In particular, the emergence of the Multi-Data Source (MDS) oracle has greatly improved the reliability of the oracle in the case of untrustworthy data sources. However, the current MDS oracle scheme requires nodes to obtain data redundantly from multiple data sources to guarantee data reliability, which greatly increases the resource overhead and response time of the system. Therefore, in this paper, we propose a Secure and Efficient Multi-data Source Oracle framework (SEMSO), where nodes only need to access one data source to ensure the reliability of final data. First, we design a new off-chain data aggregation protocol TBLS, to guarantee data source diversity and reliability at low cost. Second, according to the rational man assumption, the data source selection task of nodes is modeled and solved based on the Bayesian game under incomplete information to maximize the node’s revenue while improving the success rate of TBLS aggregation and system response speed. Security analysis verifies the reliability of the proposed scheme, and experiments show that under the same environmental assumptions, SEMSO takes into account data diversity while reducing the response time by 23.5%.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2512-2523"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Real-time stream processing applications (e.g., IoT data analytics and fraud detection) are becoming integral to everyday life. A robust and efficient Big Data system, especially a streaming pipeline composed of producers, brokers, and consumers, is at the heart of the successful deployment of these applications. However, their deployment and assessment can be complex and costly due to the intricate interactions between pipeline components and the reliance on expensive hardware or cloud environments. Thus, we propose $\mathsf{streamline}$, an agile, efficient, and dependable framework for assessing streaming applications without requiring a hardware testbed or cloud setup. To simplify the deployment, prototyping, and benchmarking of end-to-end stream processing applications involving distributed platforms (e.g., Apache Kafka, Spark, Flink), the framework provides a lightweight environment with a developer-friendly, high-level API for dynamically selecting and configuring pipeline components. Moreover, the modular architecture of $\mathsf{streamline}$ enables developers to integrate any required platform into their systems. The performance and robustness of a deployed pipeline can be assessed under varying network conditions and injected faults. Furthermore, it facilitates benchmarking event streaming platforms like Apache Kafka and RabbitMQ. Extensive evaluations of various streaming applications confirm the effectiveness and dependability of $\mathsf{streamline}$.
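The abstract does not show the configuration API itself; the following is a purely hypothetical sketch of what a declarative producer-broker-consumer pipeline description could look like. Every class and method name here is invented for illustration and is not $\mathsf{streamline}$'s real interface.

    // Hypothetical builder-style pipeline description in the spirit of a high-level
    // configuration API; all identifiers are invented for illustration.
    #include <iostream>
    #include <string>
    #include <vector>

    struct Stage { std::string role, platform; };

    class PipelineSpec {
    public:
        PipelineSpec& producer(std::string p) { stages_.push_back({"producer", std::move(p)}); return *this; }
        PipelineSpec& broker(std::string p)   { stages_.push_back({"broker",   std::move(p)}); return *this; }
        PipelineSpec& consumer(std::string p) { stages_.push_back({"consumer", std::move(p)}); return *this; }
        void describe() const {
            for (const auto& s : stages_) std::cout << s.role << " -> " << s.platform << "\n";
        }
    private:
        std::vector<Stage> stages_;
    };

    int main() {
        PipelineSpec spec;
        spec.producer("iot-sensor-emulator").broker("kafka").consumer("flink-job");
        spec.describe();   // a real framework would deploy and benchmark this pipeline
    }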
{"title":"$mathsf{streamline}$: Accelerating Deployment and Assessment of Real-Time Big Data Systems","authors":"Md. Monzurul Amin Ifath;Tommaso Melodia;Israat Haque","doi":"10.1109/TPDS.2025.3587641","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3587641","url":null,"abstract":"Real-time stream processing applications (e.g., IoT data analytics and fraud detection) are becoming integral to everyday life. A robust and efficient Big Data system, especially a streaming pipeline composed of producers, brokers, and consumers, is at the heart of the successful deployment of these applications. However, their deployment and assessment can be complex and costly due to the intricate interactions between pipeline components and the reliance on expensive hardware or cloud environments. Thus, we propose <italic><inline-formula><tex-math>$mathsf{streamline}$</tex-math><alternatives><mml:math><mml:mi>streamline</mml:mi></mml:math><inline-graphic></alternatives></inline-formula></i>, an agile, efficient, and dependable framework as an alternative to assess streaming applications without requiring a hardware testbed or cloud setup. To simplify the deployment, prototyping, and benchmarking of end-to-end stream processing applications involving distributed platforms (e.g., Apache Kafka, Spark, Flink), the framework provides a lightweight environment with a developer-friendly, high-level API for dynamically selecting and configuring pipeline components. Moreover, the modular architecture of <italic><inline-formula><tex-math>$mathsf{streamline}$</tex-math><alternatives><mml:math><mml:mi>streamline</mml:mi></mml:math><inline-graphic></alternatives></inline-formula></i> enables developers to integrate any required platform into their systems. The performance and robustness of a deployed pipeline can be assessed with varying network conditions and injected faults. Furthermore, it facilitates benchmarking event streaming platforms like Apache Kafka and RabbitMQ. Extensive evaluations of various streaming applications confirm the effectiveness and dependability of <italic><inline-formula><tex-math>$mathsf{streamline}$</tex-math><alternatives><mml:math><mml:mi>streamline</mml:mi></mml:math><inline-graphic></alternatives></inline-formula></i>.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"36 12","pages":"2455-2468"},"PeriodicalIF":6.0,"publicationDate":"2025-07-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145352138","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}