AWTO: A latency-optimized task offloading scheme for LLM-driven agentic workflows on heterogeneous edge
Pub Date: 2026-02-02 | DOI: 10.1016/j.future.2026.108415
Peng Yu, Bo Liu, Shaomin Tang, Dongdong Li, Weiwei Lin
Agentic workflows, driven by Large Language Models (LLMs), present new opportunities for realizing advanced edge intelligence in data-sensitive domains such as finance and healthcare. However, deploying these workflows in private, resource-constrained edge environments poses unique challenges. Unlike public cloud services, these scenarios require computations to be performed locally on dedicated edge clusters to meet strict data compliance and privacy regulations. This restriction, coupled with the limited memory capacity of edge devices relative to the massive size of LLMs, makes dynamic memory management and model loading critical factors. Furthermore, the autoregressive nature of LLMs introduces high dynamic uncertainty in inference latency and memory footprint, fundamentally contradicting the static information assumptions of traditional scheduling methods. To address these challenges, we propose AWTO, a Deep Reinforcement Learning (DRL) offloading scheme designed to minimize the makespan of agentic workflows in isolated edge environments. The core of AWTO is a task-by-task dynamic decision-making mechanism that explicitly handles on-demand model loading and memory contention. We formulate this problem as a Markov Decision Process (MDP) and employ a Proximal Policy Optimization (PPO)-based algorithm. A novel three-module LSTM encoder is designed to capture task dependencies, device heterogeneity, and real-time memory states. Experimental results in heterogeneous environments demonstrate that AWTO reduces the average makespan by 16.99% to 36.36% compared to heuristic baselines. Furthermore, it achieves a 14.00% gain over DRL methods, validating its adaptability to dynamic memory constraints and cache-aware scheduling.
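The abstract above frames per-task offloading as an MDP over device memory and cached models. The sketch below (plain Python, not the authors' code) illustrates that state and a greedy cost-based decision rule; the device specs, timings, and the model name "llm-7b" are hypothetical, and AWTO would replace the greedy rule with a learned PPO policy whose state also encodes task dependencies.

```python
# Minimal sketch of the per-task offloading decision with on-demand model loading.
from dataclasses import dataclass, field

@dataclass
class Device:
    mem_total_gb: float                          # total device memory
    load_gbps: float                             # model-loading bandwidth (GB/s)
    tok_per_s: float                             # autoregressive decoding speed
    cached: set = field(default_factory=set)     # model ids currently in memory

def est_cost(dev, model_id, model_gb, out_tokens):
    """Estimated latency of one task on `dev`: load the model if it is not
    cached, then generate `out_tokens` tokens autoregressively."""
    load = 0.0 if model_id in dev.cached else model_gb / dev.load_gbps
    return load + out_tokens / dev.tok_per_s

def offload(devices, task):
    """Greedy baseline: pick the device with the lowest estimated cost and
    update its model cache. AWTO learns this choice with a PPO policy instead."""
    model_id, model_gb, out_tokens = task
    best = min(range(len(devices)),
               key=lambda i: est_cost(devices[i], model_id, model_gb, out_tokens))
    cost = est_cost(devices[best], model_id, model_gb, out_tokens)
    if model_gb <= devices[best].mem_total_gb:
        devices[best].cached.add(model_id)       # on-demand model loading
    return best, cost

if __name__ == "__main__":
    cluster = [Device(16, 2.0, 30), Device(8, 1.0, 60, cached={"llm-7b"})]
    print(offload(cluster, ("llm-7b", 4.0, 256)))  # cache hit favors device 1
```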
{"title":"AWTO: A latency-optimized task offloading scheme for LLM-driven agentic workflows on heterogeneous edge","authors":"Peng Yu , Bo Liu , Shaomin Tang , Dongdong Li , Weiwei Lin","doi":"10.1016/j.future.2026.108415","DOIUrl":"10.1016/j.future.2026.108415","url":null,"abstract":"<div><div>Agentic workflows, driven by Large Language Models (LLMs), present new opportunities for realizing advanced edge intelligence in data-sensitive domains such as finance and healthcare. However, deploying these workflows in private, resource-constrained edge environments poses unique challenges. Unlike public cloud services, these scenarios require computations to be performed locally on dedicated edge clusters to meet strict data compliance and privacy regulations. This restriction, coupled with the limited memory capacity of edge devices relative to the massive size of LLMs, makes dynamic memory management and model loading critical factors. Furthermore, the autoregressive nature of LLMs introduces high dynamic uncertainty in inference latency and memory footprint, fundamentally contradicting the static information assumptions of traditional scheduling methods. To address these challenges, we propose AWTO, a Deep Reinforcement Learning (DRL) offloading scheme designed to minimize the makespan of agentic workflows in isolated edge environments. The core of AWTO is a task-by-task dynamic decision-making mechanism that explicitly handles on-demand model loading and memory contention. We formulate this problem as a Markov Decision Process (MDP) and employ a Proximal Policy Optimization (PPO)-based algorithm. A novel three-module LSTM encoder is designed to capture task dependencies, device heterogeneity, and real-time memory states. Experimental results in heterogeneous environments demonstrate that AWTO reduces the average makespan by 16.99% to 36.36% compared to heuristic baselines. Furthermore, it achieves a 14.00% gain over DRL methods, validating its adaptability to dynamic memory constraints and cache-aware scheduling.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108415"},"PeriodicalIF":6.2,"publicationDate":"2026-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Block-FDT: Blockchain-Enhanced Federated Learning Approach to Secure DT-Assisted IIoT Networks
Pub Date: 2026-02-01 | DOI: 10.1016/j.future.2026.108410
Sekione Reward Jeremiah, ByungHyun Jo, Kim-Kwang Raymond Choo, Jong Hyuk Park
{"title":"Block-FDT: Blockchain-Enhanced Federated Learning Approach to Secure DT-Assisted IIoT Networks","authors":"Sekione Reward Jeremiah, ByungHyun Jo, Kim-Kwang Raymond Choo, Jong Hyuk Park","doi":"10.1016/j.future.2026.108410","DOIUrl":"https://doi.org/10.1016/j.future.2026.108410","url":null,"abstract":"","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"89 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2026-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146110195","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Multi-view pedestrian detection via residual mask fusion and cosine similarity-based passive sampler for video surveillance systems
Pub Date: 2026-01-25 | DOI: 10.1016/j.future.2026.108384
He Li, Jiajia Gui, Weihang Kong, Xingchen Zhang
Multi-view pedestrian detection aims to generate a bird’s-eye view occupancy map of pedestrians from multiple calibrated camera views. Multi-view methods offer advantages over single-view approaches: they can mitigate occlusions, expand scene coverage, and improve robustness. However, existing multi-view detection methods still face two critical challenges: mixing heterogeneous cross-view information in the fused representation and feature misalignment in the world coordinate system caused by various scales across views. To solve these issues, we develop a novel multi-view pedestrian detection framework that includes a residual mask fusion module and a cosine similarity-based passive sampler. Specifically, the residual mask fusion module enables adaptive feature selection and compensation across views, yielding an optimal fusion under geometric redundancy. Moreover, the cosine similarity-based passive sampler computes dynamic coordinate offsets by evaluating feature consistency. This reduces the impact of unavoidable biases introduced during projection. Experimental results on Wildtrack, MultiviewX and CityStreet demonstrate the effectiveness and reliability of the developed framework for multi-view pedestrian detection. Our code is available at https://github.com/guixiaojia/improve-shot.
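As a rough illustration of the cosine similarity-based passive sampler described above, the numpy sketch below scores candidate ground-plane offsets by the agreement between features projected from different views and scales the chosen offset by that agreement. The feature dimension, candidate set, and weighting rule are assumptions for illustration, not the paper's exact formulation.

```python
import numpy as np

def cosine_sim(a, b, eps=1e-8):
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + eps))

def passive_offset(ref_feat, cand_feats, cand_offsets):
    """Pick the candidate offset whose projected feature agrees most with the
    reference view, scaled by that agreement (low consistency -> small shift)."""
    sims = np.array([cosine_sim(ref_feat, f) for f in cand_feats])
    k = int(np.argmax(sims))
    return sims[k] * np.asarray(cand_offsets[k])

ref = np.random.rand(64)                        # feature of a BEV cell from view 1
cands = [ref + 0.05 * np.random.rand(64),       # well-aligned projection from view 2
         np.random.rand(64)]                    # misaligned projection
print(passive_offset(ref, cands, [(1, 0), (0, 1)]))
```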
{"title":"Multi-view pedestrian detection via residual mask fusion and cosine similarity-based passive sampler for video surveillance systems","authors":"He Li, Jiajia Gui, Weihang Kong, Xingchen Zhang","doi":"10.1016/j.future.2026.108384","DOIUrl":"https://doi.org/10.1016/j.future.2026.108384","url":null,"abstract":"Multi-view pedestrian detection aims to generate a bird’s-eye view occupancy map of pedestrians from multiple calibrated camera views. Multi-view methods offer advantages over single-view approaches: they can mitigate occlusions, expand scene coverage, and improve robustness. However, existing multi-view detection methods still face two critical challenges: mixing heterogeneous cross-view information in the fused representation and feature misalignment in the world coordinate system caused by various scales across views. To solve these issues, we develop a novel multi-view pedestrian detection framework that includes a residual mask fusion module and a cosine similarity-based passive sampler. Specifically, the residual mask fusion module enables adaptive feature selection and compensation across views, yielding an optimal fusion under geometric redundancy. Moreover, the cosine similarity-based passive sampler computes dynamic coordinate offsets by evaluating feature consistency. This reduces the impact of unavoidable biases introduced during projection. Experimental results on Wildtrack, MultiviewX and CityStreet demonstrate the effectiveness and reliability of the developed framework for multi-view pedestrian detection. Our code is available at <ce:inter-ref xlink:href=\"https://github.com/guixiaojia/improve-shot\" xlink:type=\"simple\">https://github.com/guixiaojia/improve-shot</ce:inter-ref>.","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"1 1","pages":""},"PeriodicalIF":7.5,"publicationDate":"2026-01-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146047989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A comparative performance and efficiency analysis of Apple’s M architectures: A GEMM case study
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108393
Sandra Catalán, Rafael Rodríguez-Sánchez, Carlos García Sánchez, Luis Piñuel Moreno
This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple’s SoC architectures, which integrate multiple computing engines, raising the scientific question of which hardware components are best suited for executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (GEMM). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).
The assessments use GEMM as a benchmark to characterize the performance of the CPU and GPU, alongside tests on the AMX, which is specialized in handling large-scale mathematical operations, and on the ANE, which is specifically designed for Deep Learning purposes. Additionally, energy consumption data has been collected to analyze the energy efficiency of the aforementioned resources. Results highlight notable improvements in computational capacity and energy efficiency over successive generations. On one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU’s FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing other accelerators with over 700 GFLOPs/Watt under batched workloads.
This analysis offers a clear understanding of how Apple’s custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. In addition, a significant contribution of this study is the comprehensive comparative analysis of Apple’s accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.
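For context on the throughput and efficiency figures quoted above: a GEMM C = A·B with A of size M×K and B of size K×N costs about 2·M·N·K flops, so GFLOP/s and GFLOPs/Watt follow directly from measured time and power. A minimal sketch of that arithmetic; the 10 W value is a placeholder for an external power measurement (e.g., macOS powermetrics), not a measured figure.

```python
import time
import numpy as np

def gemm_gflops(m, n, k, dtype=np.float32, reps=5):
    """Time an MxK by KxN matrix multiply and report sustained GFLOP/s."""
    a = np.random.rand(m, k).astype(dtype)
    b = np.random.rand(k, n).astype(dtype)
    t0 = time.perf_counter()
    for _ in range(reps):
        _ = a @ b
    dt = (time.perf_counter() - t0) / reps
    return 2.0 * m * n * k / dt / 1e9            # flop count / time, in GFLOP/s

gf = gemm_gflops(2048, 2048, 2048)
watts = 10.0                                     # placeholder; measure separately
print(f"{gf:.1f} GFLOP/s, {gf / watts:.1f} GFLOPs/Watt")
```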
{"title":"A comparative performance and efficiency analysis of Apple’s M architectures: A GEMM case study","authors":"Sandra Catalán , Rafael Rodríguez-Sánchez , Carlos García Sánchez , Luis Piñuel Moreno","doi":"10.1016/j.future.2026.108393","DOIUrl":"10.1016/j.future.2026.108393","url":null,"abstract":"<div><div>This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple´s SoC architectures, which integrate multiple computing engines raising the scientific question of which hardware components are best suited for executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (<span>GEMM</span>). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).</div><div>The assessments use the <span>GEMM</span> as benchmark to characterize the performance of the CPU and GPU, alongside tests on AMX, which is specialized in handling large-scale mathematical operations, and tests on the ANE, which is specifically designed for Deep Learning purposes. Additionally, energy consumption data has been collected to analyze the energy efficiency of the aforementioned resources. Results highlight notable improvements in computational capacity and energy efficiency over successive generations. On one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU’s FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing other accelerators with over 700 GFLOPs/Watt under batched workloads.</div><div>This analysis offers a clear understanding of how Apple´s custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. In addition, a significant contribution of this study is the comprehensive comparative analysis of Apple’s accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108393"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146048040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Online 3D trajectory and resource optimization for dynamic UAV-assisted MEC systems
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108389
Zhao Tong, Shiyan Zhang, Jing Mei, Can Wang, Keqin Li
The integration and development of unmanned aerial vehicles (UAVs) and mobile edge computing (MEC) technology provide users with more flexible, reliable, and high-quality computing services. However, most UAV-assisted MEC models are designed for static environments and therefore do not fit the practical scenarios considered in this work. In this paper, we consider a UAV-assisted MEC platform, which can provide continuous services for multiple mobile ground users with random movements and task arrivals. Moreover, we investigate the long-term system utility maximization problem in UAV-assisted MEC systems, considering continuous task offloading, users’ mobility, UAV’s 3D trajectory control, and resource allocation. To address the challenges of limited system information, high-dimensional continuous actions, and state space approximation, we propose an Online decision-making algorithm for Dynamic environments based on Exploration-enhanced Greedy DDPG (ODEGD). Additionally, to more accurately evaluate the algorithm’s performance, we introduce real-world roads into the experiments. Experimental results show that the proposed algorithm reduces response delay by 26.98% and energy consumption by 22.61% compared to other algorithms, while achieving the highest system utility. These results validate the applicability of the ODEGD algorithm under dynamic conditions, demonstrating its robustness and scalability.
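The abstract does not detail ODEGD's exploration mechanism, so the snippet below is only a generic sketch of exploration-enhanced greedy action selection over a deterministic (DDPG-style) actor: with probability eps take a uniformly random action, otherwise perturb the greedy action with Gaussian noise and clip to the action bounds. The toy actor, the 4-dimensional action (3D velocity plus an offloading ratio), and the bounds are hypothetical.

```python
import numpy as np

def explore(actor, state, low, high, eps=0.1, sigma=0.05,
            rng=np.random.default_rng()):
    """Exploration wrapper around a deterministic continuous-control actor."""
    if rng.random() < eps:                       # occasional uniform exploration
        return rng.uniform(low, high)
    a = actor(state)                             # greedy (deterministic) action
    return np.clip(a + rng.normal(0.0, sigma, size=a.shape), low, high)

# toy actor: 3D UAV velocity command plus an offloading ratio in [0, 1]
actor = lambda s: np.array([0.2, -0.1, 0.0, 0.7])
print(explore(actor, state=None,
              low=np.array([-1.0, -1.0, -1.0, 0.0]),
              high=np.array([1.0, 1.0, 1.0, 1.0])))
```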
{"title":"Online 3D trajectory and resource optimization for dynamic UAV-assisted MEC systems","authors":"Zhao Tong , Shiyan Zhang , Jing Mei , Can Wang , Keqin Li","doi":"10.1016/j.future.2026.108389","DOIUrl":"10.1016/j.future.2026.108389","url":null,"abstract":"<div><div>The integration and development of unmanned aerial vehicles (UAVs) and mobile edge computing (MEC) technology provide users with more flexible, reliable, and high-quality computing services. However, most UAV-assisted MEC model designs mainly focus on static environments, which do not apply to the practical scenarios considered in this work. In this paper, we consider a UAV-assisted MEC platform, which can provide continuous services for multiple mobile ground users with random movements and task arrivals. Moreover, we investigate the long-term system utility maximization problem in UAV-assisted MEC systems, considering continuous task offloading, users’ mobility, UAV’s 3D trajectory control, and resource allocation. To address the challenges of limited system information, high-dimensional continuous actions, and state space approximation, we propose an <u>O</u>nline decision-making algorithm for <u>D</u>ynamic environments based on <u>E</u>xploration-enhanced <u>G</u>reedy <u>D</u>DPG (ODEGD). Additionally, to more accurately evaluate the algorithm’s performance, we introduced real-world roads into the experiment. Experimental results show that the proposed algorithm reduces response delay by 26.98% and energy consumption by 22.61% compared to other algorithms, while achieving the highest system utility. These results validate the applicability of the ODEGD algorithm under dynamic conditions, demonstrating its good robustness and scalability.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108389"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146047991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108383
Aleix Boné, Alejandro Aguirre, David Álvarez, Pedro J. Martinez-Ferrer, Vicenç Beltran
Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some cases these can be combined with optimized vendor math libraries such as cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms; combining them within a single application is therefore error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.
Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime.
The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable, in both execution time and memory footprint, to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems. On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation.
These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.
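To make the data-flow model concrete, the sketch below runs a tiny DAG of tasks in dependency order; it only mimics the scheduling idea, whereas the paper expresses dependencies with OpenMP/OmpSs-2 clauses and elevates CUDA/SYCL calls to first-class tasks via TACUDA/TASYCL. The task names and the sequential executor are illustrative.

```python
from collections import defaultdict, deque

def run_dag(tasks, deps):
    """tasks: name -> callable; deps: name -> list of prerequisite task names.
    Executes each task once all of its prerequisites have completed."""
    indeg = {t: len(deps.get(t, [])) for t in tasks}
    users = defaultdict(list)
    for t, prereqs in deps.items():
        for p in prereqs:
            users[p].append(t)
    ready = deque(t for t, d in indeg.items() if d == 0)
    while ready:                                 # a real runtime runs ready tasks in parallel
        t = ready.popleft()
        tasks[t]()
        for u in users[t]:
            indeg[u] -= 1
            if indeg[u] == 0:
                ready.append(u)

run_dag({"h2d": lambda: print("copy in"),
         "kernel": lambda: print("device kernel"),
         "d2h": lambda: print("copy out")},
        {"kernel": ["h2d"], "d2h": ["kernel"]})
```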
{"title":"A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs","authors":"Aleix Boné , Alejandro Aguirre , David Álvarez , Pedro J. Martinez-Ferrer , Vicenç Beltran","doi":"10.1016/j.future.2026.108383","DOIUrl":"10.1016/j.future.2026.108383","url":null,"abstract":"<div><div>Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some occasions they can be combined with optimized vendor math libraries: e.g., cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms. Combining them within a single application is therefore error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.</div><div>Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime.</div><div>The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable —in both execution time and memory footprint— to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems. On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation.</div><div>These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108383"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146048039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

MoFormer: A centrality-aware multi-task graph transformer with multi-gate mixture-of-experts for link-level network performance modeling
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108406
Hanlin Liu, Aliya Bao, Mingyue Li, Yintan Ai, Hua Li
Link-level network performance modeling (NPM) facilitates efficient traffic control, precise fault localization, and reliable resource management in emerging network paradigms such as Software-Defined Networking and Intent-Based Networking. A variety of models, such as Long Short-Term Memory and Graph Neural Networks (GNNs), are utilized to enhance the effectiveness of NPM. However, a practical NPM requires the generalization ability to adapt to diverse network topologies and prediction tasks without retraining. To meet this requirement, graph Transformer models offer a breakthrough: by encoding nodes and their structural features into tokens, they break free from the dependence on fixed graph structures typical of traditional GNNs. Nevertheless, they mostly focus on node-centric representations, which are insufficient to capture the fine-grained interactions and dependencies between links, thus limiting their applicability in link-level NPM. In this paper, we propose a centrality-aware multi-task graph Transformer with multi-gate mixture-of-experts (MMoE), named MoFormer, for link-level NPM. Specifically, a link-centric tokenized graph representation method is proposed to transform each link and its neighborhood information into a sequence of tokens guided by the routing protocol. A routing-aware betweenness centrality encoding mechanism is further developed to strengthen the characterization of these tokens by accounting for the relative importance of each link. MoFormer takes advantage of MMoE combined with the Transformer to enable joint learning of multiple prediction tasks. Experimental results on both simulated and real-world datasets demonstrate significant improvements of MoFormer over existing state-of-the-art baselines while maintaining superior generalization ability.
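The centrality encoding can be illustrated with standard edge betweenness centrality (networkx), discretized into a small number of buckets that could index an embedding table. The paper's routing-aware variant would count only paths actually chosen by the routing protocol; the random graph and the four-bucket quantization below are arbitrary illustrative choices.

```python
import networkx as nx
import numpy as np

G = nx.erdos_renyi_graph(12, 0.3, seed=0)
ebc = nx.edge_betweenness_centrality(G)          # {(u, v): centrality score}

def centrality_bucket(edge, n_buckets=4):
    """Map a link's betweenness centrality to a discrete token feature."""
    vals = np.array(list(ebc.values()))
    cuts = np.quantile(vals, np.linspace(0, 1, n_buckets + 1)[1:-1])
    return int(np.searchsorted(cuts, ebc[edge]))

print({e: centrality_bucket(e) for e in list(G.edges())[:5]})
```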
{"title":"MoFormer: A centrality-aware multi-task graph transformer with multi-gate mixture-of-experts for link-level network performance modeling","authors":"Hanlin Liu , Aliya Bao , Mingyue Li , Yintan Ai , Hua Li","doi":"10.1016/j.future.2026.108406","DOIUrl":"10.1016/j.future.2026.108406","url":null,"abstract":"<div><div>Link-level network performance modeling (NPM) facilitates efficient traffic control, precise fault localization, and reliable resource management in emerging network paradigms such as Software-Defined Networking and Intent-Based Networking. A variety of models, such as Long Short-Term Memory and Graph Neural Networks (GNNs), are utilized to enhance the effectiveness of NPM. However, a practical NPM requires the generalization ability to adapt to diverse network topologies and prediction tasks without retraining. To meet this requirement, graph Transformer models are a breakthrough by encoding nodes and their structural features into tokens, breaking free from the dependencies on fixed graph structures typical of traditional GNNs. Nevertheless, they mostly focus on node-centric representations, which are insufficient to capture the fine-grained interactions and dependencies between links, thus limiting their applicability in link-level NPM. In this paper, we propose a centrality-aware multi-task graph Transformer with multi-gate mixture-of-experts (MMoE), named MoFormer, for link-level NPM. Specifically, a link-centric tokenized graph representation method is proposed to transform each link and its neighborhood information into a sequence of tokens guided by the routing protocol. A routing-aware betweenness centrality encoding mechanism is further developed to enhance the ability to characterize the tokens considering the relative importance of each link. MoFormer takes advantage of MMoE combined with Transformer to enable joint learning of multiple prediction tasks. Experimental results on both simulated and real-world datasets demonstrate the significant improvements of MoFormer over existing state-of-the-art baselines while maintaining superior generalization ability.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108406"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146047990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Adaptive CPU sharing for co-located latency-critical JVM applications and batch jobs under dynamic workloads
Pub Date: 2026-01-23 | DOI: 10.1016/j.future.2026.108387
Dishi Xu, Fagui Liu, Bin Wang, Xuhao Tang, Qingbo Wu
Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit inferior cold-start performance, and frequent deletion and creation of application instances to adjust resource allocation results in performance degradation. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs encounters serious challenges due to contention for shared CPU resources. Therefore, we present ChaosRM, a bi-level resource management framework for JVM workload co-location to enhance resource utilization efficiency while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism, utilizing two CPU zones to isolate JLRAs and batch jobs, and a shared zone for concurrently executing their threads. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns the shared zone allocation strategy and the performance target represented by thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and up to 54.29% over all baselines.
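A minimal sketch of the tri-zone idea using plain CPU affinity on Linux: one zone for latency-critical JVM applications, one for batch jobs, and a shared zone granted on demand. The CPU ranges are hypothetical, and the real system manages cgroup cpusets and scheduling decisions rather than per-process affinity calls.

```python
import os

LC_ZONE     = set(range(0, 8))      # reserved for latency-critical JVM apps
BE_ZONE     = set(range(8, 12))     # reserved for best-effort batch jobs
SHARED_ZONE = set(range(12, 16))    # dynamically granted by a node manager

def bind(pid: int, zone: set, share: bool = False) -> None:
    """Pin `pid` to a zone, optionally extending it with the shared zone."""
    cpus = zone | (SHARED_ZONE if share else set())
    cpus &= os.sched_getaffinity(0)              # keep only CPUs this host has
    if cpus:
        os.sched_setaffinity(pid, cpus)

if __name__ == "__main__":
    bind(os.getpid(), BE_ZONE, share=True)       # e.g. a batch job allowed to spill
    print(os.sched_getaffinity(os.getpid()))
```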
{"title":"Adaptive CPU sharing for co-located latency-critical JVM applications and batch jobs under dynamic workloads","authors":"Dishi Xu , Fagui Liu , Bin Wang , Xuhao Tang , Qingbo Wu","doi":"10.1016/j.future.2026.108387","DOIUrl":"10.1016/j.future.2026.108387","url":null,"abstract":"<div><div>Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit inferior cold-start performance, and frequent deletion and creation of application instances to adjust resource allocation results in performance degradation. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs encounters serious challenges due to contention for shared CPU resources. Therefore, we present ChaosRM, a bi-level resource management framework for JVM workload co-location to enhance resource utilization efficiency while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism, utilizing two CPU zones to isolate JLRAs and batch jobs, and an shared region for concurrently executing their threads. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns the shared zone allocation strategy and the performance target represented by thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and up to 54.29% over all baselines.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108387"},"PeriodicalIF":6.2,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

High performance graph-parallel accelerator design
Pub Date: 2026-01-23 | DOI: 10.1016/j.future.2026.108385
Cemil Kaan Akyol, Muhammet Mustafa Ozdal, Ozcan Ozturk
Graph applications are becoming increasingly important given their widespread usage and the amounts of data they handle. Biological and social web graphs are well-known examples that show the importance of efficiently processing graph analytic applications and problems. Due to limited resources, efficiency and performance are much more critical in embedded systems. We propose an efficient source-to-source methodology for graph applications that frees the developer from the low-level details of parallelization and distribution by translating any vertex-centric C++ graph application into a pipelined SystemC model. High-Level Synthesis (HLS) tools can synthesize the generated SystemC model to obtain the hardware design. To support different types of graph applications, we have implemented features such as non-standard application support, active set functionality, asynchronous execution support, conditional pipeline support, non-neighbor data access support, multiple pipeline support, and user-defined data type functionality. Our accelerator development flow can generate better-performing accelerators than OpenCL. Furthermore, it dramatically reduces the design time compared to using HLS tools directly. Therefore, the proposed methodology can generate fast accelerators with minimal effort from a high-level language description provided by the user.
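The framework's input is a vertex-centric C++ program; the Python sketch below only illustrates that programming model, i.e., a per-vertex update computed from neighbor values (a PageRank-style rank here) with iteration handled by the framework. The tiny graph, damping factor, and iteration count are illustrative, not taken from the paper.

```python
def vertex_program(v, in_neighbors, rank, out_deg, d=0.85):
    """Per-vertex update: new rank from ranks flowing in over in-edges."""
    incoming = sum(rank[u] / out_deg[u] for u in in_neighbors[v])
    return (1.0 - d) + d * incoming

def run(in_neighbors, out_deg, iters=20):
    """Iterate the vertex program over all vertices; the generated hardware
    would pipeline exactly this per-vertex computation."""
    rank = {v: 1.0 for v in in_neighbors}
    for _ in range(iters):
        rank = {v: vertex_program(v, in_neighbors, rank, out_deg)
                for v in in_neighbors}
    return rank

# tiny 3-vertex ring: 0 -> 1, 1 -> 2, 2 -> 0
print(run({0: [2], 1: [0], 2: [1]}, {0: 1, 1: 1, 2: 1}))
```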
{"title":"High performance graph-parallel accelerator design","authors":"Cemil Kaan Akyol , Muhammet Mustafa Ozdal , Ozcan Ozturk","doi":"10.1016/j.future.2026.108385","DOIUrl":"10.1016/j.future.2026.108385","url":null,"abstract":"<div><div>Graph applications are becoming increasingly important with their widespread usage and the amounts of data they deal with. Biological and social web graphs are well-known examples that show the importance of efficiently processing graph analytic applications and problems. Due to limited resources, efficiency and performance are much more critical in embedded systems. We propose an efficient source-to-source-based methodology for graph applications that gives the freedom of not knowing the low-level details of parallelization and distribution by translating any vertex-centric C++ graph application into a pipelined SystemC model. High-Level Synthesis (HLS) tools can synthesize the generated SystemC model to obtain the design of the hardware. To support different types of graph applications, we have implemented features like non-standard application support, active set functionality, asynchronous execution support, conditional pipeline support, non-neighbor data access support, multiple pipeline support, and user-defined data type functionality. Our accelerator development flow can generate better-performing accelerators than OpenCL. Furthermore, it dramatically reduces the design time compared to using HLS tools. Therefore, the proposed methodology can generate fast accelerators with minimal effort using a high-level language description from the user.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108385"},"PeriodicalIF":6.2,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Vertical auto-scaling mechanism for elastic memory management of containerized applications in Kubernetes
Pub Date: 2026-01-22 | DOI: 10.1016/j.future.2026.108407
Taeshin Kang, Minwoo Kang, Heonchang Yu
Cloud service providers typically offer containers with fixed resource sizes. However, cloud users often overprovision container resources to prevent service interruptions caused by resource shortages. This practice leads to low utilization of system resources in the cloud. To address this issue, cloud service providers offer container auto-scaling. They primarily support horizontal auto-scaling, which provides horizontal elasticity. However, this approach has limitations in responding promptly to unexpected spikes in resource usage and in optimizing resource utilization. Vertical auto-scaling can help overcome these limitations. Its importance is increasing, particularly for stateful and real-time applications that require immediate resource elasticity. Nevertheless, vertical elasticity remains difficult to achieve and has not been actively researched or widely implemented. This study proposes a vertical auto-scaling mechanism for elastic memory management in container-based applications running in Kubernetes, which is widely recognized as the standard platform for container orchestration. In the proposed approach, high-priority tasks are given priority for scaling up, while tasks that cannot undergo scale-up are suspended using the cgroup freeze feature to prevent further memory allocation. If memory pressure persists and task termination becomes unavoidable, tasks are terminated in ascending order of priority, starting with the lowest. Once memory pressure is relieved, stateful applications are restarted from the point at which they were suspended. Compared to the default Kubernetes environment without vertical elasticity, EVMMv2 reduced the total execution time of stateful applications by up to 40% and improved the request success rate of stateless applications by 37%.
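The suspension step described above relies on the cgroup v2 freezer: writing 1 to a group's cgroup.freeze file stops all tasks in the group so they cannot allocate further memory, and writing 0 thaws them. A minimal sketch; the cgroup path in the usage comments is an example only, and the calls require root on a host with a cgroup v2 hierarchy.

```python
from pathlib import Path

def set_frozen(cgroup: str, frozen: bool) -> None:
    """Freeze or thaw every task in a cgroup v2 group via cgroup.freeze."""
    Path("/sys/fs/cgroup", cgroup, "cgroup.freeze").write_text("1" if frozen else "0")

# Example usage (path is hypothetical):
# suspend a low-priority container's group under memory pressure ...
#   set_frozen("kubepods.slice/low-prio-pod", True)
# ... and thaw it once reclaim or scale-up has relieved the pressure:
#   set_frozen("kubepods.slice/low-prio-pod", False)
```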
{"title":"Vertical auto-scaling mechanism for elastic memory management of containerized applications in Kubernetes","authors":"Taeshin Kang, Minwoo Kang, Heonchang Yu","doi":"10.1016/j.future.2026.108407","DOIUrl":"10.1016/j.future.2026.108407","url":null,"abstract":"<div><div>Cloud service providers typically offer containers with fixed resource sizes. However, cloud users often overprovision container resources to prevent service interruptions caused by resource shortages. This practice leads to low utilization of system resources in the cloud. To address this issue, cloud service providers offer container auto-scaling. They primarily support horizontal auto-scaling, which provides horizontal elasticity. However, this approach has limitations in responding promptly to unexpected spikes in resource usage and in optimizing resource utilization. Vertical auto-scaling can help overcome these limitations. Its importance is increasing, particularly for stateful and real-time applications that require immediate resource elasticity. Nevertheless, vertical elasticity remains difficult to achieve and has not been actively researched or widely implemented. This study proposes a vertical auto-scaling mechanism for elastic memory management in container-based applications running in Kubernetes, which is widely recognized as the standard platform for container orchestration. In the proposed approach, high-priority tasks are given priority for scaling up, while tasks that cannot undergo scale-up are suspended using the <em>cgroup freeze</em> feature to prevent further memory allocation. If memory pressure persists and task termination becomes unavoidable, tasks are terminated in ascending order of priority, starting with the lowest. Once memory pressure is relieved, stateful applications are restarted from the point at which they were suspended. Compared to the default Kubernetes environment without vertical elasticity, EVMMv2 reduced the total execution time of stateful applications by up to 40% and improved the request success rate of stateless applications by 37%.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108407"},"PeriodicalIF":6.2,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}