Online 3D trajectory and resource optimization for dynamic UAV-assisted MEC systems
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108389
Zhao Tong , Shiyan Zhang , Jing Mei , Can Wang , Keqin Li
The integration and development of unmanned aerial vehicles (UAVs) and mobile edge computing (MEC) technology provide users with more flexible, reliable, and high-quality computing services. However, most UAV-assisted MEC designs focus mainly on static environments, which do not reflect the practical scenarios considered in this work. In this paper, we consider a UAV-assisted MEC platform that can provide continuous services for multiple mobile ground users with random movements and task arrivals. Moreover, we investigate the long-term system utility maximization problem in UAV-assisted MEC systems, considering continuous task offloading, user mobility, the UAV’s 3D trajectory control, and resource allocation. To address the challenges of limited system information, high-dimensional continuous actions, and state space approximation, we propose an Online decision-making algorithm for Dynamic environments based on Exploration-enhanced Greedy DDPG (ODEGD). Additionally, to evaluate the algorithm’s performance more accurately, we introduce real-world roads into the experiments. Experimental results show that the proposed algorithm reduces response delay by 26.98% and energy consumption by 22.61% compared with other algorithms, while achieving the highest system utility. These results validate the applicability of the ODEGD algorithm under dynamic conditions, demonstrating its good robustness and scalability.
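The abstract does not spell out the ODEGD update rules, but the exploration-enhanced greedy action selection it builds on can be illustrated with a minimal sketch. The toy actor, action dimensionality, epsilon, and noise scale below are assumptions for illustration only, not the paper's actual design.

```python
import numpy as np

# Minimal sketch (not the paper's implementation): exploration-enhanced greedy
# action selection for a DDPG-style continuous controller. The action vector is
# assumed to pack the UAV's 3D velocity plus per-user offloading/resource shares.

rng = np.random.default_rng(0)

def toy_actor(state):
    """Stand-in for a trained deterministic actor network (assumption)."""
    return np.tanh(state[:5])                       # 5-dim action in [-1, 1]

def select_action(state, actor, epsilon=0.1, noise_std=0.05, low=-1.0, high=1.0):
    """With probability epsilon explore uniformly; otherwise act greedily w.r.t.
    the actor, add small Gaussian noise, and clip to the feasible action box."""
    if rng.random() < epsilon:
        return rng.uniform(low, high, size=5)       # exploratory action
    action = actor(state) + rng.normal(0.0, noise_std, size=5)
    return np.clip(action, low, high)

state = rng.standard_normal(8)                      # toy system state (queues, positions, ...)
print(select_action(state, toy_actor))
```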
{"title":"Online 3D trajectory and resource optimization for dynamic UAV-assisted MEC systems","authors":"Zhao Tong , Shiyan Zhang , Jing Mei , Can Wang , Keqin Li","doi":"10.1016/j.future.2026.108389","DOIUrl":"10.1016/j.future.2026.108389","url":null,"abstract":"<div><div>The integration and development of unmanned aerial vehicles (UAVs) and mobile edge computing (MEC) technology provide users with more flexible, reliable, and high-quality computing services. However, most UAV-assisted MEC model designs mainly focus on static environments, which do not apply to the practical scenarios considered in this work. In this paper, we consider a UAV-assisted MEC platform, which can provide continuous services for multiple mobile ground users with random movements and task arrivals. Moreover, we investigate the long-term system utility maximization problem in UAV-assisted MEC systems, considering continuous task offloading, users’ mobility, UAV’s 3D trajectory control, and resource allocation. To address the challenges of limited system information, high-dimensional continuous actions, and state space approximation, we propose an <u>O</u>nline decision-making algorithm for <u>D</u>ynamic environments based on <u>E</u>xploration-enhanced <u>G</u>reedy <u>D</u>DPG (ODEGD). Additionally, to more accurately evaluate the algorithm’s performance, we introduced real-world roads into the experiment. Experimental results show that the proposed algorithm reduces response delay by 26.98% and energy consumption by 22.61% compared to other algorithms, while achieving the highest system utility. These results validate the applicability of the ODEGD algorithm under dynamic conditions, demonstrating its good robustness and scalability.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108389"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146047991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108383
Aleix Boné , Alejandro Aguirre , David Álvarez , Pedro J. Martinez-Ferrer , Vicenç Beltran
Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some cases, these can be combined with optimized vendor math libraries such as cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms. Combining them within a single application is therefore error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.
Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime.
The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable, in both execution time and memory footprint, to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems. On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation.
These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.
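As a language-neutral illustration of the data-flow abstraction described above, the sketch below expresses a tiny pipeline as a DAG of host tasks and stand-in "kernel" tasks ordered only by their data dependencies. The paper's actual runtime is OpenMP/OmpSs-2 with TASYCL/TACUDA and nOS-V; the plain Python executor here is purely conceptual.

```python
from graphlib import TopologicalSorter

# Conceptual sketch only: host tasks and device-kernel launches become ordinary
# tasks, and execution order is derived purely from their data dependencies.

results = {}

def host_init():
    results["x"] = [1.0, 2.0, 3.0]

def kernel_scale():                       # would be a CUDA/SYCL kernel launch
    results["y"] = [2.0 * v for v in results["x"]]

def kernel_sum():                         # would be a cuBLAS/oneAPI library call
    results["s"] = sum(results["y"])

def host_report():
    print("reduction result:", results["s"])

# DAG: each task lists the tasks whose outputs it consumes.
dag = {
    kernel_scale: {host_init},
    kernel_sum: {kernel_scale},
    host_report: {kernel_sum},
}

for task in TopologicalSorter(dag).static_order():
    task()                                # a real runtime would overlap independent tasks
```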
{"title":"A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs","authors":"Aleix Boné , Alejandro Aguirre , David Álvarez , Pedro J. Martinez-Ferrer , Vicenç Beltran","doi":"10.1016/j.future.2026.108383","DOIUrl":"10.1016/j.future.2026.108383","url":null,"abstract":"<div><div>Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some occasions they can be combined with optimized vendor math libraries: e.g., cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms. Combining them within a single application is therefore error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.</div><div>Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime.</div><div>The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable —in both execution time and memory footprint— to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems. On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation.</div><div>These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108383"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146048039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MoFormer: A centrality-aware multi-task graph transformer with multi-gate mixture-of-experts for link-level network performance modeling
Pub Date: 2026-01-24 | DOI: 10.1016/j.future.2026.108406
Hanlin Liu , Aliya Bao , Mingyue Li , Yintan Ai , Hua Li
Link-level network performance modeling (NPM) facilitates efficient traffic control, precise fault localization, and reliable resource management in emerging network paradigms such as Software-Defined Networking and Intent-Based Networking. A variety of models, such as Long Short-Term Memory and Graph Neural Networks (GNNs), have been utilized to enhance the effectiveness of NPM. However, a practical NPM model requires the generalization ability to adapt to diverse network topologies and prediction tasks without retraining. To meet this requirement, graph Transformer models offer a breakthrough: by encoding nodes and their structural features into tokens, they break free from the dependence on fixed graph structures typical of traditional GNNs. Nevertheless, they mostly focus on node-centric representations, which are insufficient to capture the fine-grained interactions and dependencies between links, thus limiting their applicability in link-level NPM. In this paper, we propose a centrality-aware multi-task graph Transformer with multi-gate mixture-of-experts (MMoE), named MoFormer, for link-level NPM. Specifically, a link-centric tokenized graph representation method is proposed to transform each link and its neighborhood information into a sequence of tokens guided by the routing protocol. A routing-aware betweenness centrality encoding mechanism is further developed to enhance the ability to characterize the tokens, considering the relative importance of each link. MoFormer combines MMoE with the Transformer to enable joint learning of multiple prediction tasks. Experimental results on both simulated and real-world datasets demonstrate significant improvements of MoFormer over existing state-of-the-art baselines while maintaining superior generalization ability.
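MoFormer's exact tokenization and encoding cannot be reproduced from the abstract alone, but the link-centric idea, turning each link and its neighboring links into a token sequence weighted by a betweenness-style centrality, can be sketched roughly as follows. The sample graph, the neighborhood definition, and the use of plain edge betweenness (rather than the routing-aware variant) are assumptions.

```python
import networkx as nx

# Illustrative sketch (not MoFormer itself): build a per-link token sequence from
# the links incident to its endpoints, and attach an edge-betweenness score as a
# rough stand-in for the routing-aware centrality encoding described above.

G = nx.les_miserables_graph()                      # any connected graph will do
ebc = nx.edge_betweenness_centrality(G)            # shortest-path-based centrality

def link_tokens(G, u, v, max_tokens=8):
    """Token sequence for link (u, v): the link itself followed by neighboring
    links sharing an endpoint, each tagged with its centrality score."""
    neighbors = list(G.edges(u)) + list(G.edges(v))
    neighbors = [tuple(sorted(e)) for e in neighbors if set(e) != {u, v}]
    seq = [tuple(sorted((u, v)))] + sorted(set(neighbors))[:max_tokens - 1]
    return [(a, b, round(ebc[(a, b)] if (a, b) in ebc else ebc[(b, a)], 4))
            for a, b in seq]

u, v = next(iter(G.edges()))
for token in link_tokens(G, u, v):
    print(token)
```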
{"title":"MoFormer: A centrality-aware multi-task graph transformer with multi-gate mixture-of-experts for link-level network performance modeling","authors":"Hanlin Liu , Aliya Bao , Mingyue Li , Yintan Ai , Hua Li","doi":"10.1016/j.future.2026.108406","DOIUrl":"10.1016/j.future.2026.108406","url":null,"abstract":"<div><div>Link-level network performance modeling (NPM) facilitates efficient traffic control, precise fault localization, and reliable resource management in emerging network paradigms such as Software-Defined Networking and Intent-Based Networking. A variety of models, such as Long Short-Term Memory and Graph Neural Networks (GNNs), are utilized to enhance the effectiveness of NPM. However, a practical NPM requires the generalization ability to adapt to diverse network topologies and prediction tasks without retraining. To meet this requirement, graph Transformer models are a breakthrough by encoding nodes and their structural features into tokens, breaking free from the dependencies on fixed graph structures typical of traditional GNNs. Nevertheless, they mostly focus on node-centric representations, which are insufficient to capture the fine-grained interactions and dependencies between links, thus limiting their applicability in link-level NPM. In this paper, we propose a centrality-aware multi-task graph Transformer with multi-gate mixture-of-experts (MMoE), named MoFormer, for link-level NPM. Specifically, a link-centric tokenized graph representation method is proposed to transform each link and its neighborhood information into a sequence of tokens guided by the routing protocol. A routing-aware betweenness centrality encoding mechanism is further developed to enhance the ability to characterize the tokens considering the relative importance of each link. MoFormer takes advantage of MMoE combined with Transformer to enable joint learning of multiple prediction tasks. Experimental results on both simulated and real-world datasets demonstrate the significant improvements of MoFormer over existing state-of-the-art baselines while maintaining superior generalization ability.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108406"},"PeriodicalIF":6.2,"publicationDate":"2026-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146047990","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adaptive CPU sharing for co-located latency-critical JVM applications and batch jobs under dynamic workloads
Pub Date: 2026-01-23 | DOI: 10.1016/j.future.2026.108387
Dishi Xu , Fagui Liu , Bin Wang , Xuhao Tang , Qingbo Wu
Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit inferior cold-start performance, and frequent deletion and creation of application instances to adjust resource allocation results in performance degradation. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs encounters serious challenges due to contention for shared CPU resources. Therefore, we present ChaosRM, a bi-level resource management framework for JVM workload co-location that enhances resource utilization while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism, utilizing two CPU zones to isolate JLRAs and batch jobs, and a shared zone for concurrently executing their threads. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns the shared-zone allocation strategy and the performance target represented by thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and up to 54.29% over all baselines.
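A rough sketch of the tri-zone idea follows: CPUs are split into an LC zone, a BE zone, and a shared zone, and batch work is admitted to the shared zone only while the observed thread-queuing time stays below the learned target. The zone sizes, the placement heuristic, and all names are illustrative assumptions, not ChaosRM's actual policy.

```python
from dataclasses import dataclass

# Illustrative sketch of the tri-zone idea (not ChaosRM's implementation): batch
# threads may borrow the shared zone only while JLRA queuing time is under target.

@dataclass
class Zones:
    lc: set        # CPUs reserved for latency-critical JVM applications
    be: set        # CPUs reserved for best-effort batch jobs
    shared: set    # CPUs either class may use, gated by the queuing-time target

def batch_cpu_set(zones, n_threads, queuing_time_ms, target_ms):
    """Pick the CPU set for a new batch job given the observed JLRA queuing time."""
    cpus = set(zones.be)
    if queuing_time_ms < target_ms:        # JLRAs have headroom: borrow the shared zone
        cpus |= zones.shared
    return set(sorted(cpus)[:max(n_threads, 1)])   # naive spread over available CPUs

zones = Zones(lc={0, 1, 2, 3}, be={4, 5}, shared={6, 7})
print(batch_cpu_set(zones, n_threads=4, queuing_time_ms=0.8, target_ms=1.5))  # borrows shared
print(batch_cpu_set(zones, n_threads=4, queuing_time_ms=2.4, target_ms=1.5))  # BE zone only
```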
{"title":"Adaptive CPU sharing for co-located latency-critical JVM applications and batch jobs under dynamic workloads","authors":"Dishi Xu , Fagui Liu , Bin Wang , Xuhao Tang , Qingbo Wu","doi":"10.1016/j.future.2026.108387","DOIUrl":"10.1016/j.future.2026.108387","url":null,"abstract":"<div><div>Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit inferior cold-start performance, and frequent deletion and creation of application instances to adjust resource allocation results in performance degradation. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs encounters serious challenges due to contention for shared CPU resources. Therefore, we present ChaosRM, a bi-level resource management framework for JVM workload co-location to enhance resource utilization efficiency while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism, utilizing two CPU zones to isolate JLRAs and batch jobs, and an shared region for concurrently executing their threads. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns the shared zone allocation strategy and the performance target represented by thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and up to 54.29% over all baselines.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108387"},"PeriodicalIF":6.2,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High performance graph-parallel accelerator design
Pub Date: 2026-01-23 | DOI: 10.1016/j.future.2026.108385
Cemil Kaan Akyol , Muhammet Mustafa Ozdal , Ozcan Ozturk
Graph applications are becoming increasingly important with their widespread usage and the amounts of data they deal with. Biological and social web graphs are well-known examples that show the importance of efficiently processing graph analytic applications and problems. Due to limited resources, efficiency and performance are much more critical in embedded systems. We propose an efficient source-to-source methodology for graph applications that frees developers from the low-level details of parallelization and distribution by translating any vertex-centric C++ graph application into a pipelined SystemC model. High-Level Synthesis (HLS) tools can then synthesize the generated SystemC model to obtain the hardware design. To support different types of graph applications, we have implemented features such as non-standard application support, active set functionality, asynchronous execution support, conditional pipeline support, non-neighbor data access support, multiple pipeline support, and user-defined data type functionality. Our accelerator development flow can generate better-performing accelerators than an OpenCL-based flow. Furthermore, it dramatically reduces the design time compared to using HLS tools directly. Therefore, the proposed methodology can generate fast accelerators with minimal effort from a high-level language description provided by the user.
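To make the vertex-centric programming style concrete, here is a minimal gather/apply PageRank written in that style. The actual flow consumes vertex-centric C++ and emits pipelined SystemC; this Python version only mirrors the abstraction the developer writes against.

```python
# Minimal vertex-centric PageRank sketch, illustrating the programming style the
# flow consumes (the real tool takes vertex-centric C++ and emits SystemC).

def vertex_centric_pagerank(out_edges, d=0.85, iters=20):
    n = len(out_edges)
    rank = {v: 1.0 / n for v in out_edges}
    in_edges = {v: [] for v in out_edges}
    for u, outs in out_edges.items():
        for v in outs:
            in_edges[v].append(u)
    for _ in range(iters):
        # gather: each vertex pulls contributions from its in-neighbors
        contrib = {v: sum(rank[u] / len(out_edges[u]) for u in in_edges[v])
                   for v in out_edges}
        # apply: each vertex updates its own state independently
        rank = {v: (1 - d) / n + d * contrib[v] for v in out_edges}
    return rank

graph = {0: [1, 2], 1: [2], 2: [0], 3: [2]}
print(vertex_centric_pagerank(graph))
```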
{"title":"High performance graph-parallel accelerator design","authors":"Cemil Kaan Akyol , Muhammet Mustafa Ozdal , Ozcan Ozturk","doi":"10.1016/j.future.2026.108385","DOIUrl":"10.1016/j.future.2026.108385","url":null,"abstract":"<div><div>Graph applications are becoming increasingly important with their widespread usage and the amounts of data they deal with. Biological and social web graphs are well-known examples that show the importance of efficiently processing graph analytic applications and problems. Due to limited resources, efficiency and performance are much more critical in embedded systems. We propose an efficient source-to-source-based methodology for graph applications that gives the freedom of not knowing the low-level details of parallelization and distribution by translating any vertex-centric C++ graph application into a pipelined SystemC model. High-Level Synthesis (HLS) tools can synthesize the generated SystemC model to obtain the design of the hardware. To support different types of graph applications, we have implemented features like non-standard application support, active set functionality, asynchronous execution support, conditional pipeline support, non-neighbor data access support, multiple pipeline support, and user-defined data type functionality. Our accelerator development flow can generate better-performing accelerators than OpenCL. Furthermore, it dramatically reduces the design time compared to using HLS tools. Therefore, the proposed methodology can generate fast accelerators with minimal effort using a high-level language description from the user.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108385"},"PeriodicalIF":6.2,"publicationDate":"2026-01-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vertical auto-scaling mechanism for elastic memory management of containerized applications in Kubernetes
Pub Date: 2026-01-22 | DOI: 10.1016/j.future.2026.108407
Taeshin Kang, Minwoo Kang, Heonchang Yu
Cloud service providers typically offer containers with fixed resource sizes. However, cloud users often overprovision container resources to prevent service interruptions caused by resource shortages. This practice leads to low utilization of system resources in the cloud. To address this issue, cloud service providers offer container auto-scaling, primarily horizontal auto-scaling, which provides horizontal elasticity. However, this approach has limitations in responding promptly to unexpected spikes in resource usage and in optimizing resource utilization. Vertical auto-scaling can help overcome these limitations. Its importance is increasing, particularly for stateful and real-time applications that require immediate resource elasticity. Nevertheless, vertical elasticity remains difficult to achieve and has not been actively researched or widely implemented. This study proposes a vertical auto-scaling mechanism for elastic memory management in container-based applications running in Kubernetes, which is widely recognized as the standard platform for container orchestration. In the proposed approach, high-priority tasks are prioritized for scaling up, while tasks that cannot undergo scale-up are suspended using the cgroup freeze feature to prevent further memory allocation. If memory pressure persists and task termination becomes unavoidable, tasks are terminated in ascending order of priority, starting with the lowest. Once memory pressure is relieved, stateful applications are restarted from the point at which they were suspended. Compared with the default Kubernetes environment without vertical elasticity, the proposed mechanism, EVMMv2, reduced the total execution time of stateful applications by up to 40% and improved the request success rate of stateless applications by 37%.
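The priority-driven decision logic described above can be sketched as follows: under memory pressure, scale up the highest-priority containers first, freeze (via the cgroup freezer) those that cannot be grown, and terminate from the lowest priority only if pressure persists. The helper names, thresholds, and data layout are assumptions, not EVMMv2 or Kubernetes APIs.

```python
from enum import Enum

# Sketch of the priority-driven decision logic described above; the enum values
# and helper names are illustrative and are not EVMMv2 or Kubernetes APIs.

class MemAction(Enum):
    SCALE_UP = "scale_up"      # grow the container's memory limit in place
    FREEZE = "freeze"          # suspend via the cgroup freezer, no new allocations
    TERMINATE = "terminate"    # last resort, lowest priority first

def plan_actions(containers, free_mem_mb, min_headroom_mb=128):
    """containers: dicts with 'name', 'priority' (higher = more important) and
    'requested_extra_mb'. Returns a {name: MemAction} plan under a simple budget."""
    plan, budget = {}, free_mem_mb
    for c in sorted(containers, key=lambda c: -c["priority"]):
        if c["requested_extra_mb"] <= budget:
            plan[c["name"]] = MemAction.SCALE_UP
            budget -= c["requested_extra_mb"]
        else:
            plan[c["name"]] = MemAction.FREEZE   # cannot grow: freeze instead
    frozen = [c for c in containers if plan[c["name"]] is MemAction.FREEZE]
    if frozen and budget < min_headroom_mb:      # pressure persists: evict one victim
        victim = min(frozen, key=lambda c: c["priority"])
        plan[victim["name"]] = MemAction.TERMINATE
    return plan

pods = [{"name": "db",    "priority": 9, "requested_extra_mb": 512},
        {"name": "api",   "priority": 7, "requested_extra_mb": 256},
        {"name": "batch", "priority": 1, "requested_extra_mb": 1024}]
print(plan_actions(pods, free_mem_mb=600))
```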
{"title":"Vertical auto-scaling mechanism for elastic memory management of containerized applications in Kubernetes","authors":"Taeshin Kang, Minwoo Kang, Heonchang Yu","doi":"10.1016/j.future.2026.108407","DOIUrl":"10.1016/j.future.2026.108407","url":null,"abstract":"<div><div>Cloud service providers typically offer containers with fixed resource sizes. However, cloud users often overprovision container resources to prevent service interruptions caused by resource shortages. This practice leads to low utilization of system resources in the cloud. To address this issue, cloud service providers offer container auto-scaling. They primarily support horizontal auto-scaling, which provides horizontal elasticity. However, this approach has limitations in responding promptly to unexpected spikes in resource usage and in optimizing resource utilization. Vertical auto-scaling can help overcome these limitations. Its importance is increasing, particularly for stateful and real-time applications that require immediate resource elasticity. Nevertheless, vertical elasticity remains difficult to achieve and has not been actively researched or widely implemented. This study proposes a vertical auto-scaling mechanism for elastic memory management in container-based applications running in Kubernetes, which is widely recognized as the standard platform for container orchestration. In the proposed approach, high-priority tasks are given priority for scaling up, while tasks that cannot undergo scale-up are suspended using the <em>cgroup freeze</em> feature to prevent further memory allocation. If memory pressure persists and task termination becomes unavoidable, tasks are terminated in ascending order of priority, starting with the lowest. Once memory pressure is relieved, stateful applications are restarted from the point at which they were suspended. Compared to the default Kubernetes environment without vertical elasticity, EVMMv2 reduced the total execution time of stateful applications by up to 40% and improved the request success rate of stateless applications by 37%.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108407"},"PeriodicalIF":6.2,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
RoWD: Automated rogue workload detector for HPC security
Pub Date: 2026-01-22 | DOI: 10.1016/j.future.2026.108392
Francesco Antici , Jens Domke , Andrea Bartolini , Zeynep Kiziltan , Satoshi Matsuoka
The increasing reliance on High-Performance Computing (HPC) systems to execute complex scientific and industrial workloads raises significant security concerns related to the misuse of HPC resources for unauthorized or malicious activities. Rogue job executions can threaten the integrity, confidentiality, and availability of HPC infrastructures. Given the scale and heterogeneity of HPC job submissions, manual or ad hoc monitoring is inadequate to effectively detect such misuse. Therefore, automated solutions capable of systematically analyzing job submissions are essential to detect rogue workloads. To address this challenge, we present RoWD (Rogue Workload Detector), the first framework for automated and systematic security screening of the HPC job-submission pipeline. RoWD is composed of modular plug-ins that classify different types of workloads and enable the detection of rogue jobs through the analysis of job scripts and associated metadata. We deploy RoWD on the Supercomputer Fugaku to classify AI workloads and release SCRIPT-AI, the first dataset of annotated job scripts labeled with workload characteristics. We evaluate RoWD on approximately 50K previously unseen jobs executed on Fugaku between 2021 and 2025. Our results show that RoWD accurately classifies AI jobs (achieving an F1 score of 95%), is robust against adversarial behavior, and incurs low runtime overhead, making it suitable for strengthening the security of HPC environments and for real-time deployment in production systems.
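RoWD's plug-ins and the SCRIPT-AI feature set are not detailed in the abstract, so the sketch below only illustrates the plug-in style of screening: a toy classifier that scans a submitted job script for AI-workload indicators. The patterns and interface are assumptions for illustration.

```python
import re

# Toy plug-in sketch: RoWD's real plug-ins and features are not reproduced here;
# this only illustrates a pluggable "screen the submitted job script" interface.

class AIWorkloadPlugin:
    """Flags job scripts that look like AI training/inference workloads."""
    PATTERNS = [r"\btorch\b", r"tensorflow", r"\bdeepspeed\b",
                r"horovod", r"--gpus?\b", r"\bpython3?\b.*train"]

    def classify(self, script_text):
        hits = [p for p in self.PATTERNS if re.search(p, script_text, re.IGNORECASE)]
        return {"label": "AI" if hits else "non-AI", "evidence": hits}

def screen(job_script, plugins):
    """Run every registered plug-in over one submitted job script."""
    return {type(p).__name__: p.classify(job_script) for p in plugins}

script = """#!/bin/bash
#PJM -L node=4
module load cuda
python3 train.py --gpus 16 --model gpt2
"""
print(screen(script, [AIWorkloadPlugin()]))
```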
{"title":"RoWD: Automated rogue workload detector for HPC security","authors":"Francesco Antici , Jens Domke , Andrea Bartolini , Zeynep Kiziltan , Satoshi Matsuoka","doi":"10.1016/j.future.2026.108392","DOIUrl":"10.1016/j.future.2026.108392","url":null,"abstract":"<div><div>The increasing reliance on High-Performance Computing (HPC) systems to execute complex scientific and industrial workloads raises significant security concerns related to the misuse of HPC resources for unauthorized or malicious activities. Rogue job executions can threaten the integrity, confidentiality, and availability of HPC infrastructures. Given the scale and heterogeneity of HPC job submissions, manual or ad hoc monitoring is inadequate to effectively detect such misuse. Therefore, automated solutions capable of systematically analyzing job submissions are essential to detect rogue workloads. To address this challenge, we present RoWD (Rogue Workload Detector), the first framework for automated and systematic security screening of the HPC job-submission pipeline. RoWD is composed of modular plug-ins that classify different types of workloads and enable the detection of rogue jobs through the analysis of job scripts and associated metadata. We deploy RoWD on the Supercomputer Fugaku to classify AI workloads and release SCRIPT-AI, the first dataset of annotated job scripts labeled with workload characteristics. We evaluate RoWD on approximately 50K previously unseen jobs executed on Fugaku between 2021 and 2025. Our results show that RoWD accurately classifies AI jobs (achieving an F1 score of 95%), is robust against adversarial behavior, and incurs low runtime overhead, making it suitable for strengthening the security of HPC environments and for real-time deployment in production systems.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108392"},"PeriodicalIF":6.2,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033300","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quantum-resistant blockchain architecture for secure vehicular networks: A ML-KEM-enabled approach with PoA and PoP consensus
Pub Date: 2026-01-22 | DOI: 10.1016/j.future.2026.108391
Muhammad Asim , Wu Junsheng , Li Weigang , Lin Zhijun , Zhang Peng , He Hao , Wei Dong , Ghulam Mohi-ud-Din
The increasing interconnectivity within modern transportation ecosystems, a cornerstone of Intelligent Transportation Systems (ITS), creates critical vulnerabilities, demanding stronger security measures to prevent unauthorized access to vehicles and private data. Existing blockchain implementations for Vehicular Ad Hoc Networks (VANETs) are fundamentally flawed, exhibiting inefficiency with traditional consensus mechanisms, vulnerability to quantum attacks, or often both. To overcome these critical limitations, this study introduces a novel Quantum-Resistant Blockchain Architecture. The core objectives are to achieve highly efficient vehicular data storage, ensure robust confidentiality through post-quantum cryptography, and automate secure transactions. The proposed methodology employs a dual-blockchain structure: a Registration Blockchain (RBC) using Proof-of-Authority (PoA) for secure identity management, and a Message Blockchain (MBC) using Proof-of-Position (PoP) for low-latency message dissemination. A key innovation is the integration of smart contracts with the NIST-approved Module Lattice-Based Key Encapsulation Mechanism (ML-KEM) to automate and secure all processes. The framework is rigorously evaluated using a realistic 5G-VANET Multi-access Edge Computing (MEC) dataset, which includes key parameters such as vehicle ID, speed, and location. The results are compelling, demonstrating an Average Block Processing Time of 0.0326 s and a Transactional Throughput of 30.64 TPS, significantly outperforming RSA and AES benchmarks. This research’s primary contribution is a comprehensive framework that substantially improves data security and scalability while future-proofing VANETs against the imminent and evolving threat of quantum computing.
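A structural sketch of the dual-chain layout is given below. Since no specific ML-KEM library is assumed, the key-encapsulation calls are deliberately replaced by clearly labeled placeholders that only mimic the keypair/encapsulate shape; the block format, fields, and consensus steps are likewise illustrative, not the paper's protocol.

```python
import hashlib, json, os, time

# Structural sketch only: the paper uses NIST ML-KEM (FIPS 203) inside smart
# contracts; no real ML-KEM library is assumed here, so `toy_kem_*` below is a
# hypothetical stand-in that only mimics the keypair/encapsulate shape.

def toy_kem_keypair():
    sk = os.urandom(32)
    pk = hashlib.sha256(sk).digest()
    return pk, sk

def toy_kem_encapsulate(pk):
    shared = os.urandom(32)                       # placeholder shared secret
    ct = hashlib.sha256(pk + shared).digest()     # placeholder ciphertext
    return ct, shared

def make_block(prev_hash, payload):
    body = {"prev": prev_hash, "time": time.time(), "payload": payload}
    h = hashlib.sha256(json.dumps(body, sort_keys=True).encode()).hexdigest()
    return {**body, "hash": h}

# Registration chain (PoA authorities would sign these) records the vehicle key.
pk, sk = toy_kem_keypair()
rbc = [make_block("0" * 64, {"vehicle": "V42", "kem_pk": pk.hex()})]

# Message chain (PoP) carries encapsulated session material alongside V2X data.
ct, shared = toy_kem_encapsulate(pk)
mbc = [make_block("0" * 64, {"from": "V42", "kem_ct": ct.hex(),
                             "speed_kmh": 63, "pos": [40.71, -74.00]})]
print(rbc[0]["hash"], mbc[0]["hash"], sep="\n")
```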
{"title":"Quantum-resistant blockchain architecture for secure vehicular networks: A ML-KEM-enabled approach with PoA and PoP consensus","authors":"Muhammad Asim , Wu Junsheng , Li Weigang , Lin Zhijun , Zhang Peng , He Hao , Wei Dong , Ghulam Mohi-ud-Din","doi":"10.1016/j.future.2026.108391","DOIUrl":"10.1016/j.future.2026.108391","url":null,"abstract":"<div><div>The increasing interconnectivity within modern transportation ecosystems, a cornerstone of Intelligent Transportation Systems (ITS), creates critical vulnerabilities, demanding stronger security measures to prevent unauthorized access to vehicles and private data. Existing blockchain implementations for Vehicular Ad Hoc Networks (VANETs) are fundamentally flawed, exhibiting inefficiency with traditional consensus mechanisms, vulnerability to quantum attacks, or often both. To overcome these critical limitations, this study introduces a novel Quantum-Resistant Blockchain Architecture. The core objectives are to achieve highly efficient vehicular data storage, ensure robust confidentiality through post-quantum cryptography, and automate secure transactions. The proposed methodology employs a dual-blockchain structure: a Registration Blockchain (RBC) using Proof-of-Authority (PoA) for secure identity management, and a Message Blockchain (MBC) using Proof-of-Position (PoP) for low-latency message dissemination. A key innovation is the integration of smart contracts with the NIST-approved Module Lattice-Based Key Encapsulation Mechanism (ML-KEM) to automate and secure all processes. The framework is rigorously evaluated using a realistic 5G-VANET Multi-access Edge Computing(MEC) dataset, which includes key parameters like vehicle ID, speed, and location. The results are compelling, demonstrating an Average Block Processing Time of 0.0326 s and a Transactional Throughput of 30.64 TPS, significantly outperforming RSA and AES benchmarks. This research’s primary contribution is a comprehensive framework that substantially improves data security and scalability while future-proofing VANETs against the imminent and evolving threat of quantum computing.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108391"},"PeriodicalIF":6.2,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A message-driven system for processing highly skewed graphs
Pub Date: 2026-01-22 | DOI: 10.1016/j.future.2026.108394
Bibrak Qamar Chandio, Maciej Brodowicz, Thomas Sterling
The paper provides a unified co-design of: 1) a non-von Neumann architecture for fine-grain irregular memory computations; 2) a programming and execution model that allows spawning tasks from within the graph vertex data at runtime; 3) language constructs for actions that send work to where the data resides, combining the parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives; and 4) an innovative vertex-centric data structure, using the concept of Rhizomes, that parallelizes both the out-degree and in-degree load of vertex objects across many cores and yet provides a single programming abstraction to the vertex objects. The data structure parallelizes the out-degree load of vertices hierarchically and the in-degree load laterally. The rhizomes communicate internally and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex.
Simulated experimental results show performance gains for BFS, SSSP, and PageRank on large chip sizes for the tested input graph datasets containing highly skewed degree distributions. The improvements come from the ability to express and create fine-grain dynamic computing tasks in the form of actions, from language constructs that aid the compiler to generate code that the runtime system uses to optimally schedule tasks, and from the data structure that shares both the in-degree and out-degree compute workload among memory-processing elements.
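The message-driven style, where an action carries work to the vertex that owns the data, can be illustrated with a small breadth-first search in which every queued message is such an action. The target architecture and rhizome data structure are not modeled here; this is only a sketch of the programming style.

```python
from collections import deque

# Minimal message-driven BFS sketch: an "action" is a message carrying work to the
# vertex that owns the data, echoing the programming style described above (the
# actual system targets a simulated non-von Neumann chip, not Python).

def message_driven_bfs(adj, root):
    level = {v: None for v in adj}
    inbox = deque([(root, 0)])                # (destination vertex, proposed level)
    while inbox:
        v, lvl = inbox.popleft()              # deliver the action at vertex v
        if level[v] is not None and level[v] <= lvl:
            continue                          # stale action: vertex already closer
        level[v] = lvl
        for w in adj[v]:                      # diffuse new actions along out-edges
            inbox.append((w, lvl + 1))
    return level

adj = {0: [1, 2], 1: [3], 2: [3, 4], 3: [5], 4: [5], 5: []}
print(message_driven_bfs(adj, 0))
```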
{"title":"A message-driven system for processing highly skewed graphs","authors":"Bibrak Qamar Chandio, Maciej Brodowicz, Thomas Sterling","doi":"10.1016/j.future.2026.108394","DOIUrl":"10.1016/j.future.2026.108394","url":null,"abstract":"<div><div>The paper provides a unified co-design of: 1) a non-Von Neumann architecture for fine-grain irregular memory computations, 2) a programming and execution model that allows spawning tasks from within the graph vertex data at runtime, 3) language constructs for <em>actions</em> that send work to where the data resides, combining parallel expressiveness of local control objects (LCOs) to implement asynchronous graph processing primitives, 4) and an innovative vertex-centric data-structure, using the concept of Rhizomes, that parallelizes both the out and in-degree load of vertex objects across many cores and yet provides a single programming abstraction to the vertex objects. The data structure hierarchically parallelizes the out-degree load of vertices and the in-degree load laterally. The rhizomes internally communicate and remain consistent, using event-driven synchronization mechanisms, to provide a unified and correct view of the vertex.</div><div>Simulated experimental results show performance gains for BFS, SSSP, and Page Rank on large chip sizes for the tested input graph datasets containing highly skewed degree distributions. The improvements come from the ability to express and create fine-grain dynamic computing task in the form of <em>actions</em>, language constructs that aid the compiler to generate code that the runtime system uses to optimally schedule tasks, and the data structure that shares both in and out-degree compute workload among memory-processing elements.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108394"},"PeriodicalIF":6.2,"publicationDate":"2026-01-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033296","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reliability analysis of hardware accelerators for decision tree-based classifier systems
Pub Date: 2026-01-20 | DOI: 10.1016/j.future.2026.108378
Mario Barbareschi , Salvatore Barone , Alberto Bosio , Antonio Emmanuele
The increasing adoption of AI models has driven applications toward the use of hardware accelerators to meet high computational demands and strict performance requirements. Beyond considerations of performance and energy efficiency, explainability and reliability have emerged as pivotal requirements, particularly for critical applications such as automotive, medical, and aerospace systems. Among the various AI models, Decision Tree Ensembles (DTEs) are particularly notable for their high accuracy and explainability. Moreover, they are well-suited to hardware implementation, enabling high performance and improved energy efficiency. However, a frequently overlooked aspect of DTEs is their reliability in the presence of hardware malfunctions. While DTEs are generally regarded as robust by design, owing to their redundancy and voting mechanisms, hardware faults can still have catastrophic consequences. To address this gap, we present an in-depth reliability analysis of two types of DTE hardware accelerators: classical and approximate implementations. Specifically, we conduct a comprehensive fault injection campaign, varying the number of trees involved in the classification task, the approximation technique used, and the tolerated accuracy loss, while evaluating several benchmark datasets. The results of this study demonstrate that approximation techniques have to be carefully designed, as they can significantly impact resilience. However, techniques that target the representation of features and thresholds appear to be better suited for fault tolerance.
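The evaluation idea, injecting faults into the hardware representation of an ensemble and measuring the accuracy impact, can be sketched in a self-contained way as follows. The stump ensemble, synthetic data, 16-bit fixed-point threshold format, and single bit-flip fault model are assumptions; the paper's campaign targets actual synthesized accelerators.

```python
import numpy as np

# Self-contained sketch of the evaluation idea: inject single bit-flips into a
# fixed-point threshold representation of a tiny stump ensemble and measure the
# accuracy impact. Data, ensemble, and the 16-bit format are assumptions.

rng = np.random.default_rng(1)
X = rng.uniform(0, 1, size=(2000, 4))
y = (X[:, 0] + X[:, 2] > 1.0).astype(int)             # simple synthetic labels

# "Train" one stump per feature: threshold at the feature's median.
thresholds = np.median(X, axis=0)

def predict(thresholds_q):                             # majority vote of the stumps
    votes = (X > thresholds_q).astype(int)
    return (votes.mean(axis=1) > 0.5).astype(int)

def to_q16(x):   return np.round(x * 65535).astype(np.uint16)
def from_q16(q): return q.astype(float) / 65535

baseline = (predict(from_q16(to_q16(thresholds))) == y).mean()

drops = []
for _ in range(200):                                   # fault-injection campaign
    q = to_q16(thresholds)
    tree, bit = rng.integers(len(q)), rng.integers(16)
    q[tree] ^= np.uint16(1 << bit)                     # single bit-flip fault
    drops.append(baseline - (predict(from_q16(q)) == y).mean())

print(f"baseline accuracy {baseline:.3f}, mean drop {np.mean(drops):.3f}, "
      f"worst drop {np.max(drops):.3f}")
```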
{"title":"Reliability analysis of hardware accelerators for decision tree-based classifier systems","authors":"Mario Barbareschi , Salvatore Barone , Alberto Bosio , Antonio Emmanuele","doi":"10.1016/j.future.2026.108378","DOIUrl":"10.1016/j.future.2026.108378","url":null,"abstract":"<div><div>The increasing adoption of AI models has driven applications toward the use of hardware accelerators to meet high computational demands and strict performance requirements. Beyond consideration of performance and energy efficiency, explainability and reliability have emerged as pivotal requirements, particularly for critical applications such as automotive, medical, and aerospace systems. Among the various AI models, Decision Tree Ensembles (DTEs) are particularly notable for their high accuracy and explainability. Moreover, they are particularly well-suited for hardware implementations, enabling high-performance and improved energy efficiency. However, a frequently overlooked aspect of DTEs is their reliability in the presence of hardware malfunctions. While DTEs are generally regarded as robust by design, due to their redundancy and voting mechanisms, hardware faults can still have catastrophic consequences. To address this gap, we present an in-depth reliability analysis of two types of DTE hardware accelerators: classical and approximate implementations. Specifically, we conduct a comprehensive fault injection campaign, varying the number of trees involved in the classification task, the approximation technique used, and the tolerated accuracy loss, while evaluating several benchmark datasets. The results of this study demonstrate that approximation techniques have to be carefully designed, as they can significantly impact resilience. However, techniques that target the representation of features and thresholds appear to be better suited for fault tolerance.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108378"},"PeriodicalIF":6.2,"publicationDate":"2026-01-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146014882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}