Pub Date: 2026-07-01, Epub Date: 2026-01-23, DOI: 10.1016/j.future.2026.108387
Dishi Xu, Fagui Liu, Bin Wang, Xuhao Tang, Qingbo Wu
Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit inferior cold-start performance, and frequent deletion and creation of application instances to adjust resource allocation results in performance degradation. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs encounters serious challenges due to contention for shared CPU resources. Therefore, we present ChaosRM, a bi-level resource management framework for JVM workload co-location that enhances resource utilization efficiency while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism: two CPU zones isolate JLRAs and batch jobs, and a shared zone concurrently executes threads from both. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns the shared-zone allocation strategy and the performance target, represented by thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and up to 54.29% over all baselines.
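To make the tri-zone idea concrete, the sketch below partitions a node's cores into a JLRA zone, a batch zone, and a shared zone, and pins processes accordingly with Linux CPU affinity. The zone sizes, helper names, and spill-over condition are illustrative assumptions for this example, not ChaosRM's actual policy.

```python
import os

# Illustrative split of a 16-core node into three zones (assumed sizes).
JLRA_ZONE   = set(range(0, 6))    # CPUs reserved for latency-critical JVM apps
BATCH_ZONE  = set(range(6, 10))   # CPUs reserved for best-effort batch jobs
SHARED_ZONE = set(range(10, 16))  # CPUs both classes may use concurrently

def pin_jlra(pid: int) -> None:
    """Bind a JLRA process to its private zone plus the shared zone."""
    os.sched_setaffinity(pid, JLRA_ZONE | SHARED_ZONE)

def pin_batch(pid: int, allow_shared: bool) -> None:
    """Bind a batch job to the batch zone, optionally spilling into the shared
    zone when the node-level manager judges thread queuing time acceptable."""
    cpus = BATCH_ZONE | (SHARED_ZONE if allow_shared else set())
    os.sched_setaffinity(pid, cpus)
```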
{"title":"Adaptive CPU sharing for co-located latency-critical JVM applications and batch jobs under dynamic workloads","authors":"Dishi Xu , Fagui Liu , Bin Wang , Xuhao Tang , Qingbo Wu","doi":"10.1016/j.future.2026.108387","DOIUrl":"10.1016/j.future.2026.108387","url":null,"abstract":"<div><div>Latency-critical (LC) long-running applications operating on Java Virtual Machines (JLRAs) often rely on substantial CPU over-provisioning to meet Service-Level Objectives (SLOs) under dynamic workloads, leading to significant resource underutilization. Additionally, JLRAs exhibit inferior cold-start performance, and frequent deletion and creation of application instances to adjust resource allocation results in performance degradation. Furthermore, harvesting redundant resources by deploying best-effort (BE) batch jobs alongside JLRAs encounters serious challenges due to contention for shared CPU resources. Therefore, we present ChaosRM, a bi-level resource management framework for JVM workload co-location to enhance resource utilization efficiency while eliminating resource contention. In contrast to the conventional approach of isolating JLRAs and batch jobs on non-overlapping CPU sets, ChaosRM proposes a tri-zone CPU isolation mechanism, utilizing two CPU zones to isolate JLRAs and batch jobs, and an shared region for concurrently executing their threads. An application-wide, learning-based Application Manager adjusts the instance states of JLRAs based on the global workload and adaptively learns the shared zone allocation strategy and the performance target represented by thread queuing time; the Node Manager on each server heuristically binds CPU sets to JLRAs and dynamically schedules batch jobs among CPU zones according to this performance target and the JLRA instance states. Experimental results show that, while guaranteeing the SLOs of JLRAs, ChaosRM reduces the completion time of batch jobs by up to 14.10% over the best-performing baseline and up to 54.29% over all baselines.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108387"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146033294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-10, DOI: 10.1016/j.future.2026.108371
Jing Wang, Wenshi Dan, Ke Yang, Xing Tang, Lingyu Yan
The proliferation of vehicular networks within intelligent transportation systems (ITS) has significantly increased the demand for efficient and adaptive spectrum resource allocation. Spectrum coordination is challenging due to dense vehicle traffic, intensive communication environments, and diversified service requirements. These challenges are particularly acute in Vehicle-to-Everything (V2X) communications, where rapidly changing conditions demand robust solutions. Multi-agent reinforcement learning (MARL) techniques are promising and have been applied to dynamic spectrum access management, but overestimated value functions, unstable policy convergence, and dependence on manually designed rewards limit their practical applicability. This paper presents IRL-D3QN, a new spectrum management framework that combines Inverse Reinforcement Learning (IRL) with a Dueling Double Deep Q-Network (D3QN). The algorithm uses a reward prediction network to derive intrinsic motivation from the agent's interaction with the environment, removing the need for error-prone manual reward design and improving generalization across scenarios. The dueling network design separates the state-value and action-advantage estimates, which stabilizes learning, while double Q-learning reduces overestimation bias. Simulations demonstrate that IRL-D3QN achieves a 7.94% higher Vehicle-to-Infrastructure (V2I) transmission rate and degrades significantly less under heavy communication loads than state-of-the-art RL algorithms. It therefore offers a scalable and self-sufficient solution for dynamic spectrum allocation in next-generation vehicular communication systems.
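The two mechanisms named above, the dueling head and the double-Q target, have standard formulations; the sketch below shows them in PyTorch. Layer sizes, the discount factor, and tensor shapes are placeholders, and this is a generic illustration rather than the IRL-D3QN architecture itself.

```python
import torch
import torch.nn as nn

class DuelingQNet(nn.Module):
    """Dueling head: separate state-value V(s) and advantage A(s,a) streams."""
    def __init__(self, state_dim: int, n_actions: int, hidden: int = 128):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(state_dim, hidden), nn.ReLU())
        self.value = nn.Linear(hidden, 1)
        self.adv = nn.Linear(hidden, n_actions)

    def forward(self, s: torch.Tensor) -> torch.Tensor:
        h = self.trunk(s)
        v, a = self.value(h), self.adv(h)
        # Subtracting the mean advantage keeps Q = V + A identifiable.
        return v + a - a.mean(dim=1, keepdim=True)

def double_q_target(online: DuelingQNet, target: DuelingQNet,
                    r: torch.Tensor, s_next: torch.Tensor,
                    done: torch.Tensor, gamma: float = 0.99) -> torch.Tensor:
    """Double Q-learning: the online net selects the next action, the target
    net evaluates it, which reduces the overestimation bias noted above."""
    with torch.no_grad():
        best_a = online(s_next).argmax(dim=1, keepdim=True)
        q_next = target(s_next).gather(1, best_a).squeeze(1)
        return r + gamma * (1.0 - done) * q_next
```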
{"title":"IRL-D3QN: An intelligent multi-agent learning framework for dynamic spectrum management in vehicular networks","authors":"Jing Wang , Wenshi Dan , Ke Yang , Xing Tang , Lingyu Yan","doi":"10.1016/j.future.2026.108371","DOIUrl":"10.1016/j.future.2026.108371","url":null,"abstract":"<div><div>The proliferation of vehicular networks within intelligent transportation systems (ITS) has significantly increased the demand for efficient and adaptive spectrum resource allocation. Spectrum coordination is challenging due to high vehicle traffic, intensive communication environments and diversified service requirements. These are of particular significance in Vehicle-to-Everything (V2X) communications, where adaptive conditions call out powerful solutions. Multi-agent reinforcement learning (MARL) techniques are promising and have been applied to the management of dynamic spectrum access, but with limitations including overestimated value functions, unsteady policy convergence, and dependence on manual choices of rewards, these techniques have limitations as far as their application in practice. This paper presents a new framework of spectrum management IRL-D3QN, which combines Inverse Reinforcement Learning (IRL) and a Dueling Double Deep Q-Network (D3QN). This algorithm involves a prediction network of rewards on determining intrinsic motivation according to its interplay with environments, eliminating the necessity of a danger of designing rewards manually. This enhances generalization in various situations. The dueling network design contributes to learning that is more stable because it keeps the values of state and values of the action apart. In the meantime, the bias of overestimation is minimized in the case of double q-learning. It has been demonstrated through simulations that IRL-D3QN can support a higher Vehicle to Infrastructure (V2I) transmission rate by 7.94 percent and demonstrate significantly less performance degradation under heavy communication loads than state of the art RL algorithms. Therefore, it will provide a solution to the distribution of dynamic spectrum, which will be scalable and self-sufficient in the next generation of vehicular communication systems.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108371"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145957070","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-24, DOI: 10.1016/j.future.2026.108393
Sandra Catalán, Rafael Rodríguez-Sánchez, Carlos García Sánchez, Luis Piñuel Moreno
This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple's SoC architectures, which integrate multiple computing engines, raising the scientific question of which hardware components are best suited for executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (GEMM). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).
The assessments use GEMM as a benchmark to characterize the performance of the CPU and GPU, alongside tests on the AMX, which is specialized in handling large-scale mathematical operations, and on the ANE, which is specifically designed for deep learning purposes. Additionally, energy consumption data has been collected to analyze the energy efficiency of the aforementioned resources. Results highlight notable improvements in computational capacity and energy efficiency over successive generations. On the one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU's FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing the other accelerators with over 700 GFLOPs/Watt under batched workloads.
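For readers unfamiliar with how such throughput and efficiency figures are typically derived, the sketch below times a GEMM and converts the result into GFLOPS, with GFLOPs/Watt obtained by dividing by an average power reading. The matrix sizes, repetition count, and the external power source are assumptions for illustration, not the benchmark harness used in the paper.

```python
import time
import numpy as np

def gemm_gflops(m: int, n: int, k: int, dtype=np.float32, reps: int = 10) -> float:
    """Time C = A @ B and report throughput; a GEMM performs roughly 2*m*n*k FLOPs."""
    a = np.random.rand(m, k).astype(dtype)
    b = np.random.rand(k, n).astype(dtype)
    a @ b  # warm-up run
    t0 = time.perf_counter()
    for _ in range(reps):
        a @ b
    elapsed = (time.perf_counter() - t0) / reps
    return 2.0 * m * n * k / elapsed / 1e9

# Efficiency in GFLOPs/Watt given an average power reading collected externally
# (e.g. from a system power monitor): efficiency = gflops / avg_power_watts
gflops = gemm_gflops(2048, 2048, 2048)
print(f"{gflops:.1f} GFLOPS")
```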
This analysis offers a clear understanding of how Apple's custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. In addition, a significant contribution of this study is the comprehensive comparative analysis of Apple's accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.
{"title":"A comparative performance and efficiency analysis of Apple’s M architectures: A GEMM case study","authors":"Sandra Catalán , Rafael Rodríguez-Sánchez , Carlos García Sánchez , Luis Piñuel Moreno","doi":"10.1016/j.future.2026.108393","DOIUrl":"10.1016/j.future.2026.108393","url":null,"abstract":"<div><div>This paper evaluates the performance and energy efficiency of Apple processors across multiple ARM-based M-series generations and models (standard and Pro). The study is motivated by the increasing heterogeneity of Apple´s SoC architectures, which integrate multiple computing engines raising the scientific question of which hardware components are best suited for executing general-purpose and domain-specific computations such as the GEneral Matrix Multiply (<span>GEMM</span>). The analysis focuses on four key components: the Central Processing Unit (CPU), the Graphics Processing Unit (GPU), the matrix calculation accelerator (AMX), and the Apple Neural Engine (ANE).</div><div>The assessments use the <span>GEMM</span> as benchmark to characterize the performance of the CPU and GPU, alongside tests on AMX, which is specialized in handling large-scale mathematical operations, and tests on the ANE, which is specifically designed for Deep Learning purposes. Additionally, energy consumption data has been collected to analyze the energy efficiency of the aforementioned resources. Results highlight notable improvements in computational capacity and energy efficiency over successive generations. On one hand, the AMX stands out as the most efficient component for FP32 and FP64 workloads, significantly boosting overall system performance. In the M4 Pro, which integrates two matrix accelerators, it achieves up to 68% of the GPU’s FP32 performance while consuming only 42% of its power. On the other hand, the ANE, although limited to FP16 precision, excels in energy efficiency for low-precision tasks, surpassing other accelerators with over 700 GFLOPs/Watt under batched workloads.</div><div>This analysis offers a clear understanding of how Apple´s custom ARM designs optimize both performance and energy use, particularly in the context of multi-core processing and specialized acceleration units. In addition, a significant contribution of this study is the comprehensive comparative analysis of Apple’s accelerators, which have previously been poorly documented and scarcely studied. The analysis spans different generations and compares the accelerators against both CPU and GPU performance.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108393"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146048040","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-18, DOI: 10.1016/j.future.2026.108377
Asma Naseri Rad, Shaghayegh Vahdat, Ali Afzali-Kusha, Massoud Pedram
This paper proposes an approximate floating-point (FP) multiplier, called AFMIS, which is based on input segmentation. The AFMIS multiplier statically divides the input mantissas into several segments and performs exact multiplication on the selected segments. This approach eliminates the need for a costly leading-one detector (LOD) circuit. The static segmentation and limited segment count in the proposed design reduce the number of required post-multiplication shift values. With only a few possible shifts, a simple multiplexer can replace a full shifter. This substitution improves speed compared with dynamic segmentation approaches. The proposed structure allows for adjustable accuracy levels by modifying the number of bits in each segment, making it suitable for a wide range of applications. To evaluate the efficiency of the AFMIS multiplier, its hardware parameters are compared to those of an exact FP multiplier and several other approximate FP multipliers. The comparison is performed using Synopsys Design Compiler in a 7 nm technology. The results show that the proposed multiplier achieves a mean relative error distance (MRED) of 0.27% to 18.6% while improving delay, area, and power consumption by up to 81.7%, 98%, and 99%, respectively, compared to the exact FP multiplier. Furthermore, the AFMIS multiplier outperforms other approximate FP multipliers in terms of speed, area, and energy consumption at similar accuracy levels. The utility of the AFMIS multiplier is demonstrated by its application in regression and classification tasks using neural networks (NNs) and JPEG compression. The results indicate that, in most cases, the output differences between the AFMIS multiplier and the exact multiplier are negligible.
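To illustrate the general flavor of static mantissa segmentation and the MRED metric, the sketch below multiplies only the top segment of each 24-bit mantissa and measures the mean relative error distance over random inputs. The segment width, the single-segment selection, and the error-sampling loop are simplifications made for this example; they do not reproduce the AFMIS hardware design.

```python
import random

def approx_mantissa_mult(ma: int, mb: int, width: int = 24, seg: int = 8) -> int:
    """Multiply only the top `seg` bits of each `width`-bit mantissa
    (static segmentation), then shift the partial product back into place."""
    hi_a = ma >> (width - seg)
    hi_b = mb >> (width - seg)
    return (hi_a * hi_b) << (2 * (width - seg))

def relative_error_distance(exact: int, approx: int) -> float:
    return abs(exact - approx) / exact if exact else 0.0

# Estimate MRED over random normalized mantissas (implicit leading 1 set).
samples = [(random.getrandbits(24) | (1 << 23), random.getrandbits(24) | (1 << 23))
           for _ in range(10_000)]
mred = sum(relative_error_distance(a * b, approx_mantissa_mult(a, b))
           for a, b in samples) / len(samples)
print(f"MRED for an 8-bit segment: {mred:.4%}")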
{"title":"AFMIS: An approximate floating-point multiplier based on input segmentation","authors":"Asma Naseri Rad , Shaghayegh Vahdat , Ali Afzali-Kusha , Massoud Pedram","doi":"10.1016/j.future.2026.108377","DOIUrl":"10.1016/j.future.2026.108377","url":null,"abstract":"<div><div>This paper proposes an approximate floating-point (FP) multiplier, called AFMIS, which is based on input segmentation. The AFMIS multiplier statically divides the input mantissas into several segments and performs exact multiplication on the selected segments. This approach eliminates the need for a costly leading-one detector (LOD) circuit. The static segmentation and limited segment count in the proposed design reduce the number of required post-multiplication shift values. With only a few possible shifts, a simple multiplexer can replace a full shifter. This substitution improves speed compared with that of dynamic segmentation approaches. The proposed structure allows for adjustable accuracy levels by modifying the number of bits in each segment, making it suitable for a wide range of applications. To evaluate the efficiency of the AFMIS multiplier, its hardware parameters are compared to those of an exact FP multiplier and several other approximate FP multipliers. The comparison is performed using Synopsys Design Compiler in a 7 nm technology. The results show that the proposed multiplier achieves a mean relative error distance (MRED) of 0.27% to 18.6% while improving delay, area, and power consumption by up to 81.7%, 98%, and 99%, respectively, compared to the exact FP multiplier. Furthermore, the AFMIS multiplier outperforms other approximate FP multipliers in terms of speed, area, and energy consumption at similar accuracy levels. The utility of the AFMIS multiplier is demonstrated by its application in regression and classification tasks using neural networks (NNs) and JPEG compression. The results indicate that, in most cases, the output differences between the AFMIS multiplier and the exact multiplier are negligible.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108377"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145995521","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, DOI: 10.1016/j.future.2026.108372
Zhenyu Qian, Lianguo Wang, Pengfei Zhang, Jianing Rao
Graphics Processing Units (GPUs) are increasingly used in safety-critical systems where Silent Data Corruptions (SDCs) pose severe risks. Selective Instruction Duplication (SID) can mitigate these risks but relies on accurate static-instruction vulnerability assessment, which is complicated by variations in input values and sizes. This paper presents a comprehensive study of how input characteristics shape instruction-level SDC vulnerability, which we quantify using the Static Instruction Error Probability (SIEP) and the SDC Occurrence rate (SDCO). We extend gpuFI-4 to enable fault injection mapping at the static-instruction level. Across 14 benchmarks and more than ten million single-, double-, and triple-bit injections, we find that SIEP is largely value-insensitive, whereas SDCO is highly value-sensitive. For register instructions, SDCO remains stable for random and structured-sparse inputs but differs markedly for all-zero, NaN, or denormal inputs. Moreover, when SIEP is size-sensitive, SDCO also tends to exhibit size sensitivity. We further observe that invalid-injection rates decrease with input size and that shared-memory instructions, though few, can contribute disproportionately to SDCs. Leveraging these insights, we propose BiD-Accel, a bi-dimensional, input-aware framework for accelerated static-instruction SDC vulnerability assessment. Its SIEP-driven Descending Order Sort (DOS) method achieves stable SDCO rankings with injections on only 70.4% of instructions on average, compared with 86.2% for the Random Ordering (RO) method, thereby meaningfully reducing assessment cost while preserving ranking fidelity and providing actionable guidance for robust SID under input-varying GPU workloads.
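The following sketch shows how per-static-instruction metrics in the spirit of SIEP and SDCO could be aggregated from fault-injection outcomes, and how a descending-order sort by SIEP could prioritize instructions for further injection. The record format and the exact metric definitions here are assumptions made for illustration; the paper's own definitions and the gpuFI-4 tooling may differ.

```python
from collections import defaultdict
from typing import Iterable, NamedTuple

class InjectionRecord(NamedTuple):
    static_pc: int   # static instruction the injected fault was mapped to
    outcome: str     # "masked", "sdc", "crash", or "invalid"

def per_instruction_metrics(records: Iterable[InjectionRecord]) -> dict:
    """Aggregate outcomes into SIEP- and SDCO-like rates per static instruction.
    Here SIEP is taken as the fraction of valid injections that are not masked,
    and SDCO as the fraction that end in a silent data corruption."""
    counts = defaultdict(lambda: {"valid": 0, "erroneous": 0, "sdc": 0})
    for rec in records:
        if rec.outcome == "invalid":
            continue
        c = counts[rec.static_pc]
        c["valid"] += 1
        if rec.outcome != "masked":
            c["erroneous"] += 1
        if rec.outcome == "sdc":
            c["sdc"] += 1
    return {pc: (c["erroneous"] / c["valid"], c["sdc"] / c["valid"])
            for pc, c in counts.items() if c["valid"]}

def dos_order(metrics: dict) -> list:
    """Descending Order Sort: visit instructions with the highest SIEP first."""
    return sorted(metrics, key=lambda pc: metrics[pc][0], reverse=True)
```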
{"title":"BiD-Accel: Accelerated bidimensional input-aware SDC vulnerability assessment for GPU static instructions","authors":"Zhenyu Qian , Lianguo Wang , Pengfei Zhang , Jianing Rao","doi":"10.1016/j.future.2026.108372","DOIUrl":"10.1016/j.future.2026.108372","url":null,"abstract":"<div><div>Graphics Processing Units (GPUs) are increasingly used in safety-critical systems where Silent Data Corruptions (SDCs) pose severe risks. Selective Instruction Duplication (SID) can mitigate these risks but relies on accurate static-instruction vulnerability assessment, which is complicated by variations in input values and sizes. This paper presents a comprehensive study of how input characteristics shape instruction-level SDC vulnerability, which we quantify using the Static Instruction Error Probability (SIEP) and the SDC Occurrence rate (SDCO). We extend gpuFI-4 to enable fault injection mapping at the static-instruction level. Across 14 benchmarks and more than ten million single-, double-, and triple-bit injections, we find that SIEP is largely value-insensitive, whereas SDCO is highly value-sensitive. For register instructions, SDCO remains stable for random and structured-sparse inputs but differs markedly for all-zero, NaN, or denormal inputs. Moreover, when SIEP is size-sensitive, SDCO also tends to exhibit size sensitivity. We further observe that invalid-injection rates decrease with input size and that shared-memory instructions, though few, can contribute disproportionately to SDCs. Leveraging these insights, we propose BiD-Accel, a bi-dimensional, input-aware framework for accelerated static-instruction SDC vulnerability assessment. Its SIEP-driven Descending Order Sort (DOS) method achieves stable SDCO rankings with injections on only 70.4% of instructions on average, compared with 86.2% for the Random Ordering (RO) method, thereby meaningfully reducing assessment cost while preserving ranking fidelity and providing actionable guidance for robust SID under input-varying GPU workloads.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108372"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145957074","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-20, DOI: 10.1016/j.future.2026.108378
Mario Barbareschi, Salvatore Barone, Alberto Bosio, Antonio Emmanuele
The increasing adoption of AI models has driven applications toward the use of hardware accelerators to meet high computational demands and strict performance requirements. Beyond performance and energy efficiency, explainability and reliability have emerged as pivotal requirements, particularly for critical applications such as automotive, medical, and aerospace systems. Among the various AI models, Decision Tree Ensembles (DTEs) are particularly notable for their high accuracy and explainability. Moreover, they are particularly well-suited for hardware implementations, enabling high performance and improved energy efficiency. However, a frequently overlooked aspect of DTEs is their reliability in the presence of hardware malfunctions. While DTEs are generally regarded as robust by design, due to their redundancy and voting mechanisms, hardware faults can still have catastrophic consequences. To address this gap, we present an in-depth reliability analysis of two types of DTE hardware accelerators: classical and approximate implementations. Specifically, we conduct a comprehensive fault injection campaign, varying the number of trees involved in the classification task, the approximation technique used, and the tolerated accuracy loss, while evaluating several benchmark datasets. The results of this study demonstrate that approximation techniques have to be carefully designed, as they can significantly impact resilience. However, techniques that target the representation of features and thresholds appear to be better suited for fault tolerance.
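As a software-level analogue of the kind of fault injection described above, the toy sketch below flips one bit of a float32-encoded decision threshold in a hand-rolled stump ensemble and compares accuracy before and after. The stump representation, the single-bit fault model, and the majority vote are illustrative simplifications; this is not the hardware fault-injection campaign of the paper.

```python
import random
import struct

def flip_bit(value: float, bit: int) -> float:
    """Flip one bit of a float32-encoded threshold, mimicking a hardware upset."""
    packed = struct.unpack("<I", struct.pack("<f", value))[0]
    return struct.unpack("<f", struct.pack("<I", packed ^ (1 << bit)))[0]

def predict_stump(stump, x):
    """Each 'tree' here is a single stump: (feature index, threshold, left, right)."""
    feat, thr, left, right = stump
    return left if x[feat] <= thr else right

def ensemble_predict(trees, x):
    votes = [predict_stump(t, x) for t in trees]
    return max(set(votes), key=votes.count)   # majority vote across the ensemble

def accuracy(trees, data):
    return sum(ensemble_predict(trees, x) == y for x, y in data) / len(data)

def inject_and_measure(trees, data):
    """Inject one random single-bit fault into one tree's threshold and
    report (fault-free accuracy, faulty accuracy)."""
    faulty = [list(t) for t in trees]
    victim = random.randrange(len(faulty))
    faulty[victim][1] = flip_bit(faulty[victim][1], random.randrange(32))
    return accuracy(trees, data), accuracy([tuple(t) for t in faulty], data)
```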
{"title":"Reliability analysis of hardware accelerators for decision tree-based classifier systems","authors":"Mario Barbareschi , Salvatore Barone , Alberto Bosio , Antonio Emmanuele","doi":"10.1016/j.future.2026.108378","DOIUrl":"10.1016/j.future.2026.108378","url":null,"abstract":"<div><div>The increasing adoption of AI models has driven applications toward the use of hardware accelerators to meet high computational demands and strict performance requirements. Beyond consideration of performance and energy efficiency, explainability and reliability have emerged as pivotal requirements, particularly for critical applications such as automotive, medical, and aerospace systems. Among the various AI models, Decision Tree Ensembles (DTEs) are particularly notable for their high accuracy and explainability. Moreover, they are particularly well-suited for hardware implementations, enabling high-performance and improved energy efficiency. However, a frequently overlooked aspect of DTEs is their reliability in the presence of hardware malfunctions. While DTEs are generally regarded as robust by design, due to their redundancy and voting mechanisms, hardware faults can still have catastrophic consequences. To address this gap, we present an in-depth reliability analysis of two types of DTE hardware accelerators: classical and approximate implementations. Specifically, we conduct a comprehensive fault injection campaign, varying the number of trees involved in the classification task, the approximation technique used, and the tolerated accuracy loss, while evaluating several benchmark datasets. The results of this study demonstrate that approximation techniques have to be carefully designed, as they can significantly impact resilience. However, techniques that target the representation of features and thresholds appear to be better suited for fault tolerance.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108378"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146014882","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-24, DOI: 10.1016/j.future.2026.108389
Zhao Tong, Shiyan Zhang, Jing Mei, Can Wang, Keqin Li
The integration and development of unmanned aerial vehicles (UAVs) and mobile edge computing (MEC) technology provide users with more flexible, reliable, and high-quality computing services. However, most UAV-assisted MEC model designs focus mainly on static environments, which do not apply to the practical scenarios considered in this work. In this paper, we consider a UAV-assisted MEC platform that can provide continuous services for multiple mobile ground users with random movements and task arrivals. Moreover, we investigate the long-term system utility maximization problem in UAV-assisted MEC systems, considering continuous task offloading, users’ mobility, the UAV’s 3D trajectory control, and resource allocation. To address the challenges of limited system information, high-dimensional continuous actions, and state space approximation, we propose an Online decision-making algorithm for Dynamic environments based on Exploration-enhanced Greedy DDPG (ODEGD). Additionally, to more accurately evaluate the algorithm’s performance, we introduce real-world road data into the experiments. Experimental results show that the proposed algorithm reduces response delay by 26.98% and energy consumption by 22.61% compared to other algorithms, while achieving the highest system utility. These results validate the applicability of the ODEGD algorithm under dynamic conditions, demonstrating its good robustness and scalability.
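To give a sense of the trade-off such a system optimizes, the sketch below scores an offloading decision by a utility that penalizes response delay and device energy, comparing local execution against offloading over the wireless uplink. The weights, cost models, and function names are placeholders invented for this example and do not reflect ODEGD's actual formulation.

```python
def utility(delay_s: float, energy_j: float,
            w_delay: float = 0.6, w_energy: float = 0.4) -> float:
    """Higher is better: penalize both response delay and energy consumption."""
    return -(w_delay * delay_s + w_energy * energy_j)

def choose_offloading(task_cycles: float, task_bits: float,
                      f_local_hz: float, f_uav_hz: float,
                      uplink_bps: float, p_local_w: float, p_tx_w: float) -> str:
    """Compare local execution against offloading to the UAV-mounted edge server."""
    local_delay = task_cycles / f_local_hz
    local_energy = p_local_w * local_delay
    tx_delay = task_bits / uplink_bps
    remote_delay = tx_delay + task_cycles / f_uav_hz
    remote_energy = p_tx_w * tx_delay          # the device only pays for transmission
    return ("offload" if utility(remote_delay, remote_energy)
            > utility(local_delay, local_energy) else "local")
```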
{"title":"Online 3D trajectory and resource optimization for dynamic UAV-assisted MEC systems","authors":"Zhao Tong , Shiyan Zhang , Jing Mei , Can Wang , Keqin Li","doi":"10.1016/j.future.2026.108389","DOIUrl":"10.1016/j.future.2026.108389","url":null,"abstract":"<div><div>The integration and development of unmanned aerial vehicles (UAVs) and mobile edge computing (MEC) technology provide users with more flexible, reliable, and high-quality computing services. However, most UAV-assisted MEC model designs mainly focus on static environments, which do not apply to the practical scenarios considered in this work. In this paper, we consider a UAV-assisted MEC platform, which can provide continuous services for multiple mobile ground users with random movements and task arrivals. Moreover, we investigate the long-term system utility maximization problem in UAV-assisted MEC systems, considering continuous task offloading, users’ mobility, UAV’s 3D trajectory control, and resource allocation. To address the challenges of limited system information, high-dimensional continuous actions, and state space approximation, we propose an <u>O</u>nline decision-making algorithm for <u>D</u>ynamic environments based on <u>E</u>xploration-enhanced <u>G</u>reedy <u>D</u>DPG (ODEGD). Additionally, to more accurately evaluate the algorithm’s performance, we introduced real-world roads into the experiment. Experimental results show that the proposed algorithm reduces response delay by 26.98% and energy consumption by 22.61% compared to other algorithms, while achieving the highest system utility. These results validate the applicability of the ODEGD algorithm under dynamic conditions, demonstrating its good robustness and scalability.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108389"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146047991","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-24, DOI: 10.1016/j.future.2026.108383
Aleix Boné, Alejandro Aguirre, David Álvarez, Pedro J. Martinez-Ferrer, Vicenç Beltran
Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. On occasion, these APIs can be combined with optimized vendor math libraries such as cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms, so combining them within a single application is error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.
Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime.
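As a purely conceptual illustration of the data-flow model described above, the Python sketch below executes a DAG of tasks once their producers have finished. The real system expresses this with OpenMP/OmpSs-2 pragmas and Task-Aware CUDA/SYCL calls, not Python; the executor, its scheduling loop, and the example task names are assumptions made only to show the idea.

```python
from concurrent.futures import ThreadPoolExecutor

def run_dag(tasks: dict, deps: dict, workers: int = 4) -> None:
    """tasks maps a name to a zero-argument callable; deps maps a name to the
    set of producer task names it consumes. A task is enqueued only after all
    of its producers have been enqueued, and it waits for them to finish before
    running, mimicking data-flow (dependency-driven) execution."""
    futures = {}

    def run(name):
        for dep in deps.get(name, set()):
            futures[dep].result()        # wait for producer tasks to finish
        return tasks[name]()

    with ThreadPoolExecutor(max_workers=workers) as pool:
        while len(futures) < len(tasks):
            for name in tasks:
                if name not in futures and all(d in futures for d in deps.get(name, set())):
                    futures[name] = pool.submit(run, name)

# Example: two independent "kernels" whose results feed a final reduction step.
# run_dag({"a": job_a, "b": job_b, "reduce": job_reduce}, {"reduce": {"a", "b"}})
```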
The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable, in both execution time and memory footprint, to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems. On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation.
These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.
{"title":"A task-based data-flow methodology for programming heterogeneous systems with multiple accelerator APIs","authors":"Aleix Boné , Alejandro Aguirre , David Álvarez , Pedro J. Martinez-Ferrer , Vicenç Beltran","doi":"10.1016/j.future.2026.108383","DOIUrl":"10.1016/j.future.2026.108383","url":null,"abstract":"<div><div>Heterogeneous nodes that combine multi-core CPUs with diverse accelerators are rapidly becoming the norm in both high-performance computing (HPC) and AI infrastructures. Exploiting these platforms, however, requires orchestrating several low-level accelerator APIs such as CUDA, SYCL, and Triton. In some occasions they can be combined with optimized vendor math libraries: e.g., cuBLAS and oneAPI. Each API or library introduces its own abstractions, execution semantics, and synchronization mechanisms. Combining them within a single application is therefore error-prone and labor-intensive. We propose reusing a task-based data-flow methodology together with Task-Aware APIs (TA-libs) to overcome these limitations and facilitate the seamless integration of multiple accelerator programming models, while still leveraging the best-in-class kernels offered by each API.</div><div>Applications are expressed as a directed acyclic graph (DAG) of host tasks and device kernels managed by an OpenMP/OmpSs-2 runtime. We introduce Task-Aware SYCL (TASYCL) and leverage Task-Aware CUDA (TACUDA), which elevate individual accelerator invocations to first-class tasks. When multiple native runtimes coexist on the same multi-core CPU, they contend for threads, leading to oversubscription and performance variability. To address this, we unify their thread management under the nOS-V tasking and threading library, to which we contribute a new port of the PoCL (Portable OpenCL) runtime.</div><div>The methodology is evaluated on a multi-core server and a GPU-accelerated node using two contrasting workloads: the GPT-2 pre-training phase, representative of modern AI pipelines, and the HPCCG conjugate-gradient benchmark, representative of traditional HPC. From a performance standpoint, monolithic-kernel and fork-join executions are comparable —in both execution time and memory footprint— to a coarse-grained task-based formulation on both GPU-accelerated and multi-core systems. On the latter, unifying all runtimes through nOS-V mitigates interference and delivers performance on par with using a single runtime in isolation.</div><div>These results demonstrate that task-aware libraries, coupled with the nOS-V library, enable a single application to harness multiple accelerator programming models transparently and efficiently. The proposed methodology is immediately applicable to current heterogeneous nodes and is readily extensible to future systems that integrate even richer combinations of CPUs, GPUs, FPGAs, and AI accelerators.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108383"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146048039","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2025-12-26, DOI: 10.1016/j.future.2025.108333
E. Iraola, M. García-Lorenzo, F. Lordan-Gomis, F. Rossi, E. Prieto-Araujo, R.M. Badia
Digital twins are transforming the way we monitor, analyze, and control physical systems, but designing architectures that balance real-time responsiveness with heavy computational demands remains a challenge. Cloud-based solutions often struggle with latency and resource constraints, while edge-based approaches lack the processing power for complex simulations and data-driven optimizations.
To address this problem, we propose the High-Precision High-Performance Computer-enabled Digital Twin (HP2C-DT) reference architecture, which integrates High-Performance Computing (HPC) into the computing continuum. Unlike traditional setups that use HPC only for offline simulations, HP2C-DT makes it an active part of digital twin workflows, dynamically assigning tasks to edge, cloud, or HPC resources based on urgency and computational needs.
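A tiny sketch of what such urgency- and cost-based placement could look like is given below: a task is dispatched to edge, cloud, or HPC depending on its deadline and an estimate of its compute cost. The thresholds, the rule itself, and the task fields are illustrative assumptions, not the HP2C-DT scheduler.

```python
from dataclasses import dataclass

@dataclass
class TwinTask:
    name: str
    deadline_s: float      # how soon a result is needed
    est_core_hours: float  # rough estimate of the computational cost

def dispatch(task: TwinTask) -> str:
    """Toy placement rule: urgent, light tasks stay at the edge; heavy
    simulations go to HPC; everything else runs in the cloud."""
    if task.deadline_s < 1.0 and task.est_core_hours < 0.01:
        return "edge"      # real-time monitoring and control loops
    if task.est_core_hours > 10.0:
        return "hpc"       # large simulations and data-driven optimization
    return "cloud"

print(dispatch(TwinTask("grid-state-estimation", deadline_s=0.2, est_core_hours=0.001)))
```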
Furthermore, to bridge the gap between theory and practice, we introduce the HP2C-DT framework, a working implementation that uses COMPSs for seamless workload distribution across diverse infrastructures. We test it in a power grid use case, showing how it reduces communication bandwidth by an order of magnitude through edge-side data aggregation, improves response times by up to 2x via dynamic offloading, and maintains near-ideal strong scaling for compute-intensive workflows across a practical range of resources. These results demonstrate how an HPC-driven approach can push digital twins beyond their current limitations, making them smarter, faster, and more capable of handling real-world complexity.
{"title":"HP2C-DT: High-Precision High-Performance Computer-enabled Digital Twin","authors":"E. Iraola , M. García-Lorenzo , F. Lordan-Gomis , F. Rossi , E. Prieto-Araujo , R.M. Badia","doi":"10.1016/j.future.2025.108333","DOIUrl":"10.1016/j.future.2025.108333","url":null,"abstract":"<div><div>Digital twins are transforming the way we monitor, analyze, and control physical systems, but designing architectures that balance real-time responsiveness with heavy computational demands remains a challenge. Cloud-based solutions often struggle with latency and resource constraints, while edge-based approaches lack the processing power for complex simulations and data-driven optimizations.</div><div>To address this problem, we propose the <em>High-Precision High-Performance Computer-enabled Digital Twin</em> (HP2C-DT) reference architecture, which integrates High-Performance Computing (HPC) into the computing continuum. Unlike traditional setups that use HPC only for offline simulations, HP2C-DT makes it an active part of digital twin workflows, dynamically assigning tasks to edge, cloud, or HPC resources based on urgency and computational needs.</div><div>Furthermore, to bridge the gap between theory and practice, we introduce the HP2C-DT framework, a working implementation that uses COMPSs for seamless workload distribution across diverse infrastructures. We test it in a power grid use case, showing how it reduces communication bandwidth by an order of magnitude through edge-side data aggregation, improves response times by up to 2x via dynamic offloading, and maintains near-ideal strong scaling for compute-intensive workflows across a practical range of resources. These results demonstrate how an HPC-driven approach can push digital twins beyond their current limitations, making them smarter, faster, and more capable of handling real-world complexity.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108333"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145845127","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-07-01, Epub Date: 2026-01-19, DOI: 10.1016/j.future.2026.108379
Manuel Andruccioli, Giovanni Delnevo, Roberto Girau, Paola Salomoni
The adoption of Digital Twin (DT) technologies in public transport systems, particularly bus networks, is gaining momentum as cities seek smarter, more responsive, and efficient mobility solutions. Enabled by advances in IoT, AI, and Big Data Analytics, DTs offer real-time monitoring, simulation, and optimization of transit operations. However, despite their potential, the application of DTs in bus-based public transport remains relatively underexplored and fragmented across the literature. This study presents a Systematic Literature Review (SLR) aimed at synthesizing current research on DT technologies in this domain. Specifically, it investigates architectural models, technological frameworks, and platform designs; examines how AI and machine learning models are integrated to support operational tasks; and analyzes the role of Human-Computer Interaction (HCI) in the design and usability of such systems. By identifying key trends, challenges, and research gaps, this work provides a structured overview of the current landscape. Furthermore, it outlines directions for future research in DT-enabled public transportation systems.
{"title":"Digital twins in public bus transport: A systematic literature review of architectures, intelligence, and interaction","authors":"Manuel Andruccioli, Giovanni Delnevo, Roberto Girau, Paola Salomoni","doi":"10.1016/j.future.2026.108379","DOIUrl":"10.1016/j.future.2026.108379","url":null,"abstract":"<div><div>The adoption of Digital Twin (DT) technologies in public transport systems, particularly bus networks, is gaining momentum as cities seek smarter, more responsive, and efficient mobility solutions. Enabled by advances in IoT, AI, and Big Data Analytics, DTs offer real-time monitoring, simulation, and optimization of transit operations. However, despite their potential, the application of DTs in bus-based public transport remains relatively underexplored and fragmented across the literature. This study presents a Systematic Literature Review (SLR) aimed at synthesizing current research on DT technologies in this domain. Specifically, it investigates architectural models, technological frameworks, and platform designs; examines how AI and machine learning models are integrated to support operational tasks; and analyzes the role of Human-Computer Interaction (HCI) in the design and usability of such systems. By identifying key trends, challenges, and research gaps, this work provides a structured overview of the current landscape. Furthermore, it outlines directions for future research in DT-enabled public transportation systems.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"180 ","pages":"Article 108379"},"PeriodicalIF":6.2,"publicationDate":"2026-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146000904","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}