A hybrid metaheuristics-Bayesian optimization framework with safe transfer learning for continuous Spark tuning
Pub Date: 2025-12-16 | DOI: 10.1016/j.future.2025.108325
Mariano Garralda-Barrio, Carlos Eiras-Franco, Verónica Bolón-Canedo
Tuning configuration parameters in distributed Big Data engines such as Apache Spark is a high-dimensional, workload-dependent problem with significant impact on performance and operational cost. We address this challenge with a hybrid optimization framework that integrates Iterated Local Search, Tabu Search, and locally embedded Bayesian Optimization guided by STL-PARN (safe transfer learning with pattern-adaptive robust neighborhoods). Historical executions are partitioned into a Nucleus of reliable neighbors and a Corona of exploratory configurations, ensuring relevance while mitigating negative transfer. The surrogate within the embedded Bayesian Optimization stage decouples performance prediction from uncertainty modeling, enabling parameter-free acquisition functions that self-adapt to diverse workloads. Experiments on a modernized HiBench suite across multiple input scales show consistent gains over state-of-the-art baselines in execution time, convergence, and cost efficiency. Overall, the results demonstrate the robustness and practical value of embedding Bayesian Optimization within a global metaheuristic loop for adaptive, cost-aware Spark tuning. All source code and datasets are publicly available, supporting reproducibility and operational efficiency in large-scale data processing.
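As a rough illustration of the embedded Bayesian Optimization step described above, the sketch below fits a Gaussian-process surrogate to previously observed (configuration, runtime) pairs and proposes the next Spark configuration by Expected Improvement. The parameter names, bounds, and the GP-with-EI choice are illustrative assumptions; the paper's STL-PARN surrogate and parameter-free acquisition functions differ.

```python
# Minimal sketch of an inner Bayesian Optimization step over a small Spark
# configuration space. Knob names, bounds, and the GP + Expected Improvement
# surrogate are illustrative assumptions, not the STL-PARN surrogate.
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

BOUNDS = {                               # hypothetical Spark knobs (min, max)
    "spark.executor.cores": (1, 8),
    "spark.executor.memory_gb": (2, 32),
    "spark.sql.shuffle.partitions": (50, 800),
}

def expected_improvement(gp, X_cand, y_best):
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)
    z = (y_best - mu) / sigma            # minimizing execution time
    return (y_best - mu) * norm.cdf(z) + sigma * norm.pdf(z)

def suggest_next(history_X, history_y, n_candidates=1000, rng=None):
    """Fit a surrogate on observed (config, runtime) arrays and return the
    candidate configuration with the highest Expected Improvement."""
    rng = rng or np.random.default_rng(0)
    gp = GaussianProcessRegressor(normalize_y=True).fit(history_X, history_y)
    lo = np.array([b[0] for b in BOUNDS.values()])
    hi = np.array([b[1] for b in BOUNDS.values()])
    X_cand = rng.uniform(lo, hi, size=(n_candidates, len(BOUNDS)))
    ei = expected_improvement(gp, X_cand, np.min(history_y))
    return X_cand[np.argmax(ei)]
```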
{"title":"A hybrid metaheuristics-Bayesian optimization framework with safe transfer learning for continuous spark tuning","authors":"Mariano Garralda-Barrio, Carlos Eiras-Franco, Verónica Bolón-Canedo","doi":"10.1016/j.future.2025.108325","DOIUrl":"10.1016/j.future.2025.108325","url":null,"abstract":"<div><div>Tuning configuration parameters in distributed Big Data engines such as Apache Spark is a high-dimensional, workload-dependent problem with significant impact on performance and operational cost. We address this challenge with a hybrid optimization framework that integrates Iterated Local Search, Tabu Search, and locally embedded Bayesian Optimization guided by STL-PARN (safe transfer learning with pattern-adaptive robust neighborhoods). Historical executions are partitioned into a Nucleus of reliable neighbors and a Corona of exploratory configurations, ensuring relevance while mitigating negative transfer. The surrogate within the embedded Bayesian Optimization stage decouples performance prediction from uncertainty modeling, enabling parameter-free acquisition functions that self-adapt to diverse workloads. Experiments on a modernized HiBench suite across multiple input scales show consistent gains over state-of-the-art baselines in execution time, convergence, and cost efficiency. Overall, the results demonstrate the robustness and practical value of embedding Bayesian Optimization within a global metaheuristic loop for adaptive, cost-aware Spark tuning. All source code and datasets are publicly available, supporting reproducibility and operational efficiency in large-scale data processing.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108325"},"PeriodicalIF":6.2,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785044","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
DART: A state-aware online co-scheduling runtime for data-parallel training
Pub Date: 2025-12-16 | DOI: 10.1016/j.future.2025.108303
Teh-Jen Sun, A-Young Son, Eui-Nam Huh
Data-parallel training at scale is often run with static settings, which waste time when compute, input, and communication bottlenecks shift. Dynamic control can shorten wall clock time but, without a state-aware estimator, it tends to chase high-variance per-node measurements and treats resource importance as time-invariant despite changes in the training and cluster state. We present DART, a framework-agnostic online co-scheduling runtime that infers state-conditioned resource-importance weights (an attribution over compute, memory, input, communication, and thermal headroom inferred from CPU/GPU temperatures) via a cubature Kalman filter and jointly updates per-node dataset shard fraction, batch size, data-loader workers, and learning rate scale using accuracy-tracking, rate-limited steps at epoch boundaries (overhead < 2 %). Across 12 model-dataset configurations on 2–12 nodes, DART shortens wall clock time by up to 63.44 % (median 31.95 %) while keeping final Top-1 within 0.93 percentage points of static DDP. Trace and correlation analyses indicate fewer synchronizations and reduced compute skew rather than changes to the optimization trajectory.
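A minimal sketch of the rate-limited, epoch-boundary control step such a runtime might apply. The cubature Kalman filter is omitted; the resource-importance weights are simply passed in, and the knob names and the 10 % step cap are assumptions rather than DART's actual settings.

```python
# Illustrative rate-limited control step at an epoch boundary. The importance
# weights would come from DART's state estimator; here they are inputs, and
# the knob names and caps are assumptions.
from dataclasses import dataclass

@dataclass
class NodeKnobs:
    shard_fraction: float    # share of the dataset assigned to this node
    batch_size: int
    num_workers: int         # data-loader workers
    lr_scale: float

MAX_REL_STEP = 0.10          # never move a knob by more than 10% per epoch

def rate_limited(current, target, max_rel=MAX_REL_STEP):
    """Move `current` toward `target`, clamped to a relative step size."""
    lo, hi = current * (1 - max_rel), current * (1 + max_rel)
    return min(max(target, lo), hi)

def epoch_update(knobs, compute_weight, input_weight):
    """If compute is the dominant bottleneck, shrink this node's shard;
    if input dominates, add a data-loader worker (integer knob, one step)."""
    if compute_weight > input_weight:
        knobs.shard_fraction = rate_limited(knobs.shard_fraction,
                                            knobs.shard_fraction * 0.9)
    else:
        knobs.num_workers = min(knobs.num_workers + 1, 16)
    return knobs
```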
{"title":"DART: A state-aware online co-scheduling runtime for data-parallel training","authors":"Teh-Jen Sun , A-Young Son , Eui-Nam Huh","doi":"10.1016/j.future.2025.108303","DOIUrl":"10.1016/j.future.2025.108303","url":null,"abstract":"<div><div>Data-parallel training at scale is often run with static settings, which waste time when compute, input, and communication bottlenecks shift. Dynamic control can shorten wall clock time but, without a state-aware estimator, it tends to chase high-variance per-node measurements and treats resource importance as time-invariant despite changes in the training and cluster state. We present DART, a framework-agnostic online co-scheduling runtime that infers state-conditioned resource-importance weights (an attribution over compute, memory, input, communication, and thermal headroom inferred from CPU/GPU temperatures) via a cubature Kalman filter and jointly updates per-node dataset shard fraction, batch size, data-loader workers, and learning rate scale using accuracy-tracking, rate-limited steps at epoch boundaries (overhead < 2 %). Across 12 model-dataset configurations on 2–12 nodes, DART shortens wall clock time by up to 63.44 % (median 31.95 %) while keeping final Top-1 within 0.93 percentage points of static DDP. Trace and correlation analyses indicate fewer synchronizations and reduced compute skew rather than changes to the optimization trajectory.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108303"},"PeriodicalIF":6.2,"publicationDate":"2025-12-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785048","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clean up the mess: Addressing data pollution in cryptocurrency abuse reporting services
Pub Date: 2025-12-15 | DOI: 10.1016/j.future.2025.108313
Gibran Gomez, Kevin van Liebergen, Davide Sanvito, Giuseppe Siracusano, Roberto Gonzalez, Juan Caballero
Cryptocurrency abuse reporting services are a valuable data source about abusive blockchain addresses, prevalent types of cryptocurrency abuse, and their financial impact on victims. However, they may suffer data pollution due to their crowd-sourced nature. This work analyzes the extent and impact of data pollution in cryptocurrency abuse reporting services and proposes a novel LLM-based defense to address the pollution. We collect 289K abuse reports submitted over 6 years to two popular services and use them to answer three research questions. RQ1 analyzes the extent and impact of pollution. We show that spam reports will eventually flood unchecked abuse reporting services, with spam accounting for 75 % of BitcoinAbuse reports before it ceased operations. We build a public dataset of 19,443 abuse reports labeled with 19 popular abuse types and use it to reveal the inaccuracy of user-reported abuse types. We also identify 91 (0.1 %) reported addresses that are benign, yet account for 60 % of all received funds. RQ2 examines whether we can automate the identification of valid reports and their classification into abuse types. We propose an unsupervised LLM-based classifier that achieves an F1 score of 0.95 when classifying reports, an F1 of 0.89 on out-of-distribution data, and an F1 of 0.99 when identifying spam reports. Our unsupervised LLM-based classifier clearly outperforms two baselines: a supervised classifier and a naive use of the LLM. Finally, RQ3 demonstrates the usefulness of our LLM-based classifier for quantifying the financial impact of different cryptocurrency abuse types. We show that victim-reported losses heavily underestimate cybercriminal revenue: the revenue estimated from deposit transactions is 29 times higher. We find that investment scams have the highest financial impact and that extortions have lower conversion rates but compensate with massive email campaigns.
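A minimal sketch of how a report could be routed to an LLM for abuse-type and spam classification. The `call_llm` callable is a placeholder for an arbitrary chat-completion endpoint, and the label list and prompt wording are assumptions, not the paper's exact prompt or taxonomy.

```python
# Sketch of LLM-based abuse-report triage. `call_llm` is a placeholder for any
# chat-completion endpoint; the label list is abbreviated and the prompt
# wording is an assumption.
ABUSE_TYPES = ["sextortion", "investment scam", "ransomware", "giveaway scam",
               "blackmail", "other", "spam"]

def classify_report(report_text: str, call_llm) -> str:
    prompt = (
        "You are labeling cryptocurrency abuse reports.\n"
        f"Allowed labels: {', '.join(ABUSE_TYPES)}.\n"
        "Answer with exactly one label. Use 'spam' if the report is not a "
        "genuine abuse description.\n\n"
        f"Report:\n{report_text}\n\nLabel:"
    )
    answer = call_llm(prompt).strip().lower()
    return answer if answer in ABUSE_TYPES else "other"

# Usage with a stub model standing in for the real endpoint:
if __name__ == "__main__":
    fake_llm = lambda _prompt: "investment scam"
    print(classify_report("They promised to double my BTC...", fake_llm))
```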
{"title":"Clean up the mess: Addressing data pollution in cryptocurrency abuse reporting services","authors":"Gibran Gomez , Kevin van Liebergen , Davide Sanvito , Giuseppe Siracusano , Roberto Gonzalez , Juan Caballero","doi":"10.1016/j.future.2025.108313","DOIUrl":"10.1016/j.future.2025.108313","url":null,"abstract":"<div><div>Cryptocurrency abuse reporting services are a valuable data source about abusive blockchain addresses, prevalent types of cryptocurrency abuse, and their financial impact on victims. However, they may suffer data pollution due to their crowd-sourced nature. This work analyzes the extent and impact of data pollution in cryptocurrency abuse reporting services and proposes a novel LLM-based defense to address the pollution. We collect 289K abuse reports submitted over 6 years to two popular services and use them to answer three research questions. RQ1 analyzes the extent and impact of pollution. We show that spam reports will eventually flood unchecked abuse reporting services, with BitcoinAbuse receiving 75 % of spam before stopping operations. We build a public dataset of 19,443 abuse reports labeled with 19 popular abuse types and use it to reveal the inaccuracy of user-reported abuse types. We identified 91 (0.1 %) benign addresses reported, responsible for 60 % of all the received funds. RQ2 examines whether we can automate identifying valid reports and their classification into abuse types. We propose an unsupervised LLM-based classifier that achieves an F1 score of 0.95 when classifying reports, an F1 of 0.89 when classifying out-of-distribution data, and an F1 of 0.99 when identifying spam reports. Our unsupervised LLM-based classifier clearly outperforms two baselines: a supervised classifier and a naive usage of the LLM. Finally, RQ3 demonstrates the usefulness of our LLM-based classifier for quantifying the financial impact of different cryptocurrency abuse types. We show that victim-reported losses heavily underestimate cybercriminal revenue by estimating a 29 times higher revenue from deposit transactions. We identified that investment scams have the highest financial impact and that extortions have lower conversion rates but compensate for them with massive email campaigns.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"179 ","pages":"Article 108313"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785052","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HRB: A backfilling algorithm for heterogeneous clusters with job prioritization
Pub Date: 2025-12-15 | DOI: 10.1016/j.future.2025.108309
Jaime Palacios, Esteban Stafford, José Luis Bosque
Backfilling is a widely used scheduling technique in High-Performance Computing (HPC) systems to improve resource utilization. However, traditional approaches like EASY Backfill were devised for single-core, homogeneous environments, without considering the implications of multi-core architectures or the individual characteristics of nodes in heterogeneous clusters. This article proposes two refinements of EASY, called Heterogeneous Backfill (HB) and Heterogeneous Reordering Backfill (HRB). These algorithms adapt the backfilling strategy to heterogeneous multi-core environments by incorporating node properties into the scheduling process. The HB algorithm sorts nodes based on a given criterion, such as power consumption or performance, to improve resource allocation. The HRB algorithm extends this approach by incorporating job reordering criteria, allowing for more efficient backfilling decisions. An evaluation of these algorithms shows that they can significantly reduce energy consumption and improve scheduling efficiency in heterogeneous clusters. The results demonstrate that the proposed algorithms outperform traditional backfilling methods, such as EASY Backfill, in terms of energy consumption, waiting time, and makespan. By embracing the heterogeneity of modern HPC systems, these algorithms enable more efficient resource utilization and contribute to the overall performance of large-scale computing environments.
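A toy sketch of the node-ordering idea behind HB: candidate nodes are sorted by a configurable criterion (power or performance) before cores are allocated to a backfilled job. The Node fields and the free-core bookkeeping are simplified assumptions, not the schedulers' real data structures.

```python
# Toy node-ordering backfill: sort candidate nodes by the chosen criterion,
# then allocate cores greedily. Fields and bookkeeping are simplified.
from dataclasses import dataclass

@dataclass
class Node:
    name: str
    free_cores: int
    power_w: float       # per-core power draw
    perf: float          # relative per-core performance

def try_backfill(job_cores: int, nodes: list, criterion: str = "power"):
    """Allocate `job_cores` cores, preferring nodes that minimize power draw
    (criterion="power") or maximize performance (criterion="perf")."""
    key = (lambda n: n.power_w) if criterion == "power" else (lambda n: -n.perf)
    allocation, remaining = [], job_cores
    for node in sorted(nodes, key=key):
        if remaining == 0:
            break
        take = min(node.free_cores, remaining)
        if take:
            allocation.append((node.name, take))
            node.free_cores -= take
            remaining -= take
    return allocation if remaining == 0 else None   # None: cannot backfill now

nodes = [Node("big", 16, 250.0, 2.0), Node("eff", 8, 90.0, 1.0)]
print(try_backfill(12, nodes, criterion="power"))   # fills the low-power node first
```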
{"title":"HRB: A backfilling algorithm for heterogeneous clusters with job prioritization","authors":"Jaime Palacios, Esteban Stafford, José Luis Bosque","doi":"10.1016/j.future.2025.108309","DOIUrl":"10.1016/j.future.2025.108309","url":null,"abstract":"<div><div>Backfilling is a widely used scheduling technique in High-Performance Computing (HPC) systems to improve resource utilization. However, traditional approaches like EASY Backfill were devised for mono-core homogeneous environments, without considering the implications of multi-core architectures or the individual characteristics of nodes in heterogeneous clusters. This article proposes two refinements of EASY called Heterogeneous Backfill (HB) and Heterogeneous Reordering Backfill (HRB). These algorithms adapt the backfilling strategy to heterogeneous multi-core environments by incorporating node properties into the scheduling process. The HB algorithm sorts nodes based on a given criterion, such as power consumption or performance, to improve resource allocation. The HRB algorithm extends this approach by incorporating job reordering criteria, allowing for more efficient backfilling decisions. An evaluation of these algorithms shows that they can significantly reduce energy consumption and improve scheduling efficiency in heterogeneous clusters. The results demonstrate that the proposed algorithms outperform traditional backfilling methods, such as EASY Backfill, in terms of energy consumption, waiting time or makespan. By embracing the heterogeneity of modern HPC systems, these algorithms enable more efficient resource utilization and contribute to the overall performance of large-scale computing environments.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108309"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145797719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
interTwin: Advancing Scientific Digital Twins through AI, Federated Computing and Data
Pub Date: 2025-12-15 | DOI: 10.1016/j.future.2025.108312
Andrea Manzi, Raul Bardaji, Ivan Rodero, Germán Moltó, Sandro Fiore, Isabel Campos, Donatello Elia, Francesco Sarandrea, A. Paul Millar, Daniele Spiga, Matteo Bunino, Gabriele Accarino, Lorenzo Asprea, Samuel Bernardo, Miguel Caballer, Charis Chatzikyriakou, Diego Ciangottini, Michele Claus, Andrea Cristofori, Davide Donno, Juraj Zvolensky
The EU project interTwin co-designed and implemented a prototype of an interdisciplinary Digital Twin Engine (DTE), an open-source platform that provides generic and domain-specific software components for modelling and simulation to integrate application-specific Digital Twins (DTs). The DTE is built upon a co-designed conceptual model, the DTE blueprint architecture, guided by open standards and interoperability principles. The ambition is to develop a unified approach to the implementation of DTs that is applicable across diverse scientific disciplines, fostering collaboration and facilitating development. Co-design involved DT use cases from high-energy physics, radio astronomy, astroparticle physics, climate research, and environmental monitoring, which drove advancements in modelling and simulation by leveraging heterogeneous distributed digital infrastructures, enabling dynamic workflow composition, real-time data management and processing, quality and uncertainty tracing of models, and multi-source data fusion.
{"title":"interTwin: Advancing Scientific Digital Twins through AI, Federated Computing and Data","authors":"Andrea Manzi , Raul Bardaji , Ivan Rodero , Germán Moltó , Sandro Fiore , Isabel Campos , Donatello Elia , Francesco Sarandrea , A. Paul Millar , Daniele Spiga , Matteo Bunino , Gabriele Accarino , Lorenzo Asprea , Samuel Bernardo , Miguel Caballer , Charis Chatzikyriakou , Diego Ciangottini , Michele Claus , Andrea Cristofori , Davide Donno , Juraj Zvolensky","doi":"10.1016/j.future.2025.108312","DOIUrl":"10.1016/j.future.2025.108312","url":null,"abstract":"<div><div>The EU project interTwin, co-designed and implemented the prototype of an interdisciplinary Digital Twin Engine (DTE), an open-source platform that provides generic and domain-specific software components for modelling and simulation to integrate application-specific Digital Twins (DTs). The DTE is built upon a co-designed conceptual model - the DTE blueprint architecture - guided by open standards and interoperability principles. The ambition is to develop a unified approach to the implementation of DTs that is applicable across diverse scientific disciplines to foster collaborations and facilitate developments. Co-design involved DT use cases from high-energy physics, radio astronomy, astroparticle physics, climate research, and environmental monitoring, which drove advancements in modelling and simulation by leveraging heterogeneous distributed digital infrastructures, enabling dynamic workflow composition, real-time data management and processing, quality and uncertainty tracing of models, and multi-source data fusion.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"179 ","pages":"Article 108312"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785045","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SoA-SDA: Quantum-Resistant, Energy-Efficient In-Network Aggregation Protocol for Resource-Constrained Environment
Pub Date: 2025-12-15 | DOI: 10.1016/j.future.2025.108321
Lei Song, Leyi Shi, Xiuli Ren, Xiaoguang Li
With the rapid development of quantum computing, secure in-network aggregation is essential for protecting sensitive information in resource-constrained environments. However, traditional aggregation methods often fall short due to their high computational costs and security concerns, and their security mechanisms are becoming less effective against quantum attacks. Therefore, it is critical to develop aggregation techniques that can withstand quantum computing threats while minimizing overhead. In this paper, we use lattice-based cryptography to defend against quantum attacks. Given the significant computational cost of lattice encryption, we categorize nodes into sensitive and non-sensitive, applying different encryption methods accordingly. Lattice encryption secures sensitive data, while data compression further reduces the computational load. To better differentiate between sensitivity types, we design a hypertree. The leaf nodes are assigned α and β values (known as weak game-theoretical perturbations), while the control center, located at the root, uses additional weights to determine the optimal routing path. Watermarks are used to distinguish between sensitive and non-sensitive nodes within the same layer. These watermarks help identify nodes at the same level, allowing data packets containing watermark and weight metadata to be forwarded to the next node for secure aggregation. The highest-weight nodes undergo aggregation at the control center. This approach is implemented in the SoA-SDA (State-of-the-Art Secure Data Aggregation) protocol. Evaluations in small-scale settings show that SoA-SDA outperforms existing solutions with lower overhead, better fault tolerance, and reduced latency. Large-scale tests further highlight its strong compatibility and robust security against attacks such as MITM, side-channel, DoS, and Sybil, while maintaining quantum resistance.
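A minimal sketch of the sensitivity-based dispatch idea, assuming a placeholder lattice primitive: sensitive payloads would go through lattice encryption, while non-sensitive ones are only compressed. Only the zlib compression is real code; `lattice_encrypt` stands in for an actual lattice-based scheme and is not part of the paper's implementation.

```python
# Sensitivity-based dispatch: heavyweight (post-quantum) protection only where
# needed. `lattice_encrypt` is a stub for a real lattice-based scheme.
import zlib

def lattice_encrypt(payload: bytes, public_key: bytes) -> bytes:
    raise NotImplementedError("plug in a lattice-based encryption library here")

def prepare_packet(payload: bytes, sensitive: bool, public_key: bytes = b"") -> dict:
    """Encrypt sensitive payloads; merely compress non-sensitive ones."""
    if sensitive:
        body = lattice_encrypt(payload, public_key)
    else:
        body = zlib.compress(payload)
    return {"sensitive": sensitive, "body": body}
```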
{"title":"SoA-SDA: Quantum-Resistant, Energy-Efficient In-Network Aggregation Protocol for Resource-Constrained Environment","authors":"Lei Song , Leyi Shi , Xiuli Ren , Xiaoguang Li","doi":"10.1016/j.future.2025.108321","DOIUrl":"10.1016/j.future.2025.108321","url":null,"abstract":"<div><div>With the rapid development of quantum, secure in-network aggregation is essential for sensitive information in resource-constrained environments. However, traditional aggregation methods often fall short due to their high computational costs and security concerns. Meanwhile, their security methods are becoming less effective when defending against quantum attacks. Therefore, it is critical for aggregation techniques to develop solutions that can withstand quantum computing threats while minimizing overhead. In this paper, we use lattice cryptography to defend against quantum attacks. Given the significant computational cost of lattice encryption, we categorize nodes into sensitive and non-sensitive, applying different encryption methods accordingly. Lattice encryption secures sensitive data, while data compression further reduces the computational load. For better differentiation between sensitive types, we design a hypertree. The leaf nodes are assigned <em>α</em> and <em>β</em> values(known as weak game-theoretical perturbations), while the control center, located at the root, uses other weight to determine the optimal routing path. Watermarks are used to distinguish between sensitive and non-sensitive nodes within the same layer. These watermarks help identify nodes at the same level, allowing data packets containing watermark and weight metadata to be forwarded to the next node for secure aggregation. The highest-weight nodes undergo aggregation at the control center. This approach is implemented in the SoA-SDA (State-of-the-Art Secure Data Aggregation) protocol. Evaluations in small-scale settings show that SoA-SDA outperforms existing solutions with lower overhead, better fault tolerance, and reduced latency. Large-scale tests further highlight its strong compatibility and robust security against attacks like MITM, side-channel, DoS, and Sybil, while maintaining quantum resistance.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108321"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LLM-APTDS: A high-precision advanced persistent threat detection system for imbalanced data based on large language models with strong interpretability
Pub Date: 2025-12-15 | DOI: 10.1016/j.future.2025.108315
Longjing Yang, Ayong Ye, Yuanhuang Liu, Wenting Lu, Chuang Huang
Advanced persistent threats (APTs) pose a significant challenge to global cybersecurity, causing substantial economic losses. Existing detection methods often rely on expert-defined rules to map anomalous events to APT tactics, yet they remain highly dependent on prior knowledge, making them unsuitable for dynamic and complex attack scenarios and leaving them with insufficient fine-grained activity identification and attack provenance capabilities. This study proposes LLM-APTDS, an APT detection system based on large language models (LLMs). First, a multi-model collaborative detection architecture is constructed to leverage LLMs’ semantic understanding for precise localization of log anomalies. Second, a K-nearest neighbor graph reconstruction algorithm is designed to reconstruct the relevant neighborhood graph of malicious entities, enhancing contextual awareness of attack behavior. Finally, a cyclically enhanced analysis mechanism, guided by the Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) knowledge graph, allows the LLM to iteratively reason and generate multi-dimensional threat intelligence reports, while simultaneously providing multi-layered explanations and automated mitigation strategies. Experiments using the Defense Advanced Research Projects Agency Transparent Computing Engagement 3 (DARPA TC-E3) dataset demonstrate that, compared to baseline methods, the proposed system achieves a 5 % improvement in detection precision and a 4 % increase in F1-score, while producing high-quality, multi-dimensional threat intelligence reports.
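A small sketch of the K-nearest-neighbor graph reconstruction idea, assuming log events have already been embedded as vectors elsewhere: each event is linked to its k nearest neighbors, and the multi-hop neighborhood around a flagged entity can then be collected as context. This is illustrative, not the paper's algorithm.

```python
# k-NN neighborhood graph over log-event embeddings; the embedding step is
# assumed to exist elsewhere, only graph construction is shown.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_graph(embeddings: np.ndarray, k: int = 5):
    """Return edges (i, j, distance) linking each event to its k neighbors."""
    nn = NearestNeighbors(n_neighbors=k + 1).fit(embeddings)   # +1: self match
    dists, idxs = nn.kneighbors(embeddings)
    edges = []
    for i, (drow, irow) in enumerate(zip(dists, idxs)):
        for d, j in zip(drow[1:], irow[1:]):                   # skip self
            edges.append((i, int(j), float(d)))
    return edges

def neighborhood_of(entity_idx: int, edges, hops: int = 1):
    """Collect node ids reachable from the flagged entity within `hops` hops."""
    frontier, seen = {entity_idx}, {entity_idx}
    for _ in range(hops):
        frontier = {j for i, j, _ in edges if i in frontier} - seen
        seen |= frontier
    return seen
```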
{"title":"LLM-APTDS: A high-precision advanced persistent threat detection system for imbalanced data based on large language models with strong interpretabilit","authors":"Longjing Yang , Ayong Ye , Yuanhuang Liu , Wenting Lu , Chuang Huang","doi":"10.1016/j.future.2025.108315","DOIUrl":"10.1016/j.future.2025.108315","url":null,"abstract":"<div><div>Advanced persistent threats (APTs) pose a significant challenge to global cybersecurity, causing substantial economic losses. Existing detection methods often rely on expert-defined rules to map anomalous events to APT tactics. Still, they are highly dependent on prior knowledge, making them unsuitable for dynamic and complex attack scenarios. This results in insufficient fine-grained activity identification and attack provenance capabilities. This study proposes LLM-APTDS, an APT detection system based on large language models (LLMs). First, a multi-model collaborative detection architecture is constructed to leverage LLMs’ semantic understanding for precise localization of log anomalies. Second, a K-nearest neighbor graph reconstruction algorithm is designed to reconstruct the relevant neighborhood graph of malicious entities, enhancing contextual awareness of attack behavior. Finally, a cyclically enhanced analysis mechanism, guided by the Adversarial Tactics, Techniques, and Common Knowledge (ATT&CK) knowledge graph, allows the LLM to iteratively reason and generate threat intelligence reports with multiple dimensions, while simultaneously providing multi-layered explanations and automated mitigation strategies. Experiments using the Defense Advanced Research Projects Agency Transparent Computing Engagement 3 (DARPA TC-E3) dataset demonstrate that, compared to baseline methods, the proposed system achieves a 5 % improvement in detection precision and a 4 % increase in F1-score, while producing high-quality, multi-dimensional threat intelligence reports.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108315"},"PeriodicalIF":6.2,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145785043","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A pattern-aware LSTM-based approach for APT detection leveraging a realistic dataset for critical infrastructure security
Pub Date: 2025-12-14 | DOI: 10.1016/j.future.2025.108308
Eider Iturbe, Christos Dalamagkas, Panagiotis Radoglou-Grammatikis, Erkuden Rios, Nerea Toledo
Advanced Persistent Threats (APTs) represent some of the most sophisticated and coordinated cyberattacks, often targeting critical infrastructure with stealthy, multi-stage techniques. Despite the availability of numerous intrusion detection datasets, most fail to capture the sequential and strategic nature of APT campaigns as outlined in frameworks like MITRE ATT&CK. This paper introduces a novel dataset based on a realistic emulation of the Sandworm APT group targeting the Supervisory Control and Data Acquisition (SCADA) system of a Wide Area Measurement System (WAMS). The dataset captures the full lifecycle of an APT attack, from initial access to impact, in a structured and time-ordered manner, enabling the study of both atomic and multi-step intrusion behaviours. We train and evaluate supervised multiclass sequence-aware models, specifically Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) architectures, to detect these behaviours using network flow data, assessing their performance and analysing their strengths and limitations. Our results show that BiLSTM models offer greater stability and generalization, while LSTM models achieve competitive performance with optimal configurations. These findings highlight the importance of realistic, sequence-aware datasets for developing robust intrusion detection systems tailored to modern APT threats.
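A minimal BiLSTM flow-sequence classifier of the general kind evaluated in the paper, written in PyTorch. The feature dimension, hidden size, and number of classes are placeholders; the authors' exact architecture and hyperparameters may differ.

```python
# Minimal BiLSTM multiclass classifier over fixed-length flow sequences.
# Dimensions are placeholders, not the paper's configuration.
import torch
import torch.nn as nn

class BiLSTMClassifier(nn.Module):
    def __init__(self, n_features=20, hidden=64, n_classes=10):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, batch_first=True,
                            bidirectional=True)
        self.head = nn.Linear(2 * hidden, n_classes)   # 2x: forward + backward

    def forward(self, x):                # x: (batch, seq_len, n_features)
        out, _ = self.lstm(x)
        return self.head(out[:, -1, :])  # logits from the last time step

model = BiLSTMClassifier()
logits = model(torch.randn(8, 30, 20))   # 8 flow sequences, 30 steps each
loss = nn.CrossEntropyLoss()(logits, torch.randint(0, 10, (8,)))
```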
{"title":"A pattern-aware LSTM-based approach for APT detection leveraging a realistic dataset for critical infrastructure security","authors":"Eider Iturbe , Christos Dalamagkas , Panagiotis Radoglou-Grammatikis , Erkuden Rios , Nerea Toledo","doi":"10.1016/j.future.2025.108308","DOIUrl":"10.1016/j.future.2025.108308","url":null,"abstract":"<div><div>Advanced Persistent Threats (APTs) represent some of the most sophisticated and coordinated cyberattacks, often targeting critical infrastructure with stealthy, multi-stage techniques. Despite the availability of numerous intrusion detection datasets, most fail to capture the sequential and strategic nature of APT campaigns as outlined in frameworks like MITRE ATT&CK. This paper introduces a novel dataset based on a realistic emulation of the Sandworm APT group targeting the Supervisory Control and Data Acquisition (SCADA) system of a Wide Area Measurement System (WAMS). The dataset captures the full lifecycle of an APT attack, from initial access to impact, in a structured and time-ordered manner, enabling the study of both atomic and multi-step intrusion behaviours. We train and evaluate supervised multiclass sequence-aware models, specifically Long Short-Term Memory (LSTM) and Bidirectional LSTM (BiLSTM) architectures, to detect these behaviours using network flow data, assessing their performance and analysing their strengths and limitations. Our results show that BiLSTM models offer greater stability and generalization, while LSTM models achieve competitive performance with optimal configurations. These findings highlight the importance of realistic, sequence-aware datasets for developing robust intrusion detection systems tailored to modern APT threats.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108308"},"PeriodicalIF":6.2,"publicationDate":"2025-12-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145753433","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MPI malleability validation under replayed real-world HPC conditions
Pub Date: 2025-12-13 | DOI: 10.1016/j.future.2025.108305
Sergio Iserte, Maël Madon, Georges Da Costa, Jean-Marc Pierson, Antonio J. Peña
Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among administrators and end users regarding their feasibility under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios that reproduce actual cluster conditions of jobs and users by replaying workload logs on a High-Performance Computing (HPC) infrastructure. Our methodology is capable of adapting the workload to the target cluster. We evaluate it on a malleability-enabled 125-node partition of the MareNostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability on a novel use case of a pioneer user of malleability (our “PhD Student”): parallel-efficiency-aware malleability reduced the malleable workload's completion time by 27 % without delaying the baseline workload; it introduced queueing delays for individual jobs but maintained the resource utilization rate.
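A bare-bones sketch of the log-replay idea: sleep until each job's original submit offset, then submit it to the target cluster. The log field names, the sbatch invocation, and the time-scaling factor are assumptions; the paper's methodology additionally adapts job sizes to the target machine.

```python
# Replay a workload trace by honoring each job's submit offset and handing it
# to Slurm. Field names and the scaling factor are assumptions.
import subprocess
import time

def replay(jobs, scale=1.0):
    """jobs: iterable of dicts with 'submit_s' (offset from trace start),
    'nodes', 'walltime_min', and 'script' already adapted to the cluster."""
    start = time.monotonic()
    for job in sorted(jobs, key=lambda j: j["submit_s"]):
        wait = job["submit_s"] * scale - (time.monotonic() - start)
        if wait > 0:
            time.sleep(wait)                       # honor the original submit time
        subprocess.run(["sbatch", f"--nodes={job['nodes']}",
                        f"--time={job['walltime_min']}", job["script"]],
                       check=False)
```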
{"title":"MPI malleability validation under replayed real-world HPC conditions","authors":"Sergio Iserte , Maël Madon , Georges Da Costa , Jean-Marc Pierson , Antonio J. Peña","doi":"10.1016/j.future.2025.108305","DOIUrl":"10.1016/j.future.2025.108305","url":null,"abstract":"<div><div>Dynamic Resource Management (DRM) techniques can be leveraged to maximize throughput and resource utilization in computational clusters. Although DRM has been extensively studied through analytical workloads and simulations, skepticism persists among end administrators and users regarding their feasibility under real-world conditions. To address this problem, we propose a novel methodology for validating DRM techniques, such as malleability, in realistic scenarios that reproduce actual cluster conditions of jobs and users by replaying workload logs on a High-performance Computing (HPC) infrastructure. Our methodology is capable of adapting the workload to the target cluster. We evaluate our methodology in a malleability-enabled 125-node partition of the Marenostrum 5 supercomputer. Our results validate the proposed method and assess the benefits of MPI malleability on a novel use case of a pioneer user of malleability (our “PhD Student”): parallel-efficiency-aware malleability reduced a malleable workload time by 27 % without delaying the baseline workload, although introducing queueing delays for individual jobs, but maintaining the resource utilization rate.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108305"},"PeriodicalIF":6.2,"publicationDate":"2025-12-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145753432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Automated federated aggregation for dynamic systems and data in mobile edge computing
Pub Date: 2025-12-12 | DOI: 10.1016/j.future.2025.108304
Zhao Yang, Xuanyun Qiu, Haoran Hu, Weiyi Hu, Hua Cui, Qingshuang Sun
Federated Learning (FL) is a privacy-preserving distributed machine learning approach that enables collaborative training using data from Mobile Edge Computing (MEC) devices without accessing raw data. However, deploying FL on MEC devices faces challenges due to resource and data heterogeneity and dynamic changes, which can cause unstable training and fairness issues, limiting global model performance and efficiency. This paper proposes an automated FL method designed for dynamic MEC environments, featuring adjustable synchronization intervals and an adaptive aggregation strategy. By combining Bidirectional Long Short-Term Memory networks with Q-learning, the method predicts device availability and dynamically adjusts synchronization intervals. This improves device participation in aggregation and reduces waiting times. Additionally, a Graph Attention Network combined with a GraphTransformer models device collaboration and evaluates knowledge contributions, optimizing aggregation to maximize the utility of distributed data. Extensive experiments show that the proposed method improves accuracy (by 0.7 % to 21.3 %) and efficiency (by 1.19× to 8.93×) compared to baseline methods.
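A FedAvg-style sketch of contribution-weighted aggregation. In the paper the per-client scores come from the GAT and GraphTransformer module; here they are plain numbers supplied by the caller, and models are simple name-to-tensor state dictionaries.

```python
# Contribution-weighted federated aggregation over PyTorch state_dicts.
# The contribution scores are assumed to be computed elsewhere.
import torch

def aggregate(client_states, contrib_scores):
    """client_states: list of state_dicts; contrib_scores: list of floats."""
    weights = torch.tensor(contrib_scores, dtype=torch.float32)
    weights = weights / weights.sum()               # normalize to a simplex
    global_state = {}
    for key in client_states[0]:
        stacked = torch.stack([s[key].float() for s in client_states])
        shaped = weights.view(-1, *[1] * (stacked.dim() - 1))   # broadcastable
        global_state[key] = (shaped * stacked).sum(dim=0)
    return global_state
```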
{"title":"Automated federated aggregation for dynamic systems and data in mobile edge computing","authors":"Zhao Yang , Xuanyun Qiu , Haoran Hu , Weiyi Hu , Hua Cui , Qingshuang Sun","doi":"10.1016/j.future.2025.108304","DOIUrl":"10.1016/j.future.2025.108304","url":null,"abstract":"<div><div>Federated Learning (FL) is a privacy-preserving distributed machine learning approach that enables collaborative training using data from Mobile Edge Computing (MEC) devices without accessing raw data. However, deploying FL on MEC devices faces challenges due to resource and data heterogeneity and dynamic changes, which can cause unstable training and fairness issues, limiting global model performance and efficiency. This paper proposes an automated FL method designed for dynamic MEC environments, featuring adjustable synchronization intervals and an adaptive aggregation strategy. By combining Bidirectional Long Short-Term Memory networks with Q-learning, the method predicts device availability and dynamically adjusts synchronization intervals. This improves device participation in aggregation and reduces waiting times. Additionally, a Graph Attention Network with GraphTransformer models device collaboration and evaluates knowledge contributions, optimizing aggregation to maximize the utility of distributed data. Extensive experiments show that the proposed method improves accuracy (by 0.7 % to 21.3 %) and efficiency (by 1.19 × to 8.93 × ) compared to baseline methods.</div></div>","PeriodicalId":55132,"journal":{"name":"Future Generation Computer Systems-The International Journal of Escience","volume":"178 ","pages":"Article 108304"},"PeriodicalIF":6.2,"publicationDate":"2025-12-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145732461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}