Computational Burst Buffers: Accelerating HPC I/O via In-Storage Compression Offloading
Pub Date: 2025-12-11 | DOI: 10.1109/TPDS.2025.3643175
Xiang Chen;Bing Lu;Haoquan Long;Huizhang Luo;Yili Ma;Guangming Tan;Dingwen Tao;Fei Wu;Tao Lu
Burst buffers (BBs) act as an intermediate storage layer between compute nodes and parallel file systems (PFS), effectively alleviating the I/O performance gap in high-performance computing (HPC). As scientific simulations and AI workloads generate larger checkpoints and analysis outputs, BB capacity shortages and PFS bandwidth bottlenecks are emerging, and CPU-based compression is not an effective solution due to its high overhead. We introduce Computational Burst Buffers (CBBs), a storage paradigm that embeds hardware compression engines, such as application-specific integrated circuits (ASICs), inside computational storage drives (CSDs) at the BB tier. CBB transparently offloads both lossless and error-bounded lossy compression from CPUs to CSDs, thereby (i) expanding effective SSD-backed BB capacity, (ii) reducing BB–PFS traffic, and (iii) eliminating the contention and energy overheads of CPU-based compression. Unlike prior CSD-based compression designs targeting databases or flash caching, CBB co-designs the burst-buffer layer and CSD hardware for HPC and quantitatively evaluates compression offload in BB–PFS hierarchies. We prototype CBB using a PCIe 5.0 CSD with an ASIC Zstd-like compressor and an FPGA prototype of an SZ entropy encoder, and evaluate CBB on a 16-node cluster. Experiments with four representative HPC applications and a large-scale workflow simulator show up to 61% lower application runtime, 8–12× higher cache hit ratios, and substantially reduced compute-node CPU utilization compared to software compression and conventional BBs. These results demonstrate that compression-aware BBs with CSDs provide a practical, scalable path to next-generation HPC storage.
{"title":"Computational Burst Buffers: Accelerating HPC I/O via In-Storage Compression Offloading","authors":"Xiang Chen;Bing Lu;Haoquan Long;Huizhang Luo;Yili Ma;Guangming Tan;Dingwen Tao;Fei Wu;Tao Lu","doi":"10.1109/TPDS.2025.3643175","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3643175","url":null,"abstract":"Burst buffers (BBs) act as an intermediate storage layer between compute nodes and parallel file systems (PFS), effectively alleviating the I/O performance gap in high-performance computing (HPC). As scientific simulations and AI workloads generate larger checkpoints and analysis outputs, BB capacity shortages and PFS bandwidth bottlenecks are emerging, and CPU-based compression is not an effective solution due to its high overhead. We introduce <underline>Computational Burst Buffers</u> (CBBs), a storage paradigm that embeds hardware compression engines such as application-specific integrated circuit (ASIC) inside computational storage drives (CSDs) at the BB tier. CBB transparently offloads both lossless and error-bounded lossy compression from CPUs to CSDs, thereby (i) expanding effective SSD-backed BB capacity, (ii) reducing BB–PFS traffic, and (iii) eliminating contention and energy overheads of CPU-based compression. Unlike prior CSD-based compression designs targeting databases or flash caching, CBB co-designs the burst-buffer layer and CSD hardware for HPC and quantitatively evaluates compression offload in BB–PFS hierarchies. We prototype CBB using a PCIe 5.0 CSD with an ASIC Zstd-like compressor and an FPGA prototype of an SZ entropy encoder, and evaluate CBB on a 16-node cluster. Experiments with four representative HPC applications and a large-scale workflow simulator show up to 61% lower application runtime, 8–12× higher cache hit ratios, and substantially reduced compute-node CPU utilization compared to software compression and conventional BBs. These results demonstrate that compression-aware BBs with CSDs provide a practical, scalable path to next-generation HPC storage.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"518-532"},"PeriodicalIF":6.0,"publicationDate":"2025-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145929373","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Survey on Machine Learning-Based HPC I/O Analysis and Optimization
Pub Date: 2025-12-03 | DOI: 10.1109/TPDS.2025.3639682
Jingxian Peng;Lihua Yang;Huijun Wu;Wenzhe Zhang;Zhenwei Wu;Wei Zhang;Jiaxin Li;Yiqin Dai;Yong Dong
The soaring computing power of HPC systems supports numerous large-scale applications, which generate massive data volumes and diverse I/O patterns, leading to severe I/O bottlenecks. Analyzing and optimizing HPC I/O is therefore critical. However, traditional approaches are typically customized and lack the adaptability required to cope with dynamic changes in HPC environments. To address this challenge, Machine Learning (ML) has been increasingly adopted to automate and enhance I/O analysis and optimization. Given sufficient I/O traces from HPC systems, ML can learn underlying I/O behaviors, extract actionable insights, and dynamically adapt to evolving workloads to improve performance. In this survey, we propose a novel taxonomy that aligns HPC I/O problems with learning tasks to systematically review existing studies. Through this taxonomy, we synthesize key findings on research distribution, data preparation, and model selection. Finally, we discuss several directions to advance the effective integration of ML in HPC I/O systems.
{"title":"A Survey on Machine Learning-Based HPC I/O Analysis and Optimization","authors":"Jingxian Peng;Lihua Yang;Huijun Wu;Wenzhe Zhang;Zhenwei Wu;Wei Zhang;Jiaxin Li;Yiqin Dai;Yong Dong","doi":"10.1109/TPDS.2025.3639682","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3639682","url":null,"abstract":"The soaring computing power of HPC systems supports numerous large-scale applications, which generate massive data volumes and diverse I/O patterns, leading to severe I/O bottlenecks. Analyzing and optimizing HPC I/O is therefore critical. However, traditional approaches are typically customized and lack the adaptability required to cope with dynamic changes in HPC environments. To address the challenge, Machine Learning (ML) has been increasingly adopted to automate and enhance I/O analysis and optimization. Given sufficient I/O traces from HPC systems, ML can learn underlying I/O behaviors, extract actionable insights, and dynamically adapt to evolving workloads to improve performance. In this survey, we propose a novel taxonomy that aligns HPC I/O problems with learning tasks to systematically review existing studies. Through this taxonomy, we synthesize key findings on research distribution, data preparation, and model selection. Finally, we discuss several directions to advance the effective integration of ML in HPC I/O systems.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"618-632"},"PeriodicalIF":6.0,"publicationDate":"2025-12-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071176","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enabling Tile-Based Direct Query on Adaptively Compressed Data With GPU Acceleration
Pub Date: 2025-12-02 | DOI: 10.1109/TPDS.2025.3639485
Yu Zhang;Feng Zhang;Yani Liu;Huanchen Zhang;Jidong Zhai;Wenchao Zhou;Xiaoyong Du
The explosive growth of data poses significant challenges for GPU-based databases, which must balance limited memory capacity with the need for high-speed query execution. Compression has become an essential technique for optimizing memory utilization and reducing data movement. However, its benefits have been limited by the need for decompression: querying compressed data conventionally requires decompressing it first, which makes the query process significantly slower than a direct query on uncompressed data. To address this problem, this article presents a novel GPU-accelerated tile-based direct query framework that eliminates this limitation, significantly enhancing query performance. By employing direct query strategies, the framework minimizes data movement and maximizes memory bandwidth utilization. It incorporates tile-based hardware-conscious execution strategies for direct query, including memory management and control flow coordination, to improve execution efficiency. Additionally, adaptive data-driven compression formats are paired with tailored SQL operators to enable efficient support for diverse queries. Our experiments, conducted using the Star Schema Benchmark, show an average improvement of 3.5× compared to the state-of-the-art tile-based decompression scheme, while maintaining the space-saving advantages of compression. Notably, our solution consistently outperforms existing direct execution schemes for compressed data across all query types.
{"title":"Enabling Tile-Based Direct Query on Adaptively Compressed Data With GPU Acceleration","authors":"Yu Zhang;Feng Zhang;Yani Liu;Huanchen Zhang;Jidong Zhai;Wenchao Zhou;Xiaoyong Du","doi":"10.1109/TPDS.2025.3639485","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3639485","url":null,"abstract":"The explosive growth of data poses significant challenges for GPU-based databases, which must balance limited memory capacity with the need for high-speed query execution. Compression has become an essential technique for optimizing memory utilization and reducing data movement. However, its benefits have been limited to the necessary data decompression. Querying compressed data conventionally requires decompression, which causes the query process to be significantly slower than a direct query on uncompressed data. To address this problem, this article presents a novel GPU-accelerated tile-based direct query framework that successfully eliminates the limitation, significantly enhancing query performance. By employing direct query strategies, the framework minimizes data movement and maximizes memory bandwidth utilization. It incorporates tile-based hardware-conscious execution strategies for direct query, including memory management and control flow coordination, to improve execution efficiency. Additionally, adaptive data-driven compression formats are paired with tailored SQL operators to enable efficient support for diverse queries. Our experiments, conducted using the Star Schema Benchmark, show an average improvement of 3.5× compared to the state-of-the-art tile-based decompression scheme, while maintaining the space-saving advantages of compression. Notably, our solution consistently outperforms existing direct execution schemes for compressed data across all query types.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"410-426"},"PeriodicalIF":6.0,"publicationDate":"2025-12-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-Rack Aware Recycle Technique in Erasure-Coded Data Centers
Pub Date: 2025-12-01 | DOI: 10.1109/TPDS.2025.3639066
Hai Zhou;Dan Feng
Data centers commonly use erasure codes to maintain high data reliability with lower storage overhead than replication. However, recycling invalid data blocks caused by deletion and update operations is challenging in erasure-coded data centers. Erasure codes organize data blocks into stripes, so invalid blocks cannot simply be deleted, as they can under replication, without compromising the redundancy of the remaining valid blocks within a stripe. Existing studies of recycling in data centers still leave the following problems unaddressed: they ignore heavy cross-rack traffic and load imbalance during recycling, and they incur high disk-seek loads that degrade write performance after recycling. This paper presents the first systematic study on data recycling in erasure-coded data centers and proposes a Cross-rack Aware Recycle (CARecycle) technique. The key idea is migrating valid data blocks from certain stripes to rewrite invalid ones in others, thereby allowing the invalid blocks of certain stripes to be released. Specifically, CARecycle first carefully examines the block distribution of each stripe and generates an efficient recycle solution for migrating and releasing, with the primary objective of reducing cross-rack traffic and the disk-seek load of nodes. Because rewriting invalid data blocks requires updating parity blocks in multiple stripes concurrently, CARecycle further batch-processes multiple stripes and selectively arranges appropriate stripes into each batch to achieve a uniform cross-rack traffic load distribution. In addition, CARecycle can be extended to adapt to different erasure codes and to boost recycling in heterogeneous network environments. Large-scale simulations and Amazon EC2 experiments show that CARecycle can reduce cross-rack traffic by up to 33.8% and recycle time by 28.64%–59.64%, while incurring low disk-seek load, compared to a state-of-the-art recycling technique.
{"title":"Cross-Rack Aware Recycle Technique in Erasure-Coded Data Centers","authors":"Hai Zhou;Dan Feng","doi":"10.1109/TPDS.2025.3639066","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3639066","url":null,"abstract":"Data centers commonly use erasure codes to maintain high data reliability with lower storage overhead than replication. However, recycling invalid data blocks caused by deletion and update operations is challenging in erasure-coded data centers. Erasure codes organize data blocks into stripes, and we cannot directly delete invalid data blocks like replication to ensure the redundancy of the remaining valid blocks within a stripe. When considering the recycling issues in data centers, existing studies still need to address the following problems: ignoring heavy cross-rack traffic and the load imbalance problem during recycling, and incurring high disk seeks that affect writing performance after recycling. This paper presents the first systematic study on data recycling in erasure-coded data centers and proposes a <italic>Cross-rack Aware Recycle</i> (CARecycle) technique. The key idea is migrating valid data blocks from certain stripes to rewrite invalid ones in others, thereby releasing the invalid blocks for certain stripes. Specifically, CARecycle first carefully examines the block distribution for each stripe and generates an efficient recycle solution for migrating and releasing, with the primary objective of reducing cross-rack traffic and disk seek load of nodes. Due to the rewriting of invalid data blocks, parity blocks in multiple stripes need to be updated concurrently. Thus, it further batch processes multiple stripes and selectively arranges appropriate stripes into a batch to achieve uniform cross-rack traffic load distribution. In addition, CARecycle can be extended to adapt to different erasure codes and boost recycling in heterogeneous network environments. Large-scale simulations and Amazon EC2 experiments show that CARecycle can reduce up to 33.8% cross-rack traffic and 28.64%–59.64% recycle time while incurring low disk seek, compared to a state-of-the-art recycling technique.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"365-379"},"PeriodicalIF":6.0,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
EdgeDup: Popularity-Aware Communication-Efficient Decentralized Edge Data Deduplication
Pub Date: 2025-12-01 | DOI: 10.1109/TPDS.2025.3638945
Ruikun Luo;Wang Yang;Qiang He;Feifei Chen;Song Wu;Hai Jin;Yun Yang
Data deduplication, originally designed for cloud storage systems, is increasingly popular in edge storage systems due to the costly, limited resources and the prevalent data redundancy of edge computing environments. The geographical distribution of edge servers makes it difficult to aggregate all data storage information for global decision-making. Existing edge data deduplication (EDD) methods rely on centralized cloud control, which suffers from issues of timeliness and system scalability. Additionally, these methods overlook data popularity, leading to significantly increased data retrieval latency. A promising approach to this challenge is to implement distributed EDD without cloud control, performing regional deduplication centered on the edge server that requires deduplication. However, our investigation reveals that existing distributed EDD approaches either fail to account for the impact of collaborative caching on data availability or generate excessive information exchange between edge servers, leading to high communication overhead. To tackle this challenge, this paper presents EdgeDup, which implements effective EDD in a distributed manner while maintaining low data retrieval latency to ensure data availability. EdgeDup achieves its goals by: 1) identifying data redundancies across different edge servers in the system; 2) deduplicating data based on their popularity; and 3) reducing communication overheads using a novel data dependency index. Extensive experimental results show that EdgeDup significantly enhances performance, reducing data retrieval latency by an average of 47.78% compared to state-of-the-art EDD approaches while maintaining a comparable deduplication ratio.
{"title":"EdgeDup: Popularity-Aware Communication-Efficient Decentralized Edge Data Deduplication","authors":"Ruikun Luo;Wang Yang;Qiang He;Feifei Chen;Song Wu;Hai Jin;Yun Yang","doi":"10.1109/TPDS.2025.3638945","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3638945","url":null,"abstract":"Data deduplication, originally designed for cloud storage systems, is increasingly popular in edge storage systems due to the costly and limited resources and prevalent data redundancy in edge computing environments. The geographical distribution of edge servers poses a challenge in aggregating all data storage information for global decision-making. Existing edge data deduplication (EDD) methods rely on centralized cloud control, which faces issues of timeliness and system scalability. Additionally, these methods overlook data popularity, leading to significantly increased data retrieval latency. A promising approach to this challenge is to implement distributed EDD without cloud control, performing regional deduplication with the edge server requiring deduplication as the center. However, our investigation reveals that existing distributed EDD approaches either fail to account for the impact of collaborative caching on data availability or generate excessive information exchange between edge servers, leading to high communication overhead. To tackle this challenge, this paper presents EdgeDup, which attempts to implement effective EDD in a distributed manner. Additionally, to ensure data availability, EdgeDup aims to maintain low data retrieval latency. EdgeDup achieves its goals by: 1) identifying data redundancies across different edge servers in the system; 2) deduplicating data based on their popularity; and 3) reducing communication overheads using a novel data dependency index. Extensive experimental results show that EdgeDup significantly enhances performance, i.e., reducing data retrieval latency by an average of 47.78% compared to state-of-the-art EDD approaches while maintaining a comparable deduplication ratio.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"459-471"},"PeriodicalIF":6.0,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11271552","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scheduling Jobs Under a Variable Number of Processors
Pub Date: 2025-12-01 | DOI: 10.1109/TPDS.2025.3638703
Anne Benoit;Joachim Cendrier;Frédéric Vivien
Even though it is usually assumed that data centers can always operate at maximum capacity, there have been recent scenarios in which the amount of electricity available to data centers evolves over time. Hence, the number of available processors is no longer constant. In this work, we assume that jobs can be checkpointed before a resource change. Indeed, in the scenarios we consider, the resource provider warns the user before a change in the number of processors. It is thus possible to anticipate and take checkpoints before the change happens, so that no work is ever lost. The goal is then to maximize the goodput and/or the minimum yield of jobs within the next section (the time between two changes in the number of processors). We model the problem and design greedy solutions and sophisticated dynamic programming algorithms, with some optimality results for jobs of infinite duration, and adapt the algorithms to finite jobs. A comprehensive set of simulations, building on real-life job sets, demonstrates the performance of the proposed algorithms. Most algorithms achieve a useful platform utilization (goodput) of over 95%. With infinite jobs, the algorithms also maintain fairness, with a relative minimum yield above 0.8, meaning that each job gets good access to the platform (80% of the time it would have had if each job had received its perfect share of the platform). For finite jobs, the minimum yield can be low, since very short new jobs may have to wait until the beginning of the next section to start (and finish), significantly impacting their yield. However, for 75% of the jobs within each workload, the yield ratio between these jobs is at most a factor of two, demonstrating the fairness of the proposed algorithms.
{"title":"Scheduling Jobs Under a Variable Number of Processors","authors":"Anne Benoit;Joachim Cendrier;Frédéric Vivien","doi":"10.1109/TPDS.2025.3638703","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3638703","url":null,"abstract":"Even though it is usually assumed that data centers can always operate at maximum capacity, there have been recent scenarios where the amount of electricity that can be used by data centers evolve over time. Hence, the number of available processors is not a constant anymore. In this work, we assume that jobs can be checkpointed before a resource change. Indeed, in the scenarios that we consider, the resource provider warns the user before a change in the number of processors. It is thus possible to anticipate and take checkpoints before the change happens, such that no work is ever lost. The goal is then to maximize the goodput and/or the minimum yield of jobs within the next section (time between two changes in the number of processors). We model the problem and design greedy solutions and sophisticated dynamic programming algorithms with some optimality results for jobs of infinite duration, and adapt the algorithms to finite jobs. A comprehensive set of simulations, building on real-life job sets, demonstrates the performance of the proposed algorithms. Most algorithms achieve a useful platform utilization (goodput) of over 95%. With infinite jobs, the algorithms also keep fairness by having a relative minimum yield above 0.8, meaning that each job gets a good access to the platform (80% of the time that it would have had if each job had its perfect share of the platform). For finite jobs, the minimum yield can be low since very short new jobs may have to wait until the beginning of the next section to start (and finish), significantly impacting their yield. However, for 75% of the jobs within each workload, the yield ratio between these jobs is at most at a factor two, hence demonstrating the fairness of the proposed algorithms.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"427-442"},"PeriodicalIF":6.0,"publicationDate":"2025-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778314","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FLUXLog: A Federated Mixture-of-Experts Framework for Unified Log Anomaly Detection
Pub Date: 2025-11-28 | DOI: 10.1109/TPDS.2025.3638693
Yixiao Xia;Yinghui Zhao;Jian Wan;Congfeng Jiang
Traditional log anomaly detection systems are centralized, which poses a risk of privacy leakage during data transmission. Previous research has mainly focused on single-domain logs, requiring domain-specific models and retraining, which limits flexibility and scalability. In this paper, we propose a unified federated cross-domain log anomaly detection approach, FLUXLog, which is based on a Mixture of Experts (MoE) architecture to handle heterogeneous log data. Based on our insights, we establish a two-phase training process: pre-training the gating network to assign expert weights based on the data distribution, followed by expert-driven top-down feature fusion. Subsequent training of the gating network fine-tunes adapters, providing the flexibility needed for the model to adapt across domains while maintaining expert specialization. This training paradigm enables a Hybrid Specialization Strategy, fostering both domain-specific expertise and cross-domain generalization. The Cross Gated-Experts Module (CGEM) then fuses expert weights and dual-channel outputs. Experiments on public datasets demonstrate that our model outperforms baseline models in handling unified cross-domain log data.
{"title":"FLUXLog: A Federated Mixture-of-Experts Framework for Unified Log Anomaly Detection","authors":"Yixiao Xia;Yinghui Zhao;Jian Wan;Congfeng Jiang","doi":"10.1109/TPDS.2025.3638693","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3638693","url":null,"abstract":"Traditional log anomaly detection systems are centralized, which poses the risk of privacy leakage during data transmission. Previous research mainly focuses on single-domain logs, requiring domain-specific models and retraining, which limits flexibility and scalability. In this paper, we propose a unified federated cross-domain log anomaly detection approach, FLUXLog, which is based on MoE (Mixture of Experts) to handle heterogeneous log data. Based on our insights, we establish a two-phase training process: pre-training the gating network to assign expert weights based on data distribution, followed by expert-driven top-down feature fusion. The following training of the gating network is based on fine-tuning the adapters, providing the necessary flexibility for the model to adapt across domains while maintaining expert specialization. This training paradigm enables a Hybrid Specialization Strategy, fostering both domain-specific expertise and cross-domain generalization. The Cross Gated-Experts Module (CGEM) then fuses expert weights and dual-channel outputs. Experiments on public datasets demonstrate that our model outperforms baseline models in handling unified cross-domain log data.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"395-409"},"PeriodicalIF":6.0,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11271152","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778215","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters
Pub Date: 2025-11-28 | DOI: 10.1109/TPDS.2025.3638428
Yifan Sui;Hanfei Yu;Yitao Hu;Jianxun Li;Hao Wang
Serverless computing has emerged as a novel paradigm in cloud computing, characterized by its agile scalability, cost-effective pay-as-you-go billing, and user-friendly support for Machine Learning (ML) inference tasks. Developers wrap their ML algorithms into serverless functions and run them in containers. However, the well-known cold-start problem significantly increases function response times. To address cold-starts, pre-warming, which proactively maintains containers in a warm state, has gained widespread adoption in both research and industry. Nevertheless, we observed that pre-warming does not address the distinct delays caused by loading ML artifacts. According to our analysis, in ML inference functions, the time required to load libraries and models significantly exceeds the time needed to warm containers. Thus, relying solely on pre-warming is insufficient for mitigating cold-starts. This paper presents Tyche, an opportunistic pre-loading approach designed to eliminate the latency associated with loading ML artifacts, enabling near-instant inference and minimizing function execution time. Tyche fully leverages the idle memory in warmed containers and GPUs to pre-load required libraries and models, striking an optimal balance between acceleration and resource efficiency. Additionally, Tyche is tailored for large-scale serverless platforms, incorporating cluster-wide scheduling and lightweight locality-aware load balancing to enhance performance. We design Tyche to be transparent to providers and compatible with existing pre-warming solutions. Experiments on OpenWhisk with real-world workloads show that Tyche reduces loading latency by up to 93% and achieves up to 8× speedup compared to state-of-the-art pre-warming solutions. Compared with the state-of-the-art serverless pre-loading solution, Tyche also achieves up to 1.9× speedup.
{"title":"Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters","authors":"Yifan Sui;Hanfei Yu;Yitao Hu;Jianxun Li;Hao Wang","doi":"10.1109/TPDS.2025.3638428","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3638428","url":null,"abstract":"Serverless computing has emerged as a novel paradigm in cloud computing, characterized by its agile scalability, cost-effective pay-as-you-go billing, and user-friendly capabilities for Machine Learning (ML) inference tasks. Developers wrap their ML algorithms into serverless functions and run them in containers. However, the well-known cold-start problem significantly slows down the response time of functions. To address cold-starts, the technique of pre-warming, which proactively maintains containers in a warm state, has gained widespread adoption across both research and industry. Nevertheless, we observed that pre-warming does not address the distinct delays caused by the loading of ML artifacts. According to our analysis, in ML inference functions, the time required to load libraries and models significantly exceeds the time needed to warm containers. Thus, relying solely on pre-warming is insufficient for mitigating cold-starts. This paper presents <italic>Tyche</i>, an opportunistic pre-loading approach designed to eliminate the latency associated with loading ML artifacts, enabling near-instant inference and minimizing function execution time. <italic>Tyche</i> fully leverages the idle memory in warmed containers and GPUs to pre-load required libraries and models, striking an optimal balance between acceleration and resource efficiency. Additionally, <italic>Tyche</i> is tailored for large-scale serverless platforms, incorporating cluster-wide scheduling and lightweight locality-aware load balancing to enhance performance. We design <italic>Tyche</i> to be transparent to providers and compatible with existing pre-warming solutions. Experiments on OpenWhisk with real-world workloads show that <italic>Tyche</i> reduces up to 93% loading latency and achieves up to 8× speedup compared to state-of-the-art pre-warming solutions. Compared with the state-of-the-art serverless pre-loading solution, <italic>Tyche</i> also achieves up to 1.9× speedup.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"472-488"},"PeriodicalIF":6.0,"publicationDate":"2025-11-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145830972","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Concurrent and Orthogonal Software Power Meters for Accurate Runtime Energy Profiling of Parallel Hybrid Programs on Heterogeneous Hybrid Servers
Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637511
Hafiz Adnan Niaz;Ravi Reddy Manumachu;Alexey Lastovetsky
Energy predictive models employing performance events have emerged as a promising alternative to other mainstream methods for developing software power meters used in runtime energy profiling of applications. These models are cost-effective and provide a highly accurate means of measuring the energy consumption of applications during execution. Recently, software power meters have been proposed to profile the dynamic energy consumption of data transfers between CPU and GPU in heterogeneous hybrid platforms, thereby effectively closing the gap between software power meters that measure computations and those that measure data transfers. However, state-of-the-art software power meters lack fundamental properties essential for achieving accurate runtime energy profiling of parallel hybrid programs on heterogeneous hybrid servers. Two critical properties are concurrency and orthogonality. In this work, we define these essential properties and propose a methodology for developing concurrent and orthogonal platform-level software power meters capable of accurate runtime energy profiling of parallel hybrid programs on heterogeneous hybrid servers. We apply this methodology to develop software power meters for three heterogeneous hybrid servers that consist of Intel multicore CPUs and Nvidia GPUs from different generations. Furthermore, we demonstrate the accuracy and efficiency of the proposed software power meters by using them to estimate the dynamic energy consumption of computation and communication activities in three parallel hybrid programs. Our results show that the average prediction error for dynamic energy consumption by these software power meters is just 2.5% across our servers.
{"title":"Concurrent and Orthogonal Software Power Meters for Accurate Runtime Energy Profiling of Parallel Hybrid Programs on Heterogeneous Hybrid Servers","authors":"Hafiz Adnan Niaz;Ravi Reddy Manumachu;Alexey Lastovetsky","doi":"10.1109/TPDS.2025.3637511","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637511","url":null,"abstract":"Energy predictive models employing performance events have emerged as a promising alternative to other mainstream methods for developing software power meters used in runtime energy profiling of applications. These models are cost-effective and provide a highly accurate means of measuring the energy consumption of applications during execution. Recently, software power meters have been proposed to profile the dynamic energy consumption of data transfers between CPU and GPU in heterogeneous hybrid platforms, thereby effectively addressing the gap between software power meters that measure computations and those that measure data transfers. However, the state-of-the-art software power meters lack fundamental properties essential for achieving accurate runtime energy profiling of parallel hybrid programs on heterogeneous hybrid servers. Two critical properties are <italic>concurrency</i> and <italic>orthogonality</i>. In this work, we define these essential properties and propose a methodology for developing concurrent and orthogonal platform-level software power meters capable of accurate runtime energy profiling of parallel hybrid programs on heterogeneous hybrid servers. We apply this methodology to develop software power meters for three heterogeneous hybrid servers that consist of Intel multicore CPUs and Nvidia GPUs from different generations. Furthermore, we demonstrate the accuracy and efficiency of the proposed software power meters by using them to estimate the dynamic energy consumption of computation and communication activities in three parallel hybrid programs. Our results show that the average prediction error for dynamic energy consumption by these software power meters is just 2.5% across our servers.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 2","pages":"322-339"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11269896","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778214","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MAMILS: A Memory-Aware Multiobjective Scheduler for Real-Time Embedded EEG Depression Diagnosis
Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637175
Fuze Tian;Lixin Zhang;Qi Pan;Jingyu Liu;Qinglin Zhao;Bin Hu
Depression detection using Electroencephalogram (EEG) signals obtained from wearable medical-assisted diagnostic systems has become a well-established approach in the field of affective disorders. However, despite recent advancements, on-board Artificial Intelligence (AI) models still demand substantial computational resources, presenting significant challenges for deployment on resource-constrained wearable medical devices. Embedded Multi-core Processors (MPs) offer a promising solution for accelerating these models. However, the limited computational capabilities of embedded MPs, combined with the structural diversity of AI models, complicate resource allocation and increase associated costs. To address these challenges, we propose a Memory-Aware Multi-Objective Iterative Local Search (MAMILS) algorithm to optimize task scheduling, thereby improving the efficiency of AI model deployment on wearable EEG devices. Experimental results across seven AI models demonstrate that MAMILS yields substantial improvements in key performance indicators: total energy consumption (TEC), with an average reduction of 47.57%; makespan, with an average reduction of 48.75%; and throughput, with an average increase of 198.37%, all while maintaining satisfactory classification performance for both Machine Learning (ML) and Deep Learning (DL) models. In particular, on-board deployment of EEGNeX achieves an accuracy of 93.4%, sensitivity of 91.6%, and specificity of 95.8%. Further analysis indicates that, when integrated with wearable EEG sensors and executable on-board AI models, the proposed MAMILS optimization strategy shows significant promise in facilitating the widespread adoption of low-power, real-time diagnostic systems for depression detection.
{"title":"MAMILS: A Memory-Aware Multiobjective Scheduler for Real-Time Embedded EEG Depression Diagnosis","authors":"Fuze Tian;Lixin Zhang;Qi Pan;Jingyu Liu;Qinglin Zhao;Bin Hu","doi":"10.1109/TPDS.2025.3637175","DOIUrl":"https://doi.org/10.1109/TPDS.2025.3637175","url":null,"abstract":"Depression detection using Electroencephalogram (EEG) signals obtained from wearable medical-assisted diagnostic systems has become a well-established approach in the field of affective disorders. However, despite recent advancements, on-board Artificial Intelligence (AI) models still demand substantial computational resources, presenting significant challenges for deployment on resource-constrained wearable medical devices. Embedded Multi-core Processors (MPs) offer a promising solution for accelerating these models. However, the limited computational capabilities of embedded MPs, combined with the structural diversity of AI models, complicate resource allocation and increase associated costs. To address these challenges, we propose a Memory-Aware Multi-Objective Iterative Local Search (MAMILS) algorithm to optimize task scheduling, thereby improving the efficiency of AI model deployment on wearable EEG devices. Experimental results across seven AI models demonstrate that, the MAMILS approach yields substantial improvements in key performance indicators: Total Energy Consumption (<inline-formula><tex-math>$bm {TEC}$</tex-math></inline-formula>) with an average reduction of 47.57%, <inline-formula><tex-math>$bm {Makespan}$</tex-math></inline-formula> with an average reduction of 48.75%, and <inline-formula><tex-math>$bm {Throughput}$</tex-math></inline-formula> with an average increase of 198.37%, all while maintaining satisfactory classification performance for both Machine Learning (ML) and Deep Learning (DL) models. Especially, on-board deployment of EEGNeX achieves an accuracy of 93.4%, sensitivity of 91.6%, and specificity of 95.8%. Further analysis indicates that, when integrated with wearable EEG sensors and executable on-board AI models, the proposed MAMILS optimization strategy shows significant promise in facilitating the widespread adoption of low-power, real-time diagnostic systems for depression detection.","PeriodicalId":13257,"journal":{"name":"IEEE Transactions on Parallel and Distributed Systems","volume":"37 3","pages":"600-617"},"PeriodicalIF":6.0,"publicationDate":"2025-11-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146071187","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}