
Latest Publications in IEEE Transactions on Parallel and Distributed Systems

Computational Burst Buffers: Accelerating HPC I/O via In-Storage Compression Offloading
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-11 | DOI: 10.1109/TPDS.2025.3643175
Xiang Chen;Bing Lu;Haoquan Long;Huizhang Luo;Yili Ma;Guangming Tan;Dingwen Tao;Fei Wu;Tao Lu
Burst buffers (BBs) act as an intermediate storage layer between compute nodes and parallel file systems (PFS), effectively alleviating the I/O performance gap in high-performance computing (HPC). As scientific simulations and AI workloads generate larger checkpoints and analysis outputs, BB capacity shortages and PFS bandwidth bottlenecks are emerging, and CPU-based compression is not an effective solution due to its high overhead. We introduce Computational Burst Buffers (CBBs), a storage paradigm that embeds hardware compression engines such as application-specific integrated circuit (ASIC) inside computational storage drives (CSDs) at the BB tier. CBB transparently offloads both lossless and error-bounded lossy compression from CPUs to CSDs, thereby (i) expanding effective SSD-backed BB capacity, (ii) reducing BB–PFS traffic, and (iii) eliminating contention and energy overheads of CPU-based compression. Unlike prior CSD-based compression designs targeting databases or flash caching, CBB co-designs the burst-buffer layer and CSD hardware for HPC and quantitatively evaluates compression offload in BB–PFS hierarchies. We prototype CBB using a PCIe 5.0 CSD with an ASIC Zstd-like compressor and an FPGA prototype of an SZ entropy encoder, and evaluate CBB on a 16-node cluster. Experiments with four representative HPC applications and a large-scale workflow simulator show up to 61% lower application runtime, 8–12× higher cache hit ratios, and substantially reduced compute-node CPU utilization compared to software compression and conventional BBs. These results demonstrate that compression-aware BBs with CSDs provide a practical, scalable path to next-generation HPC storage.
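As a rough illustration of the write path this abstract describes, the sketch below models compression running on the drive side rather than on the compute-node CPU, so that only compressed bytes occupy the burst buffer and cross the BB-PFS link. It is a toy model under stated assumptions: zlib stands in for the ASIC Zstd-like engine, and the CSDCompressor class and drain_to_pfs helper are hypothetical names, not part of the CBB prototype.

```python
# Illustrative sketch only: models the offloaded write path described in the abstract,
# where compression runs "inside" the drive rather than on the compute-node CPU.
# zlib stands in for the ASIC Zstd-like engine; class/function names are hypothetical.
import zlib

class CSDCompressor:
    """Pretend in-storage compression engine attached to a burst-buffer SSD."""
    def __init__(self, level: int = 3):
        self.level = level
        self.stored_bytes = 0          # compressed bytes resident on the BB tier

    def offloaded_write(self, raw: bytes) -> int:
        """Compress on the 'device' and store; the host CPU never touches the codec."""
        compressed = zlib.compress(raw, self.level)
        self.stored_bytes += len(compressed)
        return len(compressed)

def drain_to_pfs(csd: CSDCompressor) -> int:
    """Flush the BB tier to the PFS; only compressed bytes cross the BB-PFS link."""
    moved = csd.stored_bytes
    csd.stored_bytes = 0
    return moved

if __name__ == "__main__":
    csd = CSDCompressor()
    checkpoint = b"temperature=300.0 pressure=1.0 " * 4096   # highly regular HPC-like data
    raw_total = 0
    for _ in range(8):                 # eight checkpoint writes from compute nodes
        raw_total += len(checkpoint)
        csd.offloaded_write(checkpoint)
    pfs_traffic = drain_to_pfs(csd)
    print(f"raw bytes written      : {raw_total}")
    print(f"bytes sent to the PFS  : {pfs_traffic}")
    print(f"effective BB expansion : {raw_total / max(pfs_traffic, 1):.1f}x")
```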
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 518-532.
Citations: 0
A Survey on Machine Learning-Based HPC I/O Analysis and Optimization
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-03 | DOI: 10.1109/TPDS.2025.3639682
Jingxian Peng;Lihua Yang;Huijun Wu;Wenzhe Zhang;Zhenwei Wu;Wei Zhang;Jiaxin Li;Yiqin Dai;Yong Dong
The soaring computing power of HPC systems supports numerous large-scale applications, which generate massive data volumes and diverse I/O patterns, leading to severe I/O bottlenecks. Analyzing and optimizing HPC I/O is therefore critical. However, traditional approaches are typically customized and lack the adaptability required to cope with dynamic changes in HPC environments. To address the challenge, Machine Learning (ML) has been increasingly adopted to automate and enhance I/O analysis and optimization. Given sufficient I/O traces from HPC systems, ML can learn underlying I/O behaviors, extract actionable insights, and dynamically adapt to evolving workloads to improve performance. In this survey, we propose a novel taxonomy that aligns HPC I/O problems with learning tasks to systematically review existing studies. Through this taxonomy, we synthesize key findings on research distribution, data preparation, and model selection. Finally, we discuss several directions to advance the effective integration of ML in HPC I/O systems.
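To make the survey's premise concrete (that, given enough traces, ML can learn underlying I/O behavior), here is a minimal, hypothetical example of casting an I/O performance question as a supervised learning task: synthetic trace features are fitted with least squares to predict achieved bandwidth. The feature set, numbers, and model are illustrative assumptions, not drawn from the surveyed papers.

```python
# Hedged illustration (not from the survey itself): casting an HPC I/O problem as a
# supervised learning task. Synthetic "trace" features stand in for real I/O logs
# (e.g., Darshan-style counters); feature names and values are hypothetical.
import numpy as np

rng = np.random.default_rng(0)

# Features per job: [request size (MiB), number of ranks doing I/O, fraction of random accesses]
X = rng.uniform([1, 1, 0.0], [64, 512, 1.0], size=(200, 3))

# Synthetic ground truth: bandwidth grows with request size and ranks, drops with randomness.
bandwidth = 5.0 * np.log1p(X[:, 0]) + 0.02 * X[:, 1] - 8.0 * X[:, 2] + rng.normal(0, 0.5, 200)

# Fit a linear model y ~ [1, X] with least squares, one of the simplest learning tasks
# that such a taxonomy would map an I/O performance-prediction problem onto.
A = np.hstack([np.ones((200, 1)), X])
coef, *_ = np.linalg.lstsq(A, bandwidth, rcond=None)

new_job = np.array([1.0, 32.0, 256.0, 0.1])     # [bias, MiB, ranks, random fraction]
print("predicted bandwidth (synthetic units):", float(new_job @ coef))
```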
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 3, pp. 618-632.
Citations: 0
Enabling Tile-Based Direct Query on Adaptively Compressed Data With GPU Acceleration
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-02 | DOI: 10.1109/TPDS.2025.3639485
Yu Zhang;Feng Zhang;Yani Liu;Huanchen Zhang;Jidong Zhai;Wenchao Zhou;Xiaoyong Du
The explosive growth of data poses significant challenges for GPU-based databases, which must balance limited memory capacity with the need for high-speed query execution. Compression has become an essential technique for optimizing memory utilization and reducing data movement. However, its benefits have been limited to the necessary data decompression. Querying compressed data conventionally requires decompression, which causes the query process to be significantly slower than a direct query on uncompressed data. To address this problem, this article presents a novel GPU-accelerated tile-based direct query framework that successfully eliminates the limitation, significantly enhancing query performance. By employing direct query strategies, the framework minimizes data movement and maximizes memory bandwidth utilization. It incorporates tile-based hardware-conscious execution strategies for direct query, including memory management and control flow coordination, to improve execution efficiency. Additionally, adaptive data-driven compression formats are paired with tailored SQL operators to enable efficient support for diverse queries. Our experiments, conducted using the Star Schema Benchmark, show an average improvement of 3.5× compared to the state-of-the-art tile-based decompression scheme, while maintaining the space-saving advantages of compression. Notably, our solution consistently outperforms existing direct execution schemes for compressed data across all query types.
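A minimal sketch of the "direct query on compressed data" idea, shown on a run-length-encoded column in plain Python: a filter-and-count query is evaluated once per run instead of once per raw value. The actual framework is GPU-accelerated, tile-based, and uses adaptive formats; the RLE format and function names here are assumptions for illustration.

```python
# Minimal CPU sketch of "direct query on compressed data": a filter + count aggregate is
# evaluated on a run-length-encoded (RLE) column without materializing the raw values.
# The paper's system is GPU-accelerated, tile-based, and uses adaptive formats; this RLE
# layout and these function names are illustrative assumptions.
from itertools import groupby

def rle_encode(values):
    """Compress a column into (value, run_length) pairs."""
    return [(v, len(list(g))) for v, g in groupby(values)]

def count_where_direct(rle_column, predicate):
    """Evaluate COUNT(*) WHERE predicate(value) directly on the compressed runs:
    each run is tested once, however long it is."""
    return sum(run for value, run in rle_column if predicate(value))

def count_where_decompress(rle_column, predicate):
    """Baseline: decompress first, then scan every raw value."""
    raw = [value for value, run in rle_column for _ in range(run)]
    return sum(1 for v in raw if predicate(v))

if __name__ == "__main__":
    column = [1] * 1000 + [7] * 500 + [3] * 2000       # sorted/clustered column compresses well
    compressed = rle_encode(column)
    assert count_where_direct(compressed, lambda v: v > 2) == \
           count_where_decompress(compressed, lambda v: v > 2) == 2500
    print("runs touched by direct query:", len(compressed), "vs raw values scanned:", len(column))
```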
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 410-426.
Citations: 0
Cross-Rack Aware Recycle Technique in Erasure-Coded Data Centers
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-01 | DOI: 10.1109/TPDS.2025.3639066
Hai Zhou;Dan Feng
Data centers commonly use erasure codes to maintain high data reliability with lower storage overhead than replication. However, recycling invalid data blocks caused by deletion and update operations is challenging in erasure-coded data centers. Erasure codes organize data blocks into stripes, and we cannot directly delete invalid data blocks like replication to ensure the redundancy of the remaining valid blocks within a stripe. When considering the recycling issues in data centers, existing studies still need to address the following problems: ignoring heavy cross-rack traffic and the load imbalance problem during recycling, and incurring high disk seeks that affect writing performance after recycling. This paper presents the first systematic study on data recycling in erasure-coded data centers and proposes a Cross-rack Aware Recycle (CARecycle) technique. The key idea is migrating valid data blocks from certain stripes to rewrite invalid ones in others, thereby releasing the invalid blocks for certain stripes. Specifically, CARecycle first carefully examines the block distribution for each stripe and generates an efficient recycle solution for migrating and releasing, with the primary objective of reducing cross-rack traffic and disk seek load of nodes. Due to the rewriting of invalid data blocks, parity blocks in multiple stripes need to be updated concurrently. Thus, it further batch processes multiple stripes and selectively arranges appropriate stripes into a batch to achieve uniform cross-rack traffic load distribution. In addition, CARecycle can be extended to adapt to different erasure codes and boost recycling in heterogeneous network environments. Large-scale simulations and Amazon EC2 experiments show that CARecycle can reduce up to 33.8% cross-rack traffic and 28.64%–59.64% recycle time while incurring low disk seek, compared to a state-of-the-art recycling technique.
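The migrate-and-rewrite step at the heart of this abstract can be pictured with a toy stripe model: valid blocks from a donor stripe overwrite invalid slots in a target stripe, after which the donor stripe can be released. Parity recomputation, rack topology, and batching, which are where CARecycle's actual contributions lie, are deliberately left out; the data structures below are hypothetical.

```python
# Toy sketch of the migrate-and-rewrite idea from the abstract: valid blocks from a donor
# stripe overwrite invalid (deleted/updated) slots in a target stripe, so the donor stripe
# can be released. Parity updates, rack placement, and batching are intentionally omitted.
from dataclasses import dataclass, field

@dataclass
class Stripe:
    blocks: list = field(default_factory=list)   # each entry: ("valid", data) or ("invalid", None)

    def invalid_slots(self):
        return [i for i, (state, _) in enumerate(self.blocks) if state == "invalid"]

    def valid_blocks(self):
        return [(i, data) for i, (state, data) in enumerate(self.blocks) if state == "valid"]

def recycle(target: Stripe, donor: Stripe) -> bool:
    """Move the donor's valid blocks into the target's invalid slots.
    Returns True when the donor is fully drained and can be released."""
    slots = target.invalid_slots()
    movable = donor.valid_blocks()
    if len(movable) > len(slots):
        return False                              # target cannot absorb the donor
    for (src_idx, data), dst_idx in zip(movable, slots):
        target.blocks[dst_idx] = ("valid", data)  # rewrite: would trigger a parity update in reality
        donor.blocks[src_idx] = ("invalid", None)
    return all(state == "invalid" for state, _ in donor.blocks)

if __name__ == "__main__":
    target = Stripe([("valid", "a"), ("invalid", None), ("invalid", None), ("valid", "d")])
    donor = Stripe([("invalid", None), ("valid", "x"), ("valid", "y"), ("invalid", None)])
    print("donor stripe released:", recycle(target, donor))   # True: its space can be reclaimed
```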
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 365-379.
Citations: 0
EdgeDup: Popularity-Aware Communication-Efficient Decentralized Edge Data Deduplication
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-01 | DOI: 10.1109/TPDS.2025.3638945
Ruikun Luo;Wang Yang;Qiang He;Feifei Chen;Song Wu;Hai Jin;Yun Yang
Data deduplication, originally designed for cloud storage systems, is increasingly popular in edge storage systems due to the costly and limited resources and prevalent data redundancy in edge computing environments. The geographical distribution of edge servers poses a challenge in aggregating all data storage information for global decision-making. Existing edge data deduplication (EDD) methods rely on centralized cloud control, which faces issues of timeliness and system scalability. Additionally, these methods overlook data popularity, leading to significantly increased data retrieval latency. A promising approach to this challenge is to implement distributed EDD without cloud control, performing regional deduplication with the edge server requiring deduplication as the center. However, our investigation reveals that existing distributed EDD approaches either fail to account for the impact of collaborative caching on data availability or generate excessive information exchange between edge servers, leading to high communication overhead. To tackle this challenge, this paper presents EdgeDup, which attempts to implement effective EDD in a distributed manner. Additionally, to ensure data availability, EdgeDup aims to maintain low data retrieval latency. EdgeDup achieves its goals by: 1) identifying data redundancies across different edge servers in the system; 2) deduplicating data based on their popularity; and 3) reducing communication overheads using a novel data dependency index. Extensive experimental results show that EdgeDup significantly enhances performance, i.e., reducing data retrieval latency by an average of 47.78% compared to state-of-the-art EDD approaches while maintaining a comparable deduplication ratio.
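A hedged sketch of popularity-aware deduplication: chunks are fingerprinted by content hash, and only chunks that are both duplicated on neighboring servers and unpopular locally are proposed for removal, so hot data keeps nearby copies and retrieval latency stays low. The threshold, class names, and the neighbor set are assumptions; EdgeDup's decentralized index and data-dependency structure are not modeled.

```python
# Hedged sketch of popularity-aware deduplication: chunks are identified by content hash,
# but a chunk "hot" enough (popularity above a threshold) keeps its local copy so nearby
# requests are still served quickly. The decentralized index and the data-dependency
# structure from the paper are not modeled; thresholds and names are assumptions.
import hashlib
from collections import defaultdict

POPULARITY_THRESHOLD = 3          # hypothetical: hotter chunks keep their replicas

class EdgeServerStore:
    def __init__(self):
        self.chunks = {}                          # fingerprint -> data
        self.access_count = defaultdict(int)      # fingerprint -> popularity proxy

    def put(self, data: bytes) -> str:
        fp = hashlib.sha256(data).hexdigest()
        self.chunks.setdefault(fp, data)          # duplicate writes store nothing new
        return fp

    def get(self, fp: str) -> bytes:
        self.access_count[fp] += 1
        return self.chunks[fp]

def dedup_candidates(store: EdgeServerStore, duplicates_elsewhere: set) -> list:
    """Chunks that also exist on neighboring servers and are unpopular locally:
    these are safe to drop here and fetch remotely on a (rare) miss."""
    return [fp for fp in store.chunks
            if fp in duplicates_elsewhere and store.access_count[fp] < POPULARITY_THRESHOLD]

if __name__ == "__main__":
    store = EdgeServerStore()
    hot = store.put(b"viral video segment")
    cold = store.put(b"rarely watched archive segment")
    for _ in range(5):
        store.get(hot)                             # hot chunk is accessed often
    print("only the cold duplicate is removable:", dedup_candidates(store, {hot, cold}) == [cold])
```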
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 459-471.
Citations: 0
Scheduling Jobs Under a Variable Number of Processors
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-12-01 | DOI: 10.1109/TPDS.2025.3638703
Anne Benoit;Joachim Cendrier;Frédéric Vivien
Even though it is usually assumed that data centers can always operate at maximum capacity, there have been recent scenarios where the amount of electricity that can be used by data centers evolves over time. Hence, the number of available processors is not a constant anymore. In this work, we assume that jobs can be checkpointed before a resource change. Indeed, in the scenarios that we consider, the resource provider warns the user before a change in the number of processors. It is thus possible to anticipate and take checkpoints before the change happens, such that no work is ever lost. The goal is then to maximize the goodput and/or the minimum yield of jobs within the next section (time between two changes in the number of processors). We model the problem and design greedy solutions and sophisticated dynamic programming algorithms with some optimality results for jobs of infinite duration, and adapt the algorithms to finite jobs. A comprehensive set of simulations, building on real-life job sets, demonstrates the performance of the proposed algorithms. Most algorithms achieve a useful platform utilization (goodput) of over 95%. With infinite jobs, the algorithms also keep fairness by having a relative minimum yield above 0.8, meaning that each job gets good access to the platform (80% of the time that it would have had if each job had its perfect share of the platform). For finite jobs, the minimum yield can be low since very short new jobs may have to wait until the beginning of the next section to start (and finish), significantly impacting their yield. However, for 75% of the jobs within each workload, the yield ratio between these jobs is at most a factor of two, hence demonstrating the fairness of the proposed algorithms.
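The minimum-yield objective can be pictured with a simple water-filling allocation within one section: processors are handed out one at a time, always to the job with the lowest current yield (allocated share over requested share). This greedy sketch under assumed job requests is an illustration only, not the greedy or dynamic-programming schedulers proposed in the paper.

```python
# Simple sketch of the fairness objective in the abstract: within one section (between two
# changes in the processor count), share the P available processors so that the minimum
# "yield" (allocated / requested processors, capped at 1) is as large as possible.
# Greedy water-filling illustration; job names and requests are made up.
import heapq

def allocate(requests: dict, processors: int) -> dict:
    """Hand out processors one by one, always to the job with the lowest current yield."""
    alloc = {job: 0 for job in requests}
    heap = [(0.0, job) for job in requests]        # (current yield, job); all yields start at 0
    heapq.heapify(heap)
    while processors > 0 and heap:
        _, job = heapq.heappop(heap)
        if alloc[job] >= requests[job]:
            continue                               # job already has everything it asked for
        alloc[job] += 1
        processors -= 1
        heapq.heappush(heap, (alloc[job] / requests[job], job))
    return alloc

if __name__ == "__main__":
    requests = {"lattice_qcd": 8, "climate": 4, "genomics": 2}
    alloc = allocate(requests, processors=7)       # section with only 7 processors available
    yields = {j: alloc[j] / requests[j] for j in requests}
    print(alloc, "minimum yield:", round(min(yields.values()), 2))
```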
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 427-442.
Citations: 0
FLUXLog: A Federated Mixture-of-Experts Framework for Unified Log Anomaly Detection
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-28 | DOI: 10.1109/TPDS.2025.3638693
Yixiao Xia;Yinghui Zhao;Jian Wan;Congfeng Jiang
Traditional log anomaly detection systems are centralized, which poses the risk of privacy leakage during data transmission. Previous research mainly focuses on single-domain logs, requiring domain-specific models and retraining, which limits flexibility and scalability. In this paper, we propose a unified federated cross-domain log anomaly detection approach, FLUXLog, which is based on MoE (Mixture of Experts) to handle heterogeneous log data. Based on our insights, we establish a two-phase training process: pre-training the gating network to assign expert weights based on data distribution, followed by expert-driven top-down feature fusion. The following training of the gating network is based on fine-tuning the adapters, providing the necessary flexibility for the model to adapt across domains while maintaining expert specialization. This training paradigm enables a Hybrid Specialization Strategy, fostering both domain-specific expertise and cross-domain generalization. The Cross Gated-Experts Module (CGEM) then fuses expert weights and dual-channel outputs. Experiments on public datasets demonstrate that our model outperforms baseline models in handling unified cross-domain log data.
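The gating idea can be illustrated with a generic mixture-of-experts forward pass: a gating network maps an encoded log window to per-expert weights, and the fused score is the weighted sum of the experts' outputs. All shapes and weights below are synthetic assumptions, and the federated training, adapters, and Cross Gated-Experts Module of FLUXLog are not modeled.

```python
# Generic mixture-of-experts forward pass, as a hedged illustration of the gating idea in
# the abstract: a gating network turns an input log representation into per-expert weights,
# and the final score is the weighted sum of the experts' outputs. Shapes and weights are
# made up; FLUXLog's federated training, adapters, and CGEM are not modeled here.
import numpy as np

rng = np.random.default_rng(42)

DIM, N_EXPERTS = 16, 3
gate_W = rng.normal(size=(N_EXPERTS, DIM))            # gating network (one linear layer here)
expert_W = rng.normal(size=(N_EXPERTS, DIM))          # each "expert" is one linear scorer

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def moe_anomaly_score(log_embedding: np.ndarray) -> float:
    """Route one embedded log sequence through all experts and fuse by gate weights."""
    gate = softmax(gate_W @ log_embedding)             # expert weights, sum to 1
    expert_scores = expert_W @ log_embedding           # one scalar per expert
    return float(gate @ expert_scores)                 # fused anomaly score

if __name__ == "__main__":
    x = rng.normal(size=DIM)                           # stand-in for an encoded log window
    print("fused anomaly score:", round(moe_anomaly_score(x), 3))
```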
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 395-409.
Citations: 0
Accelerating ML Inference via Opportunistic Pre-Loading on Serverless Clusters
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-28 | DOI: 10.1109/TPDS.2025.3638428
Yifan Sui;Hanfei Yu;Yitao Hu;Jianxun Li;Hao Wang
Serverless computing has emerged as a novel paradigm in cloud computing, characterized by its agile scalability, cost-effective pay-as-you-go billing, and user-friendly capabilities for Machine Learning (ML) inference tasks. Developers wrap their ML algorithms into serverless functions and run them in containers. However, the well-known cold-start problem significantly slows down the response time of functions. To address cold-starts, the technique of pre-warming, which proactively maintains containers in a warm state, has gained widespread adoption across both research and industry. Nevertheless, we observed that pre-warming does not address the distinct delays caused by the loading of ML artifacts. According to our analysis, in ML inference functions, the time required to load libraries and models significantly exceeds the time needed to warm containers. Thus, relying solely on pre-warming is insufficient for mitigating cold-starts. This paper presents Tyche, an opportunistic pre-loading approach designed to eliminate the latency associated with loading ML artifacts, enabling near-instant inference and minimizing function execution time. Tyche fully leverages the idle memory in warmed containers and GPUs to pre-load required libraries and models, striking an optimal balance between acceleration and resource efficiency. Additionally, Tyche is tailored for large-scale serverless platforms, incorporating cluster-wide scheduling and lightweight locality-aware load balancing to enhance performance. We design Tyche to be transparent to providers and compatible with existing pre-warming solutions. Experiments on OpenWhisk with real-world workloads show that Tyche reduces up to 93% loading latency and achieves up to 8× speedup compared to state-of-the-art pre-warming solutions. Compared with the state-of-the-art serverless pre-loading solution, Tyche also achieves up to 1.9× speedup.
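The opportunistic pre-loading decision can be sketched as a greedy choice: given a warm container's idle memory, load the artifacts with the highest expected saved load time per megabyte. Artifact names, sizes, and probabilities below are made up, and the OpenWhisk integration, GPU pre-loading, and cluster-wide scheduling of Tyche are not shown.

```python
# Hedged sketch of the opportunistic pre-loading decision: a warm container has some idle
# memory, and the scheduler fills it with the ML artifacts (libraries, model weights) most
# likely to be needed next, so a later invocation skips the load step. All numbers and
# names are fabricated for illustration; this is not the Tyche/OpenWhisk implementation.
def choose_preloads(idle_memory_mb: int, artifacts: list) -> list:
    """Greedy value-density choice: prefer artifacts with high expected benefit per MB.
    Each artifact is (name, size_mb, hit_probability, load_time_s)."""
    ranked = sorted(artifacts,
                    key=lambda a: (a[2] * a[3]) / a[1],   # expected saved seconds per MB
                    reverse=True)
    chosen, used = [], 0
    for name, size, _, _ in ranked:
        if used + size <= idle_memory_mb:
            chosen.append(name)
            used += size
    return chosen

if __name__ == "__main__":
    artifacts = [
        ("torch",         900, 0.90, 6.0),    # big but almost always needed, slow to import
        ("resnet50.pt",   100, 0.70, 1.5),
        ("numpy",          60, 0.95, 0.8),
        ("bert-large.pt", 1300, 0.20, 4.0),
    ]
    print(choose_preloads(idle_memory_mb=1200, artifacts=artifacts))
```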
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 472-488.
Citations: 0
Concurrent and Orthogonal Software Power Meters for Accurate Runtime Energy Profiling of Parallel Hybrid Programs on Heterogeneous Hybrid Servers
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637511
Hafiz Adnan Niaz;Ravi Reddy Manumachu;Alexey Lastovetsky
Energy predictive models employing performance events have emerged as a promising alternative to other mainstream methods for developing software power meters used in runtime energy profiling of applications. These models are cost-effective and provide a highly accurate means of measuring the energy consumption of applications during execution. Recently, software power meters have been proposed to profile the dynamic energy consumption of data transfers between CPU and GPU in heterogeneous hybrid platforms, thereby effectively addressing the gap between software power meters that measure computations and those that measure data transfers. However, the state-of-the-art software power meters lack fundamental properties essential for achieving accurate runtime energy profiling of parallel hybrid programs on heterogeneous hybrid servers. Two critical properties are concurrency and orthogonality. In this work, we define these essential properties and propose a methodology for developing concurrent and orthogonal platform-level software power meters capable of accurate runtime energy profiling of parallel hybrid programs on heterogeneous hybrid servers. We apply this methodology to develop software power meters for three heterogeneous hybrid servers that consist of Intel multicore CPUs and Nvidia GPUs from different generations. Furthermore, we demonstrate the accuracy and efficiency of the proposed software power meters by using them to estimate the dynamic energy consumption of computation and communication activities in three parallel hybrid programs. Our results show that the average prediction error for dynamic energy consumption by these software power meters is just 2.5% across our servers.
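At its simplest, an energy predictive model built from performance events is a linear model fitted against measured energy, which is what the sketch below shows on synthetic data. Counter names, coefficients, and values are fabricated for illustration; the concurrency and orthogonality properties this paper focuses on are not demonstrated here.

```python
# Minimal sketch of an energy predictive model driven by performance events: dynamic energy
# is modeled as a linear function of a few hardware counters and fitted against ground-truth
# measurements. Counter choices, values, and coefficients are synthetic assumptions.
import numpy as np

rng = np.random.default_rng(1)

# Synthetic training runs: columns = [instructions retired, LLC misses, bytes moved over PCIe]
counters = rng.uniform([1e9, 1e6, 1e8], [5e10, 5e8, 5e10], size=(40, 3))
true_coef = np.array([1.2e-9, 4.0e-7, 8.0e-10])          # joules per event (made up)
measured_energy = counters @ true_coef + 5.0 + rng.normal(0, 0.3, 40)   # + static offset + noise

# Fit E ~ b0 + b1*instructions + b2*llc_misses + b3*pcie_bytes with least squares.
A = np.hstack([np.ones((40, 1)), counters])
coef, *_ = np.linalg.lstsq(A, measured_energy, rcond=None)

run = np.array([1.0, 2e10, 1e8, 1e10])                   # counters sampled for a new program run
print("predicted dynamic energy (J):", round(float(run @ coef), 1))
```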
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 2, pp. 322-339.
Citations: 0
MAMILS: A Memory-Aware Multiobjective Scheduler for Real-Time Embedded EEG Depression Diagnosis
IF 6.0 | CAS Tier 2, Computer Science | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-26 | DOI: 10.1109/TPDS.2025.3637175
Fuze Tian;Lixin Zhang;Qi Pan;Jingyu Liu;Qinglin Zhao;Bin Hu
Depression detection using Electroencephalogram (EEG) signals obtained from wearable medical-assisted diagnostic systems has become a well-established approach in the field of affective disorders. However, despite recent advancements, on-board Artificial Intelligence (AI) models still demand substantial computational resources, presenting significant challenges for deployment on resource-constrained wearable medical devices. Embedded Multi-core Processors (MPs) offer a promising solution for accelerating these models. However, the limited computational capabilities of embedded MPs, combined with the structural diversity of AI models, complicate resource allocation and increase associated costs. To address these challenges, we propose a Memory-Aware Multi-Objective Iterative Local Search (MAMILS) algorithm to optimize task scheduling, thereby improving the efficiency of AI model deployment on wearable EEG devices. Experimental results across seven AI models demonstrate that, the MAMILS approach yields substantial improvements in key performance indicators: Total Energy Consumption ($bm {TEC}$) with an average reduction of 47.57%, $bm {Makespan}$ with an average reduction of 48.75%, and $bm {Throughput}$ with an average increase of 198.37%, all while maintaining satisfactory classification performance for both Machine Learning (ML) and Deep Learning (DL) models. Especially, on-board deployment of EEGNeX achieves an accuracy of 93.4%, sensitivity of 91.6%, and specificity of 95.8%. Further analysis indicates that, when integrated with wearable EEG sensors and executable on-board AI models, the proposed MAMILS optimization strategy shows significant promise in facilitating the widespread adoption of low-power, real-time diagnostic systems for depression detection.
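A toy, memory-aware local search in the spirit of the abstract: tasks are reassigned across cores one move at a time, and a move is accepted only if it lowers a weighted makespan-plus-energy objective without exceeding any core's memory budget. The task parameters, weights, and neighborhood are illustrative assumptions, not the MAMILS algorithm itself.

```python
# Toy memory-aware local search: tasks are assigned to cores, and single-task moves are
# accepted when they lower a weighted (makespan + energy) objective while keeping every
# core within its memory budget. All task data, weights, and core factors are made up.
import random

TASKS = [  # (runtime_s, energy_j, memory_mb)
    (4, 8, 120), (3, 5, 200), (6, 12, 80), (2, 3, 150), (5, 9, 60), (1, 2, 220),
]
N_CORES, MEM_LIMIT_MB = 2, 600
CORE_ENERGY_FACTOR = [1.0, 1.6]   # hypothetical: core 1 burns more joules per unit of work
W_TIME, W_ENERGY = 1.0, 0.1       # weights blending makespan and energy into one objective

def cost(assign):
    """Weighted sum of makespan and total energy for a task-to-core assignment."""
    per_core_time = [0.0] * N_CORES
    energy = 0.0
    for task_id, core in enumerate(assign):
        runtime, joules, _ = TASKS[task_id]
        per_core_time[core] += runtime
        energy += joules * CORE_ENERGY_FACTOR[core]
    return W_TIME * max(per_core_time) + W_ENERGY * energy

def feasible(assign):
    """Every core must stay within its memory budget."""
    mem = [0] * N_CORES
    for task_id, core in enumerate(assign):
        mem[core] += TASKS[task_id][2]
    return all(m <= MEM_LIMIT_MB for m in mem)

def local_search(steps=500, seed=0):
    rng = random.Random(seed)
    assign = [i % N_CORES for i in range(len(TASKS))]        # feasible round-robin start
    best = cost(assign)
    for _ in range(steps):
        task, core = rng.randrange(len(TASKS)), rng.randrange(N_CORES)
        candidate = assign.copy()
        candidate[task] = core                               # single-task move
        if feasible(candidate) and cost(candidate) < best:   # accept improving, feasible moves only
            assign, best = candidate, cost(candidate)
    return assign, best

if __name__ == "__main__":
    assignment, objective = local_search()
    print("assignment:", assignment, "objective:", round(objective, 2))
```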
IEEE Transactions on Parallel and Distributed Systems, Vol. 37, No. 3, pp. 600-617.
Citations: 0