
IEEE Transactions on Parallel and Distributed Systems: Latest Publications

S-Leon: An Efficient Split Learning Framework Over Heterogeneous LEO Satellite Networks
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-06 | DOI: 10.1109/TPDS.2025.3629667
Yuxin Zhang;Zhe Chen;Xuanjie Hu;Jin Zhao;Yue Gao
The rapid deployment of low Earth orbit (LEO) satellite systems has propelled various space-based applications (e.g., agricultural monitoring and disaster response), which increasingly rely on advancements in deep learning (DL). However, ground stations (GS) cannot download such massive raw data for centralized training due to intermittent connectivity between satellites and GS, while the scaled-up DL models pose substantial barriers to distributed training on resource-constrained satellites. Although split learning (SL) has emerged as a promising solution to offload major training workloads to GS via model partitioning while retaining raw data on satellites, limited satellite-GS connectivity and heterogeneity of satellite resources remain substantial barriers. In this paper, we propose S-Leon, an SL framework tailored to tackle these challenges within heterogeneous LEO satellite networks. We develop a satellite early-exit model to eliminate training disruptions during non-contact periods and employ online knowledge distillation to incorporate ground knowledge, further enhancing satellite local training. Moreover, we devise a satellite model customization method that simultaneously accommodates the heterogeneous computation and communication capabilities of individual satellites. Lastly, we develop a partial model-agnostic training strategy to optimize the collaborative training effectiveness across customized satellite models. Extensive experiments with real-world LEO satellite networks demonstrate that S-Leon outperforms state-of-the-art benchmarks.
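The abstract only outlines the training flow, so below is a minimal, hypothetical PyTorch sketch of the split-learning-with-early-exit pattern it describes: the satellite keeps a bottom model partition plus a local exit head, trains through that head during non-contact periods, and adds a distillation term against ground-produced logits when a link exists. All module names, dimensions, and the loss weighting are illustrative assumptions, not S-Leon's actual code.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SatelliteClient(nn.Module):
    """Bottom model partition kept on the satellite, plus a local early-exit head."""
    def __init__(self, in_dim=32, hidden=64, num_classes=10):
        super().__init__()
        self.bottom = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.exit_head = nn.Linear(hidden, num_classes)   # hypothetical early-exit classifier

    def forward(self, x):
        smashed = self.bottom(x)            # activations that would be uplinked to the GS
        return smashed, self.exit_head(smashed)

def local_step(client, opt, x, y, gs_logits=None, alpha=0.5, T=2.0):
    """One local update: cross-entropy via the early exit; add distillation when GS logits exist."""
    smashed, logits = client(x)
    loss = F.cross_entropy(logits, y)
    if gs_logits is not None:               # online knowledge distillation during contact periods
        kd = F.kl_div(F.log_softmax(logits / T, dim=1),
                      F.softmax(gs_logits / T, dim=1), reduction="batchmean") * (T * T)
        loss = (1 - alpha) * loss + alpha * kd
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item(), smashed.detach()     # smashed data could be sent up when a link exists

client = SatelliteClient()
opt = torch.optim.SGD(client.parameters(), lr=0.1)
x, y = torch.randn(8, 32), torch.randint(0, 10, (8,))
local_step(client, opt, x, y)                                  # non-contact: early exit only
local_step(client, opt, x, y, gs_logits=torch.randn(8, 10))    # contact: add ground distillation
```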
Citations: 0
FOSS: Learning-Based Multi-Level Design Makes FIFO More Adaptive for CDN Caching
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-03 | DOI: 10.1109/TPDS.2025.3628547
Huiyou Zhan;Haisheng Tan;Xinyue Zhang;Han Tian;Hongqiu Ni;Yongzheng Liang;Changming Bai;Xiang-Yang Li
With the rapid growth of data-intensive applications, such as artificial intelligence and the Internet of Things, CDNs, which use persistent storage (e.g., SSDs and HDDs) to cache data at the edge, have become crucial for enhancing network efficiency. Two metrics, hit ratio and processing latency, are essential for evaluating CDN caching performance. However, CDN caching faces the challenge of write amplification, creating a trade-off between random access for higher hit ratios and sequential writes for reducing processing latency. Existing cache designs struggle to effectively balance these conflicting requirements across diverse workloads. In this paper, we present FOSS, a caching system specifically optimized for CDNs deployed on SSD-based storage and hybrid SSD–HDD storage, which features a streamlined, thin file system that operates independently of the kernel. At its heart, FOSS employs a multi-level FIFO queue to strike a balance between local sequential and global random access on SSDs. Then, FOSS incorporates a learning-based method to dynamically configure the multi-level structure, making the system adaptive to various workload characteristics and caching algorithm requirements. Therefore, FOSS ensures better performance across different scenarios. Our extensive experiments show FOSS improves hit ratios significantly over existing systems, reduces end-to-end response latency by 16.5%, and demonstrates a consistent performance improvement in various settings on large-scale commercial CDN traces.
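As a rough illustration of the multi-level FIFO idea described above (not FOSS's on-SSD implementation), the sketch below keeps several FIFO queues, admits new objects at the lowest level, promotes an object one level on a hit, and demotes the FIFO-evicted object from a higher level instead of dropping it. The level capacities and the promotion/demotion rules are assumptions for illustration only.

```python
from collections import OrderedDict

class MultiLevelFIFO:
    """Toy multi-level FIFO cache; assumes put() is only called after a miss."""
    def __init__(self, level_capacities=(4, 2)):
        # Each level is an insertion-ordered dict used as a FIFO queue: key -> value.
        self.levels = [OrderedDict() for _ in level_capacities]
        self.caps = level_capacities

    def _insert(self, level, key, value):
        q = self.levels[level]
        q[key] = value
        if len(q) > self.caps[level]:
            old_key, old_val = q.popitem(last=False)       # FIFO eviction from the head
            if level > 0:
                self._insert(level - 1, old_key, old_val)  # demote instead of dropping

    def get(self, key):
        for i, q in enumerate(self.levels):
            if key in q:
                value = q.pop(key)
                self._insert(min(i + 1, len(self.levels) - 1), key, value)  # promote on hit
                return value
        return None                                         # miss

    def put(self, key, value):
        self._insert(0, key, value)                         # new objects start at level 0

cache = MultiLevelFIFO()
cache.put("a", 1)
cache.put("b", 2)
cache.get("a")          # hit: "a" is promoted to level 1
```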
Citations: 0
Puffer: A Serverless Platform Based on Vertical Memory Scaling
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-11-03 | DOI: 10.1109/TPDS.2025.3628202
Hao Fan;Kun Wang;Zhuo Huang;Xinmin Zhang;Haibo Mi;Song Wu;Chen Yu
This paper quantitatively analyses the potential of vertically scaling MicroVMs in serverless computing. Our analysis shows that under real-world serverless workloads, vertical scaling can significantly improve execution performance and resource utilization. However, we also find that the memory scaling of MicroVMs is the bottleneck that hinders vertical scaling from reaching the performance ceiling. We propose Faascale, a novel mechanism that efficiently scales the memory of MicroVMs for serverless applications. Faascale employs a series of techniques to tackle this bottleneck: 1) it sizes the memory of a MicroVM up or down in blocks that are bound to a function instance instead of in general pages; and 2) it pre-populates physical memory for function instances to reduce the delays introduced by lazy population. Compared with existing memory scaling mechanisms, Faascale improves memory scaling efficiency by 2 to 3 orders of magnitude. Based on Faascale, we build a serverless platform named Puffer. Experiments conducted on eight serverless benchmark functions demonstrate that compared with horizontal scaling strategies, Puffer reduces the time for cold-starting MicroVMs by 89.01%, improves memory utilization by 17.66%, and decreases function execution time by 23.93% on average.
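A toy sketch of the block-granular scaling idea, as an assumption-laden illustration rather than Faascale's implementation: a memory request from a function instance is rounded up to whole blocks, and a real hotplug path would pre-populate the new physical pages at this point instead of relying on lazy population. The block size and the bookkeeping are hypothetical.

```python
BLOCK_MB = 128   # assumed block granularity

def blocks_needed(requested_mb, block_mb=BLOCK_MB):
    """Round a function instance's memory request up to whole blocks (ceil division)."""
    return -(-requested_mb // block_mb)

class MicroVMMemory:
    def __init__(self):
        self.attached_blocks = 0

    def scale_up(self, requested_mb):
        n = blocks_needed(requested_mb)
        # A real hotplug path would also pre-populate (touch) the new physical pages here
        # so the function does not pay lazy-population faults; this sketch only tracks counts.
        self.attached_blocks += n
        return n * BLOCK_MB

    def scale_down(self, n_blocks):
        self.attached_blocks = max(0, self.attached_blocks - n_blocks)

vm = MicroVMMemory()
print(vm.scale_up(300))   # 300 MB request -> 3 blocks -> 384 MB granted
```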
Citations: 0
Efficient KV Cache Spillover Management on Memory-Constrained GPU for LLM Inference
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-31 | DOI: 10.1109/TPDS.2025.3626974
Jiazhi Jiang;Yao Chen;Zining Zhang;Bingsheng He;Pingyi Luo;Mian Lu;Yuqiang Chen;Hongbing Zhang;Jiangsu Du;Dan Huang;Yutong Lu
The rapid growth of model parameters presents a significant challenge when deploying large generative models on GPUs. Existing LLM runtime memory management solutions tend to maximize batch size to saturate GPU device utilization. Nevertheless, this practice leads to situations where the KV Cache of certain sequences cannot be accommodated on GPUs with limited memory capacity during model inference, requiring temporary eviction from GPU memory (referred to as KV Cache spillover). However, without careful consideration of the LLM inference runtime pattern, current LLM inference memory management solutions face issues such as a one-size-fits-all spillover handling approach across different platforms, under-utilization of the GPU in the prefill stage, and suboptimal sequence selection due to the direct employment of swap or recomputation. In this article, we introduce FuseSpill, a holistic KV Cache management solution designed to boost LLM inference on memory-constrained GPUs by efficiently handling KV Cache spillover. Specifically, FuseSpill consists of a spillover cost model that quantitatively analyzes the system cost of spillover handling techniques, a KV Cache swap orchestrator that refines the basic swap technique into a scheme that disaggregates the KV Cache across heterogeneous devices during decoding iterations, a multi-executor scheduler that effectively coordinates task executors across devices, and a response length predictor that exploits a length-aware sequence selection strategy when KV Cache spillover occurs. The experimental results demonstrate that our implementation outperforms existing solutions, delivering a 20% to 40% increase in throughput while simultaneously reducing the inference latency of spillover sequences.
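The cost model is only named in the abstract, so here is a hedged sketch of what such a model could look like: for each candidate sequence, estimate the time to swap its KV cache over PCIe versus the time to recompute it from the prompt, and evict the sequence whose cheaper option costs the least. The bandwidth and throughput constants and the linear cost formulas are illustrative assumptions, not FuseSpill's model.

```python
from dataclasses import dataclass

PCIE_BYTES_PER_S = 16e9    # assumed effective host<->device bandwidth, bytes/s
GPU_FLOPS = 100e12         # assumed sustained GPU throughput, FLOP/s

@dataclass
class Sequence:
    seq_id: int
    kv_bytes: int           # size of this sequence's KV cache on the GPU
    prefill_flops: float    # FLOPs needed to rebuild the KV cache by recomputation

def spill_cost(seq: Sequence):
    swap = 2 * seq.kv_bytes / PCIE_BYTES_PER_S       # copy out now + copy back on resume
    recompute = seq.prefill_flops / GPU_FLOPS        # rebuild the KV cache instead
    return (swap, "swap") if swap <= recompute else (recompute, "recompute")

def pick_victim(seqs):
    """Choose the sequence whose cheapest spillover option costs the least."""
    return min(seqs, key=lambda s: spill_cost(s)[0])

seqs = [Sequence(0, kv_bytes=2 << 30, prefill_flops=5e13),
        Sequence(1, kv_bytes=256 << 20, prefill_flops=8e14)]
victim = pick_victim(seqs)
print(victim.seq_id, spill_cost(victim))   # sequence 1 is cheapest to spill (swap)
```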
Citations: 0
EDTC: Exact Triangle Counting for Dynamic Graphs on GPU
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-31 | DOI: 10.1109/TPDS.2025.3627974
Zhuo Wang;Jiahao Tang;Zhixiong Li;Jinxing Tu;Wei Xue;Jianqiang Huang
In the process of updating a dynamic graph, an update to one edge may result in the addition or deletion of multiple triangles, while an update to multiple edges may only result in the addition or deletion of a single triangle. Consequently, accurately counting triangles on a dynamic graph is a challenging undertaking. As dynamic graphs are continuously updated, the GPU's memory may be insufficient to accommodate the storage of larger graphs. This presents a challenge when the graph, which is constantly growing, cannot be stored. Hash-based and binary search-based triangle counting algorithms are regarded as the most efficient for static graphs. However, when vertices with high degrees are encountered, the hash-based triangle counting method results in significant memory wastage due to the traditional construction of a hash table, leading to a shortage of memory. This issue remains unresolved. In this article, a triangle counting system, EDTC, is developed for dynamic graphs while ensuring the accuracy of counting. The system addresses three main problems: 1) An efficient EHTC algorithm is introduced to rapidly and accurately count the number of triangles in a graph. 2) The concept of an Update Activation CSR (UA-CSR) is introduced, along with a data structure to facilitate its implementation. This structure loads only the subgraph portion affected by the updated edge into the GPU, allowing calculations to be performed on this specific subgraph. 3) A compressed hash table is designed to reduce memory consumption, along with a dynamic shared memory assignment (DSA) strategy to fully utilize the shared memory of the GPU.
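The exact-update rule behind dynamic triangle counting is simple to state: inserting an edge (u, v) creates |N(u) ∩ N(v)| new triangles, and deleting it removes the triangles counted the same way. The CPU-side, set-based sketch below illustrates only this rule; it does not model EDTC's GPU hash tables, its UA-CSR layout, or its shared-memory strategy.

```python
from collections import defaultdict

class DynamicTriangleCounter:
    def __init__(self):
        self.adj = defaultdict(set)
        self.triangles = 0

    def insert_edge(self, u, v):
        if u == v or v in self.adj[u]:
            return 0
        delta = len(self.adj[u] & self.adj[v])   # common neighbours close new triangles
        self.adj[u].add(v)
        self.adj[v].add(u)
        self.triangles += delta
        return delta

    def delete_edge(self, u, v):
        if v not in self.adj[u]:
            return 0
        self.adj[u].discard(v)
        self.adj[v].discard(u)
        delta = len(self.adj[u] & self.adj[v])   # triangles that contained the removed edge
        self.triangles -= delta
        return delta

tc = DynamicTriangleCounter()
for e in [(0, 1), (1, 2), (0, 2), (2, 3), (0, 3)]:
    tc.insert_edge(*e)
print(tc.triangles)   # 2 triangles: (0, 1, 2) and (0, 2, 3)
```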
Citations: 0
Joint Optimization of Resource Allocation and Request Batching for Multi-Tenant Inference Serving on GPU
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-31 | DOI: 10.1109/TPDS.2025.3627574
Yuning Zhang;Nan Yang;Chen Pan;Dong Yuan
GPU technology has significantly aided Deep Learning (DL), especially in enhancing the performance of inference services. Tenants deploy inference models on the GPU, which are then uniformly scheduled and executed by an inference serving system. In resource-constrained environments, a single GPU needs to handle requests from multiple tenants. The diversity of inference tasks, varying request frequencies, and different model architectures make designing an efficient inference serving system a significant challenge. Most current research discusses resource allocation and request batching separately, overlooking the critical connection between them. In such complex inference environments, this connection is particularly crucial. To rapidly process requests from various tenants in such a dynamic environment, we leverage the connection between resource allocation and request batching to design DRS: Deep Reinforcement Scheduler. In DRS, we use the Deep Deterministic Policy Gradient (DDPG) as our scheduling algorithm and NVIDIA Multi-Process Service (MPS) for spatial parallelism in sharing a single GPU among multiple tenants. By observing environmental information, we can rapidly adjust the GPU allocation for different tenants and find the proper request batch size, thereby maintaining high efficiency. In experiments, DRS achieves speedups of up to 2.23× and 24× over the baselines on the Makespan and Job Completion Time (JCT) metrics, respectively.
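To make the joint action concrete, the sketch below shows one way a continuous policy output could drive both knobs at once: an actor vector in [-1, 1] is decoded into per-tenant GPU shares (in the style of MPS active-thread percentages) and per-tenant request batch sizes. The decoding, bounds, and softmax normalization are assumptions; the abstract does not specify DRS's state or action design.

```python
import numpy as np

MAX_BATCH = 32   # assumed upper bound on per-tenant batch size

def decode_action(action, num_tenants):
    """action: vector in [-1, 1]^(2 * num_tenants) from an actor network (e.g., DDPG)."""
    a = np.asarray(action).reshape(2, num_tenants)
    shares = np.exp(a[0]) / np.exp(a[0]).sum()             # softmax -> GPU shares summing to 1
    gpu_pct = np.maximum(1, np.round(100 * shares))        # MPS-style percentage per tenant
    batch = np.clip(np.round((a[1] + 1) / 2 * MAX_BATCH),  # map [-1, 1] -> [1, MAX_BATCH]
                    1, MAX_BATCH)
    return gpu_pct.astype(int), batch.astype(int)

rng = np.random.default_rng(0)
gpu_pct, batch = decode_action(rng.uniform(-1, 1, size=6), num_tenants=3)
print(gpu_pct, batch)
```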
Citations: 0
RPCE: Dynamic Data Replicas Placement Management by Cloud and Edge Collaboration
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-31 | DOI: 10.1109/TPDS.2025.3627553
Xiaofeng Lu;Luwen Zou;Yilu Mao;Pietro Lio;Pan Hui
With the rapid advancement of information technology, traditional centralized cloud computing systems face challenges in meeting the stringent low-latency demands of emerging applications. To tackle this issue, this paper proposes a delay-aware cloud-edge architecture that incorporates the distributed characteristics of edge infrastructure, enabling low-latency collaboration among edge nodes within the same geographic region. Furthermore, based on this architecture, a dynamic data replica management scheme is introduced, involving synergistic mechanisms between edge nodes and cloud centers to optimally place data replicas on the most suitable edge nodes. The scheme adopts a hierarchical strategy: edge nodes perform short-term localized management of data replicas, while the cloud executes long-term holistic oversight. Experimental results demonstrate that the dynamic approach effectively reduces user access latency, minimizes replica migration frequency, and decreases network bandwidth consumption.
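A hedged sketch of the kind of latency-aware placement decision such a scheme has to make: greedily place up to k replicas on the edge nodes that most reduce request-weighted access latency. The greedy rule and the toy latency matrix are illustrative; the paper's split between short-term edge management and long-term cloud oversight is not modelled here.

```python
def place_replicas(latency, demand, k):
    """latency[u][n]: latency from user region u to edge node n; demand[u]: request rate."""
    nodes = range(len(latency[0]))
    chosen = []
    best_lat = [float("inf")] * len(demand)   # per-region latency with the current replicas

    for _ in range(k):
        def total_cost(extra):
            # total request-weighted latency if node `extra` also held a replica
            return sum(d * min(best_lat[u], latency[u][extra]) for u, d in enumerate(demand))

        nxt = min((n for n in nodes if n not in chosen), key=total_cost)
        chosen.append(nxt)
        best_lat = [min(best_lat[u], latency[u][nxt]) for u in range(len(demand))]
    return chosen

latency = [[5, 20, 40],    # region 0
           [25, 8, 30],    # region 1
           [45, 28, 6]]    # region 2
print(place_replicas(latency, demand=[100, 50, 80], k=2))   # -> [1, 2] for this toy input
```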
Citations: 0
Lightweight Application Distribution With Automated and Real-Time Computing and Communication (ARC2) in Microcomputer Clusters
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-28 | DOI: 10.1109/TPDS.2025.3626327
Jianchun Luo;Zhongjia Wang;Fei Peng;Xuejun Yu;Dongsheng Wei;Bo Liu;Guoqi Xie
The microcomputer cluster is a group of connected microcomputers that work together to perform as a single system. Unlike high-performance computer clusters, microcomputer clusters are designed to provide reliable and efficient services for safety-critical embedded systems, which usually require low SWaP (Size, Weight, and Power) because of the high stability and cost control requirements. Considering that safety-critical systems have strict real-time constraints (i.e., deadline constraints) and resource constraints, each microcomputer usually needs to run a Real-Time Operating System (RTOS) instead of Linux to achieve precise scheduling and control, and a high-speed real-time network such as Time-Triggered Ethernet (TTE) is required for intra-cluster communication. In microcomputer clusters, a load imbalance between microcomputers usually leads to system instability, and a lightweight application distribution framework automatically migrates applications among microcomputers, thereby breaking resource isolation and improving resource utilization. However, mainstream application distribution frameworks, such as Kubernetes (K8s), MicroK8s, and K3s, can be applied neither to RTOS nor to TTE. In this study, we design a lightweight application distribution framework with automated and real-time computing and communication (ARC2). ARC2 monitors the microcomputer cluster resource state in real-time and introduces a resource hierarchical pooling method to utilize cluster resources flexibly. It employs TTE for application distribution combined with a real-time scheduling strategy, achieving low end-to-end latency and load balancing. It simplifies the existing application distribution framework and introduces a low-complexity cluster management logic to achieve low resource overhead. We conduct experimental evaluations on a heterogeneous platform. The results show that: (1) the load imbalance is reduced by at least 59.81% compared to the original system; (2) the deviation in real-time monitoring traffic is reduced by an average of 56.7 ms, with the application distribution success rate reaching 100% and an average distribution time of 393.0 ms; and (3) the CPU, memory, and bandwidth overhead are 9%, 3 MB, and 0.104 Mb/s, respectively.
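To make the load-balancing part concrete, the sketch below shows one simple way such a framework could decide on a migration: compute a cluster imbalance metric and, when it exceeds a threshold, move one application from the most loaded microcomputer to the least loaded one. The metric, threshold, and selection rule are illustrative assumptions, not the ARC2 algorithm.

```python
def imbalance(loads):
    """Relative gap between the hottest node and the cluster average."""
    avg = sum(loads.values()) / len(loads)
    return (max(loads.values()) - avg) / avg if avg else 0.0

def pick_migration(loads, apps, threshold=0.25):
    """loads: node -> CPU load; apps: node -> {app: load contribution}."""
    if imbalance(loads) < threshold:
        return None
    src = max(loads, key=loads.get)
    dst = min(loads, key=loads.get)
    gap = (loads[src] - loads[dst]) / 2
    # migrate the app whose load contribution is closest to half the load gap
    app = min(apps[src], key=lambda a: abs(apps[src][a] - gap))
    return app, src, dst

loads = {"mcu1": 0.9, "mcu2": 0.4, "mcu3": 0.35}
apps = {"mcu1": {"camera": 0.3, "logger": 0.1, "fusion": 0.5}, "mcu2": {}, "mcu3": {}}
print(pick_migration(loads, apps))   # ('camera', 'mcu1', 'mcu3')
```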
Citations: 0
MCG-Sched: Multi-Cluster GPU Scheduling for Resource Fragmentation Reduction and Load Balancing
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-28 | DOI: 10.1109/TPDS.2025.3626153
Haijie Wu;Xinhua Wang;Xiaoxuan Luo;Wangbo Shen;Weiwei Lin
With the rapid development of deep learning (DL) technology, large-scale GPU clusters receive a large number of DL workloads daily. To speed up completion times, workloads usually occupy several GPUs on a server. However, workload scheduling inevitably generates resource fragmentation, which leaves many scattered GPU resources unavailable. Existing works improve resource utilization by reducing GPU resource fragmentation, but they focus on resource scheduling for a single cluster and ignore multiple clusters. Multi-cluster scenarios, such as virtual clusters and geo-distributed clusters, require load balancing alongside improved resource utilization, so that some clusters do not exhaust their resources while others sit idle; existing works do not address this well. In this paper, we propose MCG-Sched, a scheduling strategy that reduces resource fragmentation in multiple GPU clusters while maintaining load balancing among clusters. MCG-Sched measures fragmented resources using the distribution of workload demands and uses a scheme that minimizes fragmentation in workload scheduling. Meanwhile, MCG-Sched achieves balanced load scheduling across clusters through a load balancing index. MCG-Sched senses the workload requests in the waiting queue and prioritizes them by combining the fragmentation measurement and the load balancing index, maximizing resource utilization and load balancing during load peaks. Our experiments show that MCG-Sched reduces unallocated GPUs by up to 1.45× and workload waiting time by more than 40% compared to existing fragmentation-aware methods, and achieves effective load balancing.
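The following is an assumption-heavy sketch of combining a fragmentation measure with a load-balancing index when placing a multi-GPU job: each feasible server gets a score made of the GPUs it would leave stranded plus the spread of cluster utilizations after placement, and the lowest score wins. The particular score is illustrative, not the MCG-Sched formula.

```python
import statistics

def place(job_gpus, clusters, w_frag=1.0, w_balance=1.0):
    """clusters: {name: {"free_per_server": [...], "used": int, "total": int}}."""
    best = None
    names = list(clusters)
    for cname, c in clusters.items():
        for sid, free in enumerate(c["free_per_server"]):
            if free < job_gpus:
                continue
            frag = free - job_gpus                              # GPUs stranded on this server
            utils = [cl["used"] / cl["total"] for cl in clusters.values()]
            utils[names.index(cname)] = (c["used"] + job_gpus) / c["total"]
            balance = statistics.pstdev(utils)                  # utilization spread after placing
            score = w_frag * frag + w_balance * balance
            if best is None or score < best[0]:
                best = (score, cname, sid)
    return best

clusters = {
    "A": {"free_per_server": [3, 8], "used": 5, "total": 16},
    "B": {"free_per_server": [2, 2], "used": 12, "total": 16},
}
print(place(2, clusters))   # -> the zero-fragment server in cluster "B" wins here (score ~0.28)
```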
Citations: 0
Scalable Hybrid Learning Techniques for Scientific Data Compression
IF 6.0 | CAS Zone 2 (Computer Science) | Q1 COMPUTER SCIENCE, THEORY & METHODS | Pub Date: 2025-10-28 | DOI: 10.1109/TPDS.2025.3623935
Tania Banerjee;Jong Choi;Jaemoon Lee;Qian Gong;Jieyang Chen;Scott Klasky;Anand Rangarajan;Sanjay Ranka
Data compression is becoming critical for storing scientific data because many scientific applications need to store large amounts of data and post-process this data for scientific discovery. Unlike image and video compression algorithms that limit errors to primary data (PD), scientists require compression techniques that accurately preserve derived quantities of interest (QoIs). This article presents a physics-informed compression technique implemented as an end-to-end, scalable, GPU-based pipeline for data compression that addresses this requirement. Our hybrid compression technique combines machine learning techniques and standard compression methods. Specifically, we combine an autoencoder, an error-bounded lossy compressor to provide guarantees on raw data error, and a constraint satisfaction post-processing step to preserve the QoIs within a minimal error (generally less than floating point error). The effectiveness of the data compression pipeline is demonstrated by compressing nuclear fusion simulation data generated by a large-scale fusion code, XGC, which produces hundreds of terabytes of data in a single day. Our approach works within the ADIOS framework and results in compression by a factor of more than 150 while requiring only a few percent of the computational resources necessary for generating the data, making the overall approach highly effective for practical scenarios.
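A small numpy sketch of the three-stage pipeline the abstract outlines, with a low-rank SVD approximation standing in for the autoencoder, a uniform quantizer standing in for the error-bounded residual compressor, and a final correction that restores a conserved quantity of interest (here, the global sum). The stand-ins and the choice of QoI are assumptions, not the paper's actual components.

```python
import numpy as np

def compress_decompress(x, rank=4, err_bound=1e-3):
    # Stage 1: low-rank "autoencoder" approximation of the primary data.
    u, s, vt = np.linalg.svd(x, full_matrices=False)
    approx = (u[:, :rank] * s[:rank]) @ vt[:rank]

    # Stage 2: error-bounded residual coding; uniform quantization with step 2*err_bound
    # keeps the pointwise reconstruction error within err_bound before correction.
    residual = x - approx
    q = np.round(residual / (2 * err_bound)).astype(np.int32)   # what would be stored/encoded
    recon = approx + q * (2 * err_bound)

    # Stage 3: constraint-satisfaction post-processing, restoring the QoI (global sum) by
    # spreading the small correction uniformly across all points.
    recon += (x.sum() - recon.sum()) / recon.size
    return recon

rng = np.random.default_rng(0)
x = rng.normal(size=(64, 32))
y = compress_decompress(x)
print(np.abs(x - y).max(), abs(x.sum() - y.sum()))   # tiny pointwise error; sum matches to float precision
```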
Citations: 0