Graph Neural Networks (GNNs) require high-capacity, low-latency memory systems to process large graphs. A hierarchical hybrid memory architecture combining high-capacity Non-Volatile Memory (NVM) and low-latency DRAM offers a promising solution. However, the inherent sparsity of graph data results in poor locality for GNN memory requests, leading to low DRAM cache hit rates and numerous misses, which significantly impairs the hybrid memory system's performance. A critical issue is that DRAM misses in serial access mode incur substantial latency, and while parallel access mode can mitigate this for misses, it introduces long-tail latency and wastes bandwidth for DRAM hits. In this paper, we address these issues from two aspects: increasing the cache hit rate and decreasing the miss latency. We propose two predictors: a future data access predictor that enables accurate prefetching to DRAM, thereby improving cache hit rates, and a data location predictor that determines whether data resides in DRAM or NVM, optimizing the choice between serial and parallel access modes to reduce miss latency. By integrating these predictors, we achieve efficient data access in both DRAM and NVM. Our experiments show a 49.5% reduction in memory delay and a 38.1% increase in memory bandwidth utilization compared to the baseline.
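To make the serial-versus-parallel decision concrete, the following Python sketch shows one simple way a data-location predictor could be realized with per-page saturating counters; the counter scheme, page size, and all names are illustrative assumptions, not the predictor design proposed in the paper.

```python
# Minimal sketch (not the paper's actual design): a toy "data location
# predictor" that guesses whether an address is DRAM-resident and picks
# the serial or parallel access mode accordingly.

class ToyLocationPredictor:
    def __init__(self, num_counters=1024):
        # 2-bit saturating counters indexed by a hash of the page address.
        self.counters = [2] * num_counters  # start weakly "in DRAM"
        self.num_counters = num_counters

    def _index(self, addr, page_size=4096):
        return (addr // page_size) % self.num_counters

    def predict_in_dram(self, addr):
        return self.counters[self._index(addr)] >= 2

    def update(self, addr, was_dram_hit):
        # Train the counter with the observed outcome of the access.
        i = self._index(addr)
        if was_dram_hit:
            self.counters[i] = min(3, self.counters[i] + 1)
        else:
            self.counters[i] = max(0, self.counters[i] - 1)


def issue_request(addr, predictor):
    # Predicted DRAM hit -> serial access (probe DRAM first, NVM only on miss).
    # Predicted miss -> parallel access (probe DRAM and NVM together).
    return "serial" if predictor.predict_in_dram(addr) else "parallel"


pred = ToyLocationPredictor()
print(issue_request(0x1000, pred))   # "serial" until misses are observed
pred.update(0x1000, was_dram_hit=False)
pred.update(0x1000, was_dram_hit=False)
print(issue_request(0x1000, pred))   # now "parallel"
```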
{"title":"Latency Optimization in Hybrid Memory System for GNNs","authors":"Zhaoyang Zeng;Yujuan Tan;Wei Chen;Jiali Li;Zhuoxin Bai;Ao Ren;Duo Liu;Xianzhang Chen","doi":"10.1109/TC.2025.3648646","DOIUrl":"https://doi.org/10.1109/TC.2025.3648646","url":null,"abstract":"Graph Neural Networks (GNNs) require high-capacity, low-latency memory systems to process large graphs. A hierarchical hybrid memory architecture combining high-capacity Non-Volatile Memory (NVM) and low-latency DRAM offers a promising solution. However, the inherent sparsity of graph data results in poor locality for GNN memory requests, leading to low DRAM cache hit rates and numerous misses, which significantly impairs the hybrid memory system’s performance. A critical issue is that DRAM misses in serial access mode incur substantial latency. While parallel access mode can mitigate this for misses, it introduces long-tail latency and wastes bandwidth for DRAM hits. In this paper, we focus on addressing these issues from two aspects: increasing the cache hit rate and decreasing the miss latency. We mainly propose two predictors: a future data access predictor that enables accurate prefetching to DRAM, thereby improving cache hit rates, and a data location predictor that determines whether data resides in DRAM or NVM, optimizing the choice between serial and parallel access modes to reduce miss latency. By integrating these predictors, we achieve efficient data access in both DRAM and NVM. Our experiments show a 49.5% reduction in memory delay and a 38.1% increase in memory bandwidth utilization compared to baseline.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1183-1196"},"PeriodicalIF":3.8,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
For image-related deep learning tasks, the first step often involves reading data from external storage and preprocessing it on the CPU. As accelerators become faster and the number of accelerators per compute node grows, the gap in compute and data-transfer capability between accelerators and CPUs widens, and data reading and preprocessing progressively become the bottleneck of these tasks. Our work, DDLP, addresses the data computing and transfer bottleneck of deep learning preprocessing using Computable Storage Devices (CSDs). DDLP allows the CPU and CSD to preprocess in parallel from opposite ends of the dataset. To this end, we propose two adaptive dynamic selection strategies that let DDLP direct the accelerator to automatically read data from different sources; the two strategies trade off consistency against efficiency. DDLP achieves substantial computational overlap between CSD preprocessing, CPU preprocessing, accelerator computation, and accelerator data reading. In addition, DDLP leverages direct storage technology to enable efficient SSD-to-accelerator data transfer, and it replaces expensive CPU and DRAM resources with more energy-efficient CSDs, alleviating preprocessing bottlenecks while significantly reducing power consumption. Extensive experimental results show that DDLP improves learning speed by up to 23.5% on the ImageNet dataset while reducing energy consumption by 19.7% and CPU and DRAM usage by 37.6%; it also improves learning speed by up to 27.6% on the CIFAR-10 dataset.
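The "preprocess from both ends" idea can be illustrated with a small Python sketch in which a CPU worker and a CSD worker consume sample indices from opposite ends of the dataset until their cursors meet; the threading model and function names are hypothetical and stand in for DDLP's actual CPU/CSD coordination.

```python
# Illustrative sketch of dual-pronged preprocessing (not DDLP's code):
# one worker takes samples from the front of the index range, the other
# from the back, and both stop when the two cursors cross.
import threading

def dual_pronged_preprocess(num_samples, cpu_fn, csd_fn):
    lock = threading.Lock()
    front, back = 0, num_samples - 1
    results = {}

    def worker(take_front, fn):
        nonlocal front, back
        while True:
            with lock:
                if front > back:
                    return
                if take_front:
                    i, front = front, front + 1
                else:
                    i, back = back, back - 1
            results[i] = fn(i)  # preprocess sample i on the CPU or on the CSD

    t_cpu = threading.Thread(target=worker, args=(True, cpu_fn))
    t_csd = threading.Thread(target=worker, args=(False, csd_fn))
    t_cpu.start(); t_csd.start(); t_cpu.join(); t_csd.join()
    return results

# Toy usage: "preprocessing" just tags which side handled each sample.
out = dual_pronged_preprocess(10, lambda i: ("cpu", i), lambda i: ("csd", i))
print(out)
```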
{"title":"Dual-Pronged Deep Learning Preprocessing on Heterogeneous Platforms With CPU, Accelerator and CSD","authors":"Jia Wei;Xingjun Zhang;Witold Pedrycz;Longxiang Wang;Jie Zhao","doi":"10.1109/TC.2025.3649209","DOIUrl":"https://doi.org/10.1109/TC.2025.3649209","url":null,"abstract":"For image-related deep learning tasks, the first step often involves reading data from external storage and performing preprocessing on the CPU. As accelerator speed increases and the number of single compute node accelerators increases, the computing and data transfer capabilities gap between accelerators and CPUs gradually increases. Data reading and preprocessing become progressively the bottleneck of these tasks. Our work, DDLP, addresses the data computing and transfer bottleneck of deep learning preprocessing using Computable Storage Devices (CSDs). DDLP allows the CPU and CSD to efficiently parallelize preprocessing from both ends of the datasets, respectively. To this end, we propose two adaptive dynamic selection strategies to make DDLP control the accelerator to automatically read data from different sources. The two strategies trade-off between consistency and efficiency. DDLP achieves sufficient computational overlap between CSD data preprocessing and CPU preprocessing, accelerator computation, and accelerator data reading. In addition, DDLP leverages direct storage technology to enable efficient SSD-to-accelerator data transfer. In addition, DDLP reduces the use of expensive CPU and DRAM resources with more energy-efficient CSDs, alleviating preprocessing bottlenecks while significantly reducing power consumption. Extensive experimental results show that DDLP can improve learning speed by up to 23.5% on ImageNet Dataset while reducing energy consumption by 19.7% and CPU and DRAM usage by 37.6%. DDLP also improves the learning speed by up to 27.6% on the Cifar-10 dataset.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1209-1223"},"PeriodicalIF":3.8,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154445","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this work, we present a single-event upset (SEU) analysis for Forksheet FET (FSFET) based CMOS circuits. We then present an array-level power and performance analysis along with a Vmin evaluation for the FSFET-based SRAM. Physics-based TCAD and industry-standard BSIM-CMG compact models are calibrated for accurate circuit analysis in SPICE. The impact of varying heavy-ion radiation (HIR) doses and strike orientations is investigated for the FSFETs. The robustness of the CMOS inverter against HIR is also reported in terms of failure time (t$_{fail}$) and output voltage swing ($\Delta$V$_{Drop}$). For the SRAM, we determine the critical linear energy transfer (LET). For the FSFET, the individual n-/p-FETs are more vulnerable to irradiation incident on nearby devices. At the circuit level, compared with perpendicular strikes, $\Delta$V$_{Drop}$ increases by 1.25 V and 2.75 V for oblique and transverse incidences, respectively, at a dose of 2.0 MeV cm$^2$/mg; t$_{fail}$ also increases by 43% and 60%, and the SRAM critical LET decreases by 85% and 57.5%, respectively. The array-level SRAM evaluation shows that the FSFET enables reliable operation with low power consumption, impressive noise margins, and low minimum operating voltage (Vmin) values. FSFET SRAM power dissipation during read and write operations is as low as 7.02 $\mu$W and 3.00 $\mu$W, respectively. At V$_{DD}$ = 0.70 V, the noise margins for hold, read, and write operations are 289.27 mV, 122.89 mV, and 297.79 mV, and the Vmin values for read and write operations are 0.30 V and 0.35 V, respectively.
{"title":"Evaluation of Radiation Resilience, Performance, and Vmin of Sub-3 nm FSFET Based SRAM Arrays","authors":"Hafeez Raza;Mahdi Benkhelifa;Koshal Kumar;Shivendra Singh Parihar;Yogesh Singh Chauhan;Hussam Amrouch;Avinash Lahgere","doi":"10.1109/TC.2025.3649150","DOIUrl":"https://doi.org/10.1109/TC.2025.3649150","url":null,"abstract":"In this work, we present single-event upset (SEU) analysis for Forksheet FET (FSFET) based CMOS circuits. Next, we present an array-level power and performance analysis along with the V<sub>min</sub> evaluation for the FSFET-based SRAM. Physics-based TCAD and industry-standard BSIM-CMG compact models are calibrated for accurate circuit analysis in SPICE. The impact of varying Heavy-Ion Radiation (HIR) doses and strike orientations is investigated for the FSFETs. The robustness of CMOS inverter against HIR is also reported in terms of failure time (t<sub>fail</sub>) and output voltage swing (<inline-formula><tex-math>$Delta$</tex-math></inline-formula>V<sub>Drop</sub>). For the SRAM, we determine the critical Linear Energy Transfer (LET). For FSFET, the individual n-/p-FETs are more vulnerable to the irradiation incident on nearby devices. At the circuit level, in comparison to perpendicular strikes, the <inline-formula><tex-math>$Delta$</tex-math></inline-formula>V<sub>Drop</sub> increases by 1.25<roman> </roman>V and 2.75<roman> </roman>V respectively, for oblique and transverse incidences, at a dose of 2.0<roman> </roman>MeVcm<inline-formula><tex-math>$^2$</tex-math></inline-formula>/mg. The t<sub>fail</sub> also increases by 43% and 60% and the SRAM critical LET also decreases by 85% and 57.5%, respectively. The array level SRAM evaluation shows that the FSFET enables reliable operation with low-power consumption, impressive noise margins, and low minimum-operating voltage (V<sub>min</sub>) values. FSFET SRAM power dissipation during the read and write operations is as low as 7.02<roman> </roman><inline-formula><tex-math>$mu$</tex-math></inline-formula>W, and 3.00<roman> </roman><inline-formula><tex-math>$mu$</tex-math></inline-formula>W respectively. At V<sub>DD</sub><inline-formula><tex-math>$=$</tex-math></inline-formula>0.70<roman> </roman>V, the noise margins for hold, read, and write operations are 289.27<roman> </roman>mV, 122.89<roman> </roman>mV, and 297.79<roman> </roman>mV. The V<sub>min</sub> for read and write operations are 0.30<roman> </roman>V and 0.35<roman> </roman>V respectively.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1197-1208"},"PeriodicalIF":3.8,"publicationDate":"2025-12-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154432","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In the era of Deep Learning, hardware acceleration has become essential for meeting the immense computational demands of modern applications. In many Machine Learning applications, Generalized Matrix Multiplication (GEMM) built on dot products is a ubiquitous and computationally intensive operation. This paper introduces two innovative microarchitectures for executing a fused FP8 $m$-way dot product with dynamic range scaling and FP32 accumulation. Both microarchitectures have been synthesized in a 3 nm technology node at 3.6 GHz and were designed to deliver power- and area-efficient performance, targeting a 4-cycle latency for $m = 4, 8$ and 5+ cycles for larger $m$ values. The first design, termed dot product with late accumulation, computes the dot product in the first cycles, expanding intermediate products to a fixed-point format (2 cycles for $m = 4, 8$ and 3+ cycles for $m > 8$), before using an additional two cycles for accumulation. This approach enables the reuse of a modified, FMA-capable FP32 adder. The second design, dot product with early accumulation, employs a dedicated FP8 datapath that computes the FP8 sum of products while concurrently aligning the FP32 accumulator, followed by the addition of the significands (2 cycles for $m = 4, 8$ and 3+ cycles for $m > 8$) and then two cycles for normalization and a single rounding operation. This design aligns the addends (products and accumulator) relative to an "anchor" for efficient, arithmetically fused, $m$-way FP8 dot product computation. Comparative analysis with previous proposals reveals that, despite challenges in establishing a fair comparison, our designs achieve significant area savings.
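A purely numerical Python sketch of the operation being accelerated (not the hardware datapath) may help: inputs are quantized to a simplified E4M3-style FP8 grid, the $m$ products are summed exactly, and a single rounding is applied when the scaled result is folded into the FP32 accumulator. The quantizer below is an assumption-laden model that omits NaN, saturation, and subnormal handling.

```python
# Behavioural sketch of an m-way FP8 dot product with scaling and FP32
# accumulation; the single np.float32() cast at the end models the
# "fused" property of one rounding for the whole sum.
import math
import numpy as np

def quantize_fp8_e4m3(x, mantissa_bits=3, min_exp=-6, max_exp=8):
    # Simplified E4M3-like rounding: snap x to the nearest representable
    # value in its binade (no NaN/inf/saturation handling).
    if x == 0.0:
        return 0.0
    e = max(min(math.floor(math.log2(abs(x))), max_exp), min_exp)
    step = 2.0 ** (e - mantissa_bits)      # spacing of the grid at exponent e
    return round(x / step) * step

def fused_dot_fp8_fp32(a, b, scale, acc):
    qa = [quantize_fp8_e4m3(x) for x in a]
    qb = [quantize_fp8_e4m3(x) for x in b]
    exact = math.fsum(x * y for x, y in zip(qa, qb))   # products kept exact
    # One rounding at the end: scale, add the accumulator, round to FP32.
    return np.float32(exact * scale + float(acc))

acc = np.float32(0.0)
acc = fused_dot_fp8_fp32([0.5, -1.25, 3.0, 0.1], [2.0, 0.75, -0.5, 4.0],
                         scale=0.25, acc=acc)
print(acc)
```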
{"title":"Fused FP8 Many-Terms Dot Product With Scaling and FP32 Accumulation","authors":"David R. Lutz;Anisha Saini;Mairin Kroes;Thomas Elmer;Harsha Valsaraju;Javier D. Bruguera","doi":"10.1109/TC.2025.3648544","DOIUrl":"https://doi.org/10.1109/TC.2025.3648544","url":null,"abstract":"In the era of Deep Learning, hardware acceleration has become essential for meeting the immense computational demands of modern applications. In many Machine Learning applications, Generalized Matrix Multiplication (GEMM) with dot product is an ubiquitous and computationally intensive operation. This paper introduces two innovative microarchitectures for executing a fused FP8 <inline-formula><tex-math>$m$</tex-math></inline-formula>‐way dot product with dynamic range scaling and FP32 accumulation. Both microarchitectures have been synthesized in a 3 nm technology node at 3.6 GHz, and were designed to deliver power- and area-efficient performance, targeting a 4-cycle latency for <inline-formula><tex-math>$m = 4,8$</tex-math></inline-formula> and 5+ cycles for larger <inline-formula><tex-math>$m$</tex-math></inline-formula> values. The first design – termed <i>dot product with late accumulation</i> – computes the dot product in the first cycles, then expands intermediate products to a fixed-point format (2 cycles for <inline-formula><tex-math>$m = 4, 8$</tex-math></inline-formula> and 3+ cycles for <inline-formula><tex-math>$m gt 8$</tex-math></inline-formula>), before using an additional two cycles for accumulation. This approach enables the reuse of a modified, FMA-capable FP32 adder. The second design – <i>dot product with early accumulation</i> – employs a dedicated FP8 datapath that concurrently computes the FP8 sum-of-products while aligning the FP32 accumulator, followed by the addition of the significands (2 cycles for <inline-formula><tex-math>$m = 4, 8$</tex-math></inline-formula> and 3+ cycles for <inline-formula><tex-math>$m gt 8$</tex-math></inline-formula>). This is then followed by two cycles for normalization and a single rounding operation. This design aligns addends (products and accumulator) from an “anchor” for efficient, arithmetically fused, <inline-formula><tex-math>$m$</tex-math></inline-formula>-way FP8 dot product computation. Comparative analysis with previous proposals reveals that, despite challenges in establishing a fair comparison, our designs achieve significant area savings.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1171-1182"},"PeriodicalIF":3.8,"publicationDate":"2025-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154450","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The popularity of cloud storage makes it easy to share data with others, for example to access and edit electronic medical records. To control such editing, sanitizable signatures were developed to allow designated editors to modify restricted parts of the data with sanitizing keys. In particular, attribute-based sanitizable signatures (ABSS) have been explored for fine-grained editing control, i.e., editors are determined by the owner through a policy. However, existing ABSS schemes can neither protect the anonymity of the owner, because signatures are verified under the owner's public key, nor protect data confidentiality, because data is stored in plaintext on the cloud server. Moreover, they do not consider the sanitizing-key exposure issue. To this end, we propose FSEdit, a privacy-preserving and security-enhanced controllable editing framework for cloud storage. Specifically, we introduce an attribute-based sanitizable and puncturable signed encryption (AB-SPSE) primitive for FSEdit, where encrypted data can be anonymously verified via the owner's attributes, and its admissible blocks can be modified by policy-authorized editors. Meanwhile, the editors' sanitizing keys can be further punctured to guarantee forward secrecy. We design two novel building blocks, namely an attribute-based equivalence-class signature and an attribute-based puncturable combined encryption and signature scheme, and then construct FSEdit by leveraging them to instantiate AB-SPSE in asymmetric pairings. Finally, we show the security and efficiency of FSEdit through extensive security analysis and experimental comparisons with existing ABSS schemes.
{"title":"FSEdit: Privacy-Preserving and Security-Enhanced Controllable Editing Framework for Cloud Storage","authors":"Qinlong Huang;Caiqun Shi;Xiyu Liang","doi":"10.1109/TC.2025.3648284","DOIUrl":"https://doi.org/10.1109/TC.2025.3648284","url":null,"abstract":"The popularity of cloud storage makes it easy to share others’ data, such as accessing and editing electronic medical records. To control the editing, sanitizable signatures were developed to allow designated editors to modify restricted parts of the data with sanitizing keys. In particular, attribute-based sanitizable signature (ABSS) has been explored for fine-grained editing control, i.e., editors are determined by the owner through a policy. However, existing ABSS schemes can neither protect the anonymity of the owner due to signature verification under the owner’s public key, nor protect the data confidentiality due to the plaintext storage in the cloud server. Moreover, they do not consider the sanitizing key exposure issue. To this end, we propose FSEdit, a privacy-preserving and security-enhanced controllable editing framework for cloud storage. Specifically, we introduce an attribute-based sanitizable and puncturable signed encryption (AB-SPSE) primitive for FSEdit, where encrypted data can be anonymously verified via the owner’s attributes, and its admissible blocks can be modified by policy-authorized editors. Meanwhile, the editors’ sanitizing keys can be further punctured to guarantee forward secrecy. We design two novel building blocks, namely attribute-based equivalence-class signature, and attribute-based puncturable combined encryption and signature, and then construct FSEdit by leveraging them to instantiate AB-SPSE in asymmetric pairings. Finally, we show the security and efficiency of FSEdit through extensive security analysis and experimental results over existing ABSS schemes.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1156-1170"},"PeriodicalIF":3.8,"publicationDate":"2025-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154460","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm-hardware co-design tailored for Transformer inference under LTPP. STAR introduces leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted-updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, which achieves up to $9.2\times$ speedup and $71.2\times$ better energy efficiency than an A100, and surpasses state-of-the-art accelerators by up to $16.1\times$ in energy efficiency and $27.1\times$ in area efficiency. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a $20.1\times$ throughput improvement.
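The flavor of a log-domain, add-only importance predictor can be sketched as follows; this toy version uses element-wise floating-point exponents as a leading-zero-style magnitude proxy and is only illustrative, since STAR's actual prediction, thresholds, and tiling differ.

```python
# Toy add-only sparsity prediction for attention: approximate the
# magnitude of each q.k score from exponents alone, so the predictor
# needs additions and comparisons but no multiplications.
import math

def exp_of(x):
    # Integer exponent of |x| (a proxy for the leading-one position).
    return math.frexp(abs(x))[1] if x != 0.0 else -126

def predict_topk_keys(q, keys, k):
    scores = []
    for idx, key in enumerate(keys):
        # Add exponents instead of multiplying values; the max term acts
        # as a cheap proxy for the dot-product magnitude.
        proxy = max(exp_of(a) + exp_of(b) for a, b in zip(q, key))
        scores.append((proxy, idx))
    scores.sort(reverse=True)
    return [idx for _, idx in scores[:k]]

q = [0.9, -0.01, 0.3]
keys = [[1.0, 0.2, -0.1], [0.01, 0.02, 0.01], [0.5, -0.6, 0.9]]
print(predict_topk_keys(q, keys, k=2))  # indices of keys predicted important
```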
{"title":"Designing Spatial Architectures for Sparse Attention: STAR Accelerator via Cross-Stage Tiling","authors":"Huizheng Wang;Taiquan Wei;Hongbin Wang;Zichuan Wang;Xinru Tang;Zhiheng Yue;Shaojun Wei;Yang Hu;Shouyi Yin","doi":"10.1109/TC.2025.3648055","DOIUrl":"https://doi.org/10.1109/TC.2025.3648055","url":null,"abstract":"Large language models (LLMs) rely on self-attention for contextual understanding, demanding high-throughput inference and large-scale token parallelism (LTPP). Existing dynamic sparsity accelerators falter under LTPP scenarios due to stage-isolated optimizations. Revisiting the end-to-end sparsity acceleration flow, we identify an overlooked opportunity: cross-stage coordination can substantially reduce redundant computation and memory access. We propose STAR, a cross-stage compute- and memory-efficient algorithm–hardware co-design tailored for Transformer inference under LTPP. STAR introduces a leading-zero-based sparsity prediction using log-domain add-only operations to minimize prediction overhead. It further employs distributed sorting and a sorted updating FlashAttention mechanism, guided by a coordinated tiling strategy that enables fine-grained stage interaction for improved memory efficiency and latency. These optimizations are supported by a dedicated STAR accelerator architecture, achieving up to <inline-formula><tex-math>$9.2times$</tex-math></inline-formula> speedup and <inline-formula><tex-math>$71.2times$</tex-math></inline-formula> energy efficiency over A100, and surpassing SOTA accelerators by up to <inline-formula><tex-math>$16.1times$</tex-math></inline-formula> energy and <inline-formula><tex-math>$27.1times$</tex-math></inline-formula> area efficiency gains. Further, we deploy STAR onto a multi-core spatial architecture, optimizing dataflow and execution orchestration for ultra-long sequence processing. Architectural evaluation shows that, compared to the baseline design, Spatial-STAR achieves a <inline-formula><tex-math>$20.1times$</tex-math></inline-formula> throughput improvement.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1125-1140"},"PeriodicalIF":3.8,"publicationDate":"2025-12-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154476","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep learning on 3D point clouds plays a vital role in a wide range of applications such as AR/VR visualization, 3D cloth virtual try-on, and game rendering. Because some applications require low latency, point cloud services are also deployed in datacenters with powerful GPUs. While queries to point cloud services exhibit varied workload-change patterns due to differing degrees of sparsity, current batching-based serving schemes result in either long latency or low throughput. We propose a scheme called Volans to address the above challenges and effectively support point cloud services. Volans comprises a workload predictor, a topology deployer, and a progress-aware scheduler. The predictor grids the input query and estimates the workload changes. The deployer then splits the model into several stages and determines the batch size for each stage based on the workload changes. The scheduler reduces QoS violations when queries run slower due to unpredicted workload spikes. Experiments show that Volans improves the peak supported throughput by up to 31.1% while maintaining the required 99th-percentile latencies, compared to state-of-the-art techniques.
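As a rough illustration of the "grid the query" step, the sketch below voxelizes a point cloud and uses the number of occupied voxels as a workload proxy when forming a batch; the voxel size, budget, and batching rule are hypothetical, not Volans' implementation.

```python
# Toy workload estimation for point cloud queries: sparser clouds occupy
# fewer voxels, so the occupied-voxel count serves as a cheap workload proxy.
import math

def estimate_workload(points, voxel_size=0.05):
    occupied = {tuple(math.floor(c / voxel_size) for c in p) for p in points}
    return len(occupied)

def choose_batch_size(workloads, budget=50_000):
    # Greedily batch queries while the summed workload proxy fits the budget.
    batch_total, batch_len = 0, 0
    for w in workloads:
        if batch_total + w > budget and batch_len > 0:
            break
        batch_total += w
        batch_len += 1
    return batch_len

queries = [[(0.1, 0.2, 0.3), (0.12, 0.2, 0.31)], [(1.0, 1.0, 1.0)]]
workloads = [estimate_workload(q) for q in queries]
print(workloads, choose_batch_size(workloads))
```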
{"title":"QoS Awareness and Improved Throughput of Point Cloud Services With Dynamic Workloads","authors":"Kaihua Fu;Jiuchen Shi;Yao Chen;Quan Chen;Weng-Fai Wong;Wei Wang;Bingsheng He;Minyi Guo","doi":"10.1109/TC.2025.3648132","DOIUrl":"https://doi.org/10.1109/TC.2025.3648132","url":null,"abstract":"Deep learning on 3D point clouds plays a vital role in a wide range of applications such as AR/VR visualization, 3D cloth virtual try-on, and game rendering. As some applications require low latency, the point cloud services are also deployed on datacenter with powerful GPUs. While the queries of point cloud services show various workload change patterns due to different degrees of sparsity, current batching-based serving schemes result in either long latency or low throughput. We propose a scheme called Volans to address the above challenges and effectively support point cloud services. Volans comprises a workload predictor, a topology deployer, and a progress-aware scheduler. The predictor grids the input query and estimates the workload changes. Afterward, the deployer splits the model into several stages and determines the batch size for each stage based on the workload changes. The scheduler reduces the QoS violation when queries run slower due to unpredicted workload spikes. Experiments show that Volans enhances the peak supported throughput by up to 31.1% while maintaining the required 99%-ile latencies compared to state-of-the-art techniques.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1141-1155"},"PeriodicalIF":3.8,"publicationDate":"2025-12-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146154453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Core decomposition is a widely used hierarchical analysis algorithm for large-scale graphs. It computes the decomposition by iteratively peeling vertices, along with their adjacent edges, into different hierarchies. Given the timeliness requirements of modern applications, many researchers have introduced accelerators, particularly GPUs, to improve the computational efficiency of graph algorithms. However, the empty, sparse, and numerous hierarchies in large graphs lead to inefficient computation and parallelism, causing not only unnecessary searches for a hierarchy's vertices but also significant thread wastage when peeling off the adjacent edges of those vertices. In this paper, we propose an adaptive parallel framework for core decomposition, named AdaptiveCore. First, it improves vertex-searching efficiency by adaptively skipping empty hierarchies and reducing the search space. Moreover, it greatly improves thread utilization by adaptively allocating the available threads to peel off adjacent edges. Comprehensive experiments show that, compared with state-of-the-art works, the proposed framework achieves an average speedup of $7.1\times$ on the GPU platform and up to $2.0\times$ on the multi-core CPU platform.
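For reference, the textbook sequential peeling algorithm that AdaptiveCore parallelizes can be written in a few lines of Python; this is the baseline algorithm only, not the adaptive GPU kernel design described in the paper.

```python
# Sequential core decomposition by iterative peeling: at level k, repeatedly
# remove vertices whose remaining degree is <= k, assigning them core number k.
from collections import defaultdict

def core_decomposition(edges):
    adj = defaultdict(set)
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    degree = {v: len(nbrs) for v, nbrs in adj.items()}
    core = {}
    remaining = set(adj)
    k = 0
    while remaining:
        # One "hierarchy": every vertex whose current degree is <= k.
        peel = [v for v in remaining if degree[v] <= k]
        if not peel:
            k += 1
            continue
        while peel:
            v = peel.pop()
            if v not in remaining:
                continue
            core[v] = k
            remaining.discard(v)
            for u in adj[v]:
                if u in remaining:
                    degree[u] -= 1
                    if degree[u] <= k:
                        peel.append(u)
    return core

# Triangle 1-2-3 with pendant vertex 4: vertex 4 has core 1, the rest core 2.
print(core_decomposition([(1, 2), (2, 3), (3, 1), (3, 4)]))
```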
{"title":"AdaptiveCore: Adaptive Parallel Core Decomposition Framework","authors":"Chen Zhao;Zhigao Zheng;Hao Huang;Hao Liu;Dacheng Tao","doi":"10.1109/TC.2025.3646191","DOIUrl":"https://doi.org/10.1109/TC.2025.3646191","url":null,"abstract":"Core decomposition is a widely used hierarchical analysis algorithm for large-scale graphs. It achieves this decomposition by iteratively peeling the vertices along with their adjacency edges off into different hierarchies. With the timeliness requirements of modern applications, many researchers have introduced accelerators, particularly GPUs, to improve the computational efficiency of graph algorithms. However, the empty, sparse, and numerous hierarchies in large graphs lead to inefficient computation and parallelism, not only including unnecessary searching for the hierarchy’s vertices, but also significant thread wastage when peeling off the adjacency edges of these vertices. In this paper, we propose an adaptive parallel framework for core decomposition, named <i>AdaptiveCore</i>. First, it improves vertex searching efficiency by adaptively skipping the empty hierarchies and reducing the search space. Moreover, it greatly improves thread utilization by adaptively allocating the available threads to peel off the adjacency edges. Comprehensive experiments show that, compared with the state-of-the-art works, the proposed framework achieves an average speedup of <inline-formula><tex-math>$7.1times$</tex-math></inline-formula> on the GPU platform and up to <inline-formula><tex-math>$2.0times$</tex-math></inline-formula> on the multi-core CPU platform.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 3","pages":"1111-1124"},"PeriodicalIF":3.8,"publicationDate":"2025-12-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175870","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The practical applications of quantum computing are currently limited by the small number of available qubits. Recent advances in quantum hardware have introduced mid-circuit measurements and resets, enabling the reuse of measured qubits and thus reducing the qubit requirements for executing quantum algorithms. In this work, we present a systematic study of dynamic quantum circuit compilation, a process that transforms static quantum circuits into their dynamic equivalents with fewer qubits through qubit reuse. We establish the first graph-based framework for optimizing qubit-reuse compilation. In particular, we characterize the task of finding the optimal compilation strategy for maximizing qubit reuse using binary integer programming and provide efficient heuristic algorithms for devising general compilation strategies. We conduct a thorough analysis of quantum circuits with practical relevance and offer their optimal qubit-reuse compilation strategies. We also perform a comparative analysis against state-of-the-art approaches, demonstrating the superior performance of our methods in both structured and random quantum circuits. Our framework lays a rigorous foundation for understanding dynamic quantum circuit compilation via qubit reuse, holding significant promise for the practical implementation of large-scale quantum algorithms on quantum computers with limited resources.
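A heavily simplified view of qubit reuse treats each logical qubit as an interval from its first gate to its measurement and greedily reuses measured-and-reset qubits, as in the Python sketch below. This is an illustrative interval-scheduling model only: it ignores the gate-dependency (causal) constraints that the paper's graph-based formulation and binary integer program actually handle.

```python
# Greedy lower-bound-style sketch of qubit reuse as interval scheduling.
import heapq

def min_physical_qubits(lifetimes):
    """lifetimes: list of (first_gate_time, measure_time) per logical qubit."""
    free_at = []  # min-heap of times at which a physical qubit becomes reusable
    count = 0
    for start, end in sorted(lifetimes):
        if free_at and free_at[0] <= start:
            heapq.heapreplace(free_at, end)   # reuse a measured-and-reset qubit
        else:
            count += 1                         # allocate a fresh physical qubit
            heapq.heappush(free_at, end)
    return count

# Four logical qubits; two of them can run on hardware freed by earlier ones,
# so only two physical qubits are needed in this simplified model.
print(min_physical_qubits([(0, 3), (1, 5), (3, 7), (6, 9)]))  # -> 2
```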
{"title":"Dynamic Quantum Circuit Compilation","authors":"Kun Fang;Munan Zhang;Ruqi Shi;Yinan Li","doi":"10.1109/TC.2025.3643826","DOIUrl":"https://doi.org/10.1109/TC.2025.3643826","url":null,"abstract":"The practical applications of quantum computing are currently limited by the small number of available qubits. Recent advances in quantum hardware have introduced mid-circuit measurements and resets, enabling the reuse of measured qubits and thus reducing the qubit requirements for executing quantum algorithms. In this work, we present a systematic study of dynamic quantum circuit compilation, a process that transforms static quantum circuits into their dynamic equivalents with fewer qubits through qubit reuse. We establish the first graph-based framework for optimizing qubit-reuse compilation. In particular, we characterize the task of finding the optimal compilation strategy for maximizing qubit reuse using binary integer programming and provide efficient heuristic algorithms for devising general compilation strategies. We conduct a thorough analysis of quantum circuits with practical relevance and offer their optimal qubit-reuse compilation strategies. We also perform a comparative analysis against state-of-the-art approaches, demonstrating the superior performance of our methods in both structured and random quantum circuits. Our framework lays a rigorous foundation for understanding dynamic quantum circuit compilation via qubit reuse, holding significant promise for the practical implementation of large-scale quantum algorithms on quantum computers with limited resources.","PeriodicalId":13087,"journal":{"name":"IEEE Transactions on Computers","volume":"75 2","pages":"748-759"},"PeriodicalIF":3.8,"publicationDate":"2025-12-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145963443","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}