Pub Date : 2025-11-06 | DOI: 10.1109/LCA.2025.3630094
Tianyao Shi;Yanran Wu;Sihang Liu;Yi Ding
Large language models (LLMs) are increasingly deployed in practice but incur significant computational costs and environmental impacts. Disaggregated serving techniques, particularly decoupling prefill and decoding (DPD) across GPUs, have been introduced to improve performance and reduce carbon emissions. However, DPD suffers from high bandwidth overhead due to frequent large KV cache transfers. To address this, we present disaggregated speculative decoding (DSD), which leverages speculative decoding by assigning draft models to older GPUs and target models to newer GPUs, requiring only token and probability distribution transfers. Building on this insight, we introduce GreenLLM, an SLO- and bandwidth-aware framework that unifies DPD and DSD, profiles workload characteristics, and dynamically selects the most carbon-efficient configuration. Across diverse benchmarks, GreenLLM reduces carbon emissions by up to 40.6% while meeting latency SLOs.
{"title":"Disaggregated Speculative Decoding for Carbon-Efficient LLM Serving","authors":"Tianyao Shi;Yanran Wu;Sihang Liu;Yi Ding","doi":"10.1109/LCA.2025.3630094","DOIUrl":"https://doi.org/10.1109/LCA.2025.3630094","url":null,"abstract":"Large language models (LLMs) are increasingly deployed in practice but incur significant computational costs and environmental impacts. Disaggregated serving techniques, particularly decoupling prefill and decoding (DPD) across GPUs, have been introduced to improve performance and reduce carbon emissions. However, DPD suffers from high bandwidth overhead due to frequent large KV cache transfers. To address this, we present disaggregated speculative decoding (DSD), which leverages speculative decoding by assigning draft models to older GPUs and target models to newer GPUs, requiring only token and probability distribution transfers. Building on this insight, we introduce GreenLLM, an SLO- and bandwidth-aware framework that unifies DPD and DSD, profiles workload characteristics, and dynamically selects the most carbon-efficient configuration. Across diverse benchmarks, GreenLLM reduces carbon emissions by up to 40.6% while meeting latency SLOs.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"369-372"},"PeriodicalIF":1.4,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-05 | DOI: 10.1109/LCA.2025.3629390
Xiaoyu Sun;Haruki Mori;Wei-Chang Zhao;Je-Min Hung;Hidehiro Fujiwara;Brian Crafton;Bo Zhang;Win-San Khwa;Yu-Der Chih;Tsung-Yung Jonathan Chang;Kerem Akarvardar
Digital Compute-in-Memory (DCIM) has emerged as a promising solution for accelerating Artificial Intelligence (AI) workloads, especially at the edge. However, DCIM macros may lead to inefficiencies due to frequent weight reloads associated with the block-wise (tiled) computation (BWC) scheme, which is necessitated by limited accumulator buffer capacity. To address this challenge, we leverage a multi-storage-row (MSR) architecture that integrates multiple storage cells per multiplier within a DCIM macro. This approach transforms costly weight buffer accesses into highly efficient local row-switching (RS) operations, significantly reducing weight reloading overhead and enhancing accelerator performance. Using technology parameters extracted from our 3 nm SRAM-based MSR-DCIM testchip, coupled with architectural statistics from our in-house analytical framework, we evaluated power, performance, and area (PPA) across several Deep Neural Network (DNN) workloads. Our findings on four edge DNN workloads, including convolutional neural networks (CNNs) and large language models (LLMs), demonstrate that the MSR architecture facilitates BWC-based execution with reduced latency and energy consumption compared to a baseline DCIM macro. A case study reveals that an MSR-DCIM macro with 16 storage rows achieves 15%–55% savings in energy-delay product (EDP) across the selected workloads, underscoring its potential for efficient edge AI acceleration.
{"title":"Enhancing DCIM Efficiency with Multi-Storage-Row Architecture for Edge AI Workloads","authors":"Xiaoyu Sun;Haruki Mori;Wei-Chang Zhao;Je-Min Hung;Hidehiro Fujiwara;Brian Crafton;Bo Zhang;Win-San Khwa;Yu-Der Chih;Tsung-Yung Jonathan Chang;Kerem Akarvardar","doi":"10.1109/LCA.2025.3629390","DOIUrl":"https://doi.org/10.1109/LCA.2025.3629390","url":null,"abstract":"Digital Compute-in-Memory (DCIM) has emerged as a promising solution for accelerating Artificial Intelligence (AI) workloads, especially at the edge. However, DCIM macros may lead to inefficiencies due to frequent weight reloads associated with the block-wise (tiled) computation (BWC) scheme, required due to limited accumulator buffer capacity. To address this challenge, we leverage a multi-storage-row (MSR) architecture that integrates multiple storage cells per multiplier within a DCIM macro. This approach transforms costly weight buffer accesses into highly efficient local row-switching (RS) operations, significantly reducing weight reloading overhead and enhancing accelerator performance. Using technology parameters extracted from our 3 nm SRAM-based MSR-DCIM testchip, coupled with architectural statistics from our in-house analytical framework, we evaluated power, performance, and area (PPA) across several Deep Neural Network (DNN) workloads. Our findings on four edge DNN workloads, including convolutional neural networks (CNNs) and large language models (LLMs), demonstrate that MSR architecture facilitates the BWC-based execution with reduced latency and energy consumption compared to a baseline DCIM macro. A case study reveals that an MSR-DCIM macro with 16-storage-row achieves 15%–55% savings in energy-delay product (EDP) across the selected workloads, underscoring its potential for efficient edge AI acceleration.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"377-380"},"PeriodicalIF":1.4,"publicationDate":"2025-11-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-04 | DOI: 10.1109/LCA.2025.3620629
Elham Adibi;Mohammadamin Ajdari;Pouria Arefijamal;Amirsaeed Ahmadi-Tonekaboni;Hossein Asadi
Training Machine Learning (ML) models commonly relies on High-Performance Computing (HPC) centers or cloud servers that accommodate compute nodes with powerful resources. Users tend to enhance the accuracy of ML applications through continuous retraining after model refinements and dataset size increases. With such application changes and the heterogeneity of HPC nodes, knowing each job's execution time in advance (i.e., prediction) is necessary for efficient job scheduling. We observe that I/O accesses strongly influence the execution time of modern ML applications. Unfortunately, existing studies on estimating job execution time either (a) rely on overestimated user-declared time, or (b) predict execution time mainly based on compute resources (ignoring I/O or storage effects) and use complex deep learning models for this purpose. In this paper, we propose a simple yet effective method for predicting the execution time of ML training. Our approach explicitly accounts for I/O accesses as a critical factor. Our method combines (a) partial application execution & monitoring, (b) analytical modeling leveraging ML application characteristics, (c) dynamic re-estimation, and (d) simplified history-based analysis. Our evaluation on a number of Convolutional Neural Networks (CNNs) and Transformer models shows that our proposed method predicts the execution time accurately (i.e., with less than 8% error in most cases) compared to actual execution.
{"title":"I/O-ETEM: An I/O-Aware Approach for Estimating Execution Time of Machine Learning Workloads","authors":"Elham Adibi;Mohammadamin Ajdari;Pouria Arefijamal;Amirsaeed Ahmadi-Tonekaboni;Hossein Asadi","doi":"10.1109/LCA.2025.3620629","DOIUrl":"https://doi.org/10.1109/LCA.2025.3620629","url":null,"abstract":"Training <italic>Machine Learning</i> (ML) models commonly rely on <italic>High-Performance Computing</i> (HPC) centers or cloud servers that accommodate compute nodes with powerful resources. Users tend to enhance the accuracy of ML applications by continuous training, after model refinements, and dataset size increase. With such application changes, and heterogeneity of HPC nodes, knowing each job execution time in advance (i.e., <italic>prediction</i>) is necessary for efficient job scheduling. We observe that I/O accesses highly influence the executing time of modern ML applications. Unfortunately, existing studies on estimating job execution time either (a) rely on overestimated user declared time, or (b) predict execution time mainly based on compute resources (ignoring I/O or storage effects), and use complex deep learning models for this purpose. <bold>In this paper</b>, we propose a simple, yet effective method for predicting the execution time of ML training. Our approach explicitly accounts for I/O accesses as a critical factor. Our method combines (a) partial application execution & monitoring, (b) analytical modeling leveraging ML application characteristics, (c) dynamic re-estimation, and (d) simplified history-based analysis. Our evaluation on a number of <italic>Convolutional Neural Networks</i> (CNNs) and Transformer models show that our proposed method predicts the execution time accurately (i.e., with error less than 8% for most cases) compared to actual execution.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"325-328"},"PeriodicalIF":1.4,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455740","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-04 | DOI: 10.1109/LCA.2025.3628805
Teresa Zhang
Main memory compression has re-emerged as a viable solution to DRAM cost challenges, enabled by CXL memory devices that support hardware-managed, coarse-grained compression. This shift decouples logical memory space usage from physical consumption, opening new opportunities to re-think the design and implementation of in-memory data structures. This letter presents case studies on blocked variants of chained and Cuckoo hashing that align naturally with the memory compression granularity. Targeting large-scale in-memory data stores, blocked hash tables trade logical sparsity for implementation efficiency while leveraging compression to avoid physical memory waste. We develop mathematical formulations to enable theoretical analysis of key metrics, including memory usage saving and operational throughput. Our results highlight the potential of compression-aware data structures to better leverage modern memory hierarchies and motivate further exploration in this largely untapped design space.
{"title":"Rethinking In-Memory Hash Table Design for CXL-Based Main Memory Compression","authors":"Teresa Zhang","doi":"10.1109/LCA.2025.3628805","DOIUrl":"https://doi.org/10.1109/LCA.2025.3628805","url":null,"abstract":"Main memory compression has re-emerged as a viable solution to DRAM cost challenges, enabled by CXL memory devices that support hardware-managed, coarse-grained compression. This shift decouples logical memory space usage from physical consumption, opening new opportunities to re-think the design and implementation of in-memory data structures. This letter presents case studies on <italic>blocked</i> variants of chained and Cuckoo hashing that align naturally with the memory compression granularity. Targeting large-scale in-memory data stores, blocked hash tables trade logical sparsity for implementation efficiency while leveraging compression to avoid physical memory waste. We develop mathematical formulations to enable theoretical analysis of key metrics, including memory usage saving and operational throughput. Our results highlight the potential of compression-aware data structures to better leverage modern memory hierarchies and motivate further exploration in this largely untapped design space.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"357-360"},"PeriodicalIF":1.4,"publicationDate":"2025-11-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560752","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-11-03 | DOI: 10.1109/LCA.2025.3628325
Jaehong Cho;Hyunmin Choi;Jongse Park
This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5 × fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
{"title":"LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure","authors":"Jaehong Cho;Hyunmin Choi;Jongse Park","doi":"10.1109/LCA.2025.3628325","DOIUrl":"https://doi.org/10.1109/LCA.2025.3628325","url":null,"abstract":"This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5 × fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"361-364"},"PeriodicalIF":1.4,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-31 | DOI: 10.1109/LCA.2025.3626929
Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim
As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15 × improvement in tokens-per-second throughput, with only 0.013 mm² area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.
{"title":"StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models","authors":"Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3626929","DOIUrl":"https://doi.org/10.1109/LCA.2025.3626929","url":null,"abstract":"As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15 × improvement in tokens-per-second throughput, with only 0.013 <inline-formula><tex-math>${mathrm{mm}}^{2}$</tex-math></inline-formula> area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"373-376"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-31 | DOI: 10.1109/LCA.2025.3627539
Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7 × over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
{"title":"MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference","authors":"Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3627539","DOIUrl":"https://doi.org/10.1109/LCA.2025.3627539","url":null,"abstract":"The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces <italic>Mixture of Shared KV Attention (MoSKA)</i>, an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of <italic>MoSKA</i> is a novel <italic>Shared KV Attention</i> mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an <italic>MoE-inspired sparse attention</i> strategy that prunes the search space and a tailored <italic>Disaggregated Infrastructure</i> that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7 × over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"365-368"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-31 | DOI: 10.1109/LCA.2025.3627101
SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng
Differential privacy (DP) has become a de facto standard in restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point arithmetic in implementing DP can significantly degrade its mathematical guarantees and have strongly advised using an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that noise generation is significantly slower (by 187–296×) when using an integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.
{"title":"Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy","authors":"SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng","doi":"10.1109/LCA.2025.3627101","DOIUrl":"https://doi.org/10.1109/LCA.2025.3627101","url":null,"abstract":"Differential privacy (DP) has become a de facto standard in restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and strongly advised to use an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that the noise generation is significantly slower (by 187–296×) when using an integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"345-348"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-23 | DOI: 10.1109/LCA.2025.3624787
Jinyu Liu;Kiwan Maeng
Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.
{"title":"In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library","authors":"Jinyu Liu;Kiwan Maeng","doi":"10.1109/LCA.2025.3624787","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624787","url":null,"abstract":"Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"341-344"},"PeriodicalIF":1.4,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2025-10-22 | DOI: 10.1109/LCA.2025.3624004
Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong
The customization of accelerators for low-bit and mixed-bit convolutional neural networks (CNNs) has been a promising approach to enhancing the computing efficiency of CNNs. However, current low-bit and mixed-bit accelerators sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs such as MobileNets, which contain depth-wise convolution (DW-CONV). These accelerators for low-bit or mixed-bit CNNs tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and it is difficult for them to achieve good experimental results in all aspects. In this work, we propose an accelerator that performs well across multiple aspects, including performance, power efficiency, and Top1 accuracy. First, an arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy, and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive dataflow is presented to support standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently without increasing the hardware consumption of the ABQ-based PE. Implemented on the Zynq ZC706 platform and compared with other works, the proposed accelerator is the first to achieve good experimental results in all aspects, delivering 1.28×–5.76× higher power efficiency and 1.11×–5.81× higher performance while maintaining the best Top1 accuracy.
{"title":"A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency","authors":"Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong","doi":"10.1109/LCA.2025.3624004","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624004","url":null,"abstract":"The customization of accelerators for low-bit and mixed-bit convolutional neural works (CNNs) has been a promising approach to enhance computing efficiency of CNNs. However, current low-bit and mix-bit accelerator sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs like MobileNets containing depth-wise convolution (DW-CONV). These accelerators for low-bit or mixed-bit CNNS tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and it is difficult to achieve good experimental results in all aspects. In this work, we propose an accelerator that perform well in multifaceted aspects including performance, power efficiency, and Top1 accuracy. First, arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive data flow is presented to support standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently in the primise of without increasing hardware consumption of the ABQ-based PE. Implemented on Zynp ZC706 platform, compared with other works, the proposed accelerator first achieve good experimental results in all aspects, achieving 1.28 × –5.76 × power efficiency and 1.11 × –5.81 × performance in the premise of the best Top1 accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"349-352"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}