UniCNet: Unified Cycle-Accurate Simulation for Composable Chiplet Network With Modular Design-Integration Workflow
Pub Date: 2026-01-13 | DOI: 10.1109/LCA.2026.3653809 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 37-40
Peilin Wang;Mingyu Wang;Zhirong Ye;Tao Lu;Zhiyi Yu
Composable chiplet-based architectures adopt a two-stage design flow, chiplet design followed by modular integration, while still presenting a shared-memory, single-OS system view. However, the heterogeneous and modular nature of the resulting network can introduce performance inefficiencies and functional correctness issues, calling for an advanced simulation tool. This paper introduces UniCNet, an open-source, unified, cycle-accurate network simulator for composable chiplet-based architectures. To achieve both accurate modeling and unified simulation, UniCNet employs a design-integration workflow that closely mirrors the composable chiplet design flow. It supports several key features oriented toward composable chiplet scenarios and introduces the first cycle-level chiplet protocol interface model. UniCNet also supports multi-threaded simulation, achieving up to 4× speedup with no loss of cycle accuracy. We validate UniCNet against RTL models and demonstrate its utility through several case studies.
LACIN: Linearly Arranged Complete Interconnection Networks
Pub Date: 2025-12-30 | DOI: 10.1109/LCA.2025.3649284 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 29-32
Ramón Beivide;Cristóbal Camarero;Carmen Martínez;Enrique Vallejo;Mateo Valero
Several interconnection networks are based on the complete graph topology. Networks of moderate size can be based on a single complete graph, whereas large-scale networks such as Dragonfly and HyperX use, respectively, a hierarchical or a multi-dimensional composition of complete graphs. The number of links in these networks is huge and grows rapidly with network size. This paper introduces LACIN, a set of complete graph implementations that use identically indexed ports to link switches. Implementing the network this way reduces the complexity of both its cabling and its routing. LACIN eases the deployment of networks for parallel computers of different scales, from VLSI systems to the largest supercomputers.
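For intuition about what "identically indexed ports" means, the sketch below wires a single complete graph K_n (n even) using the classical round-robin 1-factorization, so that each link occupies the same port index at both of its endpoint switches. This is only an illustrative construction under that assumption; LACIN's actual arrangement may differ.

```python
from collections import defaultdict

def identically_indexed_links(n):
    """Wire K_n (n even) so each link uses the same port index at both endpoint switches."""
    assert n % 2 == 0, "this particular construction needs an even number of switches"
    links = {}
    for port in range(n - 1):                       # one perfect matching per port index
        links[tuple(sorted((n - 1, port)))] = port  # the fixed switch n-1 pairs with switch `port`
        for k in range(1, n // 2):
            a = (port + k) % (n - 1)
            b = (port - k) % (n - 1)
            links[tuple(sorted((a, b)))] = port
    return links

links = identically_indexed_links(8)
assert len(links) == 8 * 7 // 2                     # all 28 edges of K_8 appear exactly once

ports_at = defaultdict(set)
for (a, b), p in links.items():
    ports_at[a].add(p)
    ports_at[b].add(p)
assert all(ports == set(range(7)) for ports in ports_at.values())  # each switch uses every port once
```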
Inspex: Speculative Execution of Ready-to-Execute Loads in In-Order Cores
Pub Date: 2025-12-26 | DOI: 10.1109/LCA.2025.3646976 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 17-20 | Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11316409
Yotaro Nada;Toru Koizumi;Ryota Shioya;Hidetsugu Irie;Shuichi Sakai
In-order (InO) cores are processor cores that execute instructions in program order. Because of their low complexity, InO cores are widely used where energy efficiency and small circuit area are required, but they provide limited performance. We focus on stalls in InO cores caused by load instructions and their consumer instructions; these stalls significantly degrade performance, accounting for 70% of the total execution time on SPEC CPU 2017. We found that many of these load instructions are ready to execute and could have been issued earlier. Based on this observation, we propose Inspex, which improves the performance of InO cores while maintaining their simplicity. Inspex predicts ready-to-execute loads and speculatively pre-executes them. This makes load results available earlier, removing the stalls incurred by load consumers. Our simulation results show that Inspex improves performance by 25.6%, reduces energy consumption by 10.6%, and reduces the energy-delay product by 28.0% compared with a baseline InO core, while incurring an area overhead of only 0.65%.
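A toy back-of-the-envelope model of the stall being attacked (not the paper's mechanism, and the cycle counts are made up): if a load's result arrives some cycles after its consumer wants to issue, an in-order pipeline stalls for the difference, and pre-executing the load that many cycles earlier shrinks or removes the stall.

```python
def consumer_stall_cycles(load_latency, dist_to_consumer, pre_execute_lead=0):
    """Cycles an in-order consumer waits for a load result.

    load_latency      : cycles from load issue to result availability
    dist_to_consumer  : cycles between the load's issue slot and its consumer's issue slot
    pre_execute_lead  : cycles by which the load is (speculatively) pre-executed ahead of its slot
    """
    result_ready = load_latency - pre_execute_lead
    return max(0, result_ready - dist_to_consumer)

# Baseline in-order core: 4-cycle load, consumer immediately follows -> 3 stall cycles.
assert consumer_stall_cycles(load_latency=4, dist_to_consumer=1) == 3
# If the load's operands were ready 3 cycles earlier and it is pre-executed, the stall disappears.
assert consumer_stall_cycles(load_latency=4, dist_to_consumer=1, pre_execute_lead=3) == 0
```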
Enabling Cost-Efficient LLM Inference on Mid-Tier GPUs With NMP DIMMs
Pub Date: 2025-12-22 | DOI: 10.1109/LCA.2025.3646622 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 21-24
Heewoo Kim;Alan La;Joseph Izraelevitz
Large Language Models (LLMs) require substantial computational resources, making cost-efficient inference challenging. Scaling out with mid-tier GPUs (e.g., NVIDIA A10) appears attractive for LLMs, but our characterization shows that communication bottlenecks prevent them from matching high-end GPUs (e.g., 4 × A100). Using 16 × A10 GPUs, we find that the decode stage, which dominates inference runtime, is memory-bandwidth-bound in matrix multiplications, I/O-bandwidth-bound in AllReduce, and leaves compute resources underutilized. These traits make it well suited to both DIMM-based Near-Memory Processing (NMP) offloading and communication quantization. Analytical modeling shows that a 16 × A10 system with NMP DIMMs and INT8 communication quantization can match 4 × A100 performance at 30% lower cost and even surpass it at equal cost. These results demonstrate the potential of our approach for cost-efficient LLM inference on mid-tier GPUs.
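The cost-efficiency arithmetic implied by the claim can be made explicit with normalized placeholder numbers; only the relative figures ("matches 4 × A100 performance", "30% lower cost") come from the letter, and the absolute cost is a stand-in.

```python
# Hypothetical cost-efficiency arithmetic; COST_4xA100 is a normalized placeholder.
COST_4xA100 = 1.00                      # normalized system cost of the high-end baseline
PERF_4xA100 = 1.00                      # normalized decode throughput of the baseline

cost_a10_nmp = 0.70 * COST_4xA100       # "30% lower cost" (from the abstract)
perf_a10_nmp = 1.00 * PERF_4xA100       # "can match 4xA100 performance" (from the abstract)

perf_per_dollar_a100 = PERF_4xA100 / COST_4xA100
perf_per_dollar_a10  = perf_a10_nmp / cost_a10_nmp
print(f"relative perf/$ of 16xA10+NMP vs 4xA100: {perf_per_dollar_a10 / perf_per_dollar_a100:.2f}x")
# -> ~1.43x; scaling the A10 system back up to equal cost is how it can surpass the baseline.
```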
Understanding the Performance Behaviors of End-to-End Protein Design Pipelines on GPUs
Pub Date: 2025-12-19 | DOI: 10.1109/LCA.2025.3646250 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 9-12
Jinwoo Hwang;Yeongmin Hwang;Tadiwos Meaza;Hyeonbin Bae;Jongse Park
Recent computational advances enable protein design pipelines to run end-to-end on GPUs, yet their heterogeneous computational behaviors remain undercharacterized at the system level. We implement and profile a representative pipeline at both component and full-pipeline granularities across varying inputs and hyperparameters. Our characterization identifies generally low GPU utilization and high sensitivity to sequence length and sampling strategies. We outline future research directions based on these insights and release an open-source pipeline and profiling scripts to facilitate further studies.
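A minimal way to reproduce this kind of per-stage GPU-utilization characterization, assuming an NVIDIA GPU and the pynvml package; the stage callable named in the usage comment is hypothetical and stands in for whatever the released pipeline exposes.

```python
import threading, time
import pynvml

def profile_stage(stage_fn, *args, interval_s=0.05):
    """Run one pipeline stage while sampling GPU utilization in the background."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(0)
    samples, stop = [], threading.Event()

    def sampler():
        while not stop.is_set():
            samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
            time.sleep(interval_s)

    t = threading.Thread(target=sampler, daemon=True)
    t.start()
    start = time.time()
    result = stage_fn(*args)
    elapsed = time.time() - start
    stop.set(); t.join()
    pynvml.nvmlShutdown()
    mean_util = sum(samples) / max(1, len(samples))
    print(f"{stage_fn.__name__}: {elapsed:.1f}s, mean GPU utilization {mean_util:.0f}%")
    return result

# e.g. profile_stage(run_structure_generation, design_spec)   # hypothetical stage callable
```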
Exploring KV Cache Quantization in Multimodal Large Language Model Inference
Pub Date: 2025-12-19 | DOI: 10.1109/LCA.2025.3646170 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 13-16
Hyesung Ahn;Ranggi Hwang;Minsoo Rhu
Multimodal large language models (MLLMs) have demonstrated strong performance across modalities, such as image, video, and audio understanding, by leveraging large language models (LLMs) as a backbone. However, a critical challenge in MLLM inference is the large memory capacity required for the key–value (KV) cache, particularly when processing high-resolution images. This pressure often forces heterogeneous CPU–GPU systems to offload the KV cache to CPU memory, introducing substantial transfer latency. KV cache quantization is a promising way to reduce this memory demand, yet it remains underexplored for MLLM inference. In this work, we characterize MLLM inference and present a text-centric KV cache quantization method that retains only 10% of tokens in high precision while quantizing the rest. Our method reduces Time-To-First-Token (TTFT) by 1.7× and Time-Per-Output-Token (TPOT) by 4.3×, with negligible accuracy loss.
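A rough sketch of the mixed-precision idea: keep roughly 10% of tokens at full precision and store the rest as per-token INT8. The selection policy below (simply the first fraction of tokens) is a placeholder and does not reproduce the paper's text-centric policy or its exact quantization scheme.

```python
import numpy as np

def quantize_kv(kv, keep_mask):
    """kv: (tokens, dim) float array; keep_mask: (tokens,) bool, True = keep at full precision."""
    rest = kv[~keep_mask]
    scale = np.abs(rest).max(axis=1, keepdims=True) / 127.0 + 1e-12     # per-token INT8 scale
    return {"fp": kv[keep_mask],
            "q": np.round(rest / scale).astype(np.int8),
            "scale": scale,
            "mask": keep_mask}

def dequantize_kv(packed):
    out = np.empty((packed["mask"].shape[0], packed["fp"].shape[1]), dtype=np.float32)
    out[packed["mask"]] = packed["fp"]                                   # high-precision tokens pass through
    out[~packed["mask"]] = packed["q"].astype(np.float32) * packed["scale"]
    return out

kv = np.random.randn(1000, 128).astype(np.float32)
keep = np.zeros(1000, dtype=bool)
keep[:100] = True                                # ~10% of tokens kept in full precision
recon = dequantize_kv(quantize_kv(kv, keep))
print("max error on quantized tokens:", np.abs(recon - kv)[~keep].max())
```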
CrossFetch: A Prefetching Scheme for Cross-Page Prefetching in the Physical Address Space
Pub Date: 2025-12-08 | DOI: 10.1109/LCA.2025.3640965 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 1-4 | Open Access PDF: https://ieeexplore.ieee.org/stamp/stamp.jsp?tp=&arnumber=11282456
Qi Shao;Per Stenstrom
Prefetching is an important technique for reducing the miss penalty in deep memory hierarchies that employ multiple levels of cache and memory. Unfortunately, state-of-the-art techniques avoid prefetching across page boundaries in physically addressed memory because contiguous virtual pages are not guaranteed to map to contiguous physical pages. Apart from low accuracy, prefetching across page boundaries can break protection domains, opening up security vulnerabilities. This paper proposes CrossFetch, the first prefetching technique that accurately and securely prefetches data across physical page boundaries. It uses a simple and novel translation mechanism that combines a conventional TLB, called the forward TLB (FTLB), with a reverse TLB (RTLB) that maps physical pages back to virtual pages. CrossFetch leverages the conventional page table walker, invoked by a conventional TLB, to load mappings into the FTLB and RTLB. The paper demonstrates how CrossFetch can hide far-memory misses in hybrid main-memory systems as well as last-level cache misses. We show that CrossFetch improves IPC by 5.7% (up to 27.7%) over intra-page prefetchers on SPEC2017 benchmarks where tolerating far-memory misses dominates.
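A conceptual model of the lookup path the abstract describes, with dictionaries standing in for the FTLB and RTLB; structure sizes, replacement, permission checks, and timing are omitted, and the mappings are made up.

```python
PAGE = 4096

ftlb = {0x10: 0x7A3, 0x11: 0x1C8}       # virtual page -> physical page (conventional/forward TLB)
rtlb = {p: v for v, p in ftlb.items()}  # physical page -> virtual page (reverse TLB)

def cross_page_prefetch_target(miss_paddr, stride=64):
    """Return the physical prefetch address, even when the next block crosses a page boundary."""
    next_paddr = miss_paddr + stride
    if next_paddr // PAGE == miss_paddr // PAGE:
        return next_paddr                             # same physical page: prefetch directly
    vpage = rtlb.get(miss_paddr // PAGE)              # RTLB: recover the virtual page of the miss
    if vpage is None or (vpage + 1) not in ftlb:      # missing mapping: defer until a walk refills the TLBs
        return None
    ppage_next = ftlb[vpage + 1]                      # FTLB: translate the *next virtual* page
    return ppage_next * PAGE + (next_paddr % PAGE)    # prefetch stays within the owning mapping

# A miss at the last block of physical page 0x7A3 prefetches into 0x1C8, not into adjacent 0x7A4.
assert cross_page_prefetch_target(0x7A3 * PAGE + PAGE - 64) == 0x1C8 * PAGE
```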
LeakDiT: Diffusion Transformers for Trace-Augmented Side-Channel Analysis
Pub Date: 2025-12-02 | DOI: 10.1109/LCA.2025.3639372 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 5-8
Insup Lee;Daehyeon Bae;Seokhie Hong;Sangjin Lee
Deep learning has been extensively used in side-channel analysis (SCA), making trace data insufficiency and class imbalance a critical challenge. Although several studies have explored trace augmentation with generative models, two core limitations remain: (i) insufficient integration of SCA domain knowledge into the models and (ii) limited adoption of state-of-the-art diffusion transformers (DiT). This letter presents LeakDiT, a domain-specific one-dimensional DiT that generates high-quality traces. LeakDiT introduces a loss based on normalized inter-class variance (NICV) to produce realistic traces that preserve the leakage structure. Experimental results demonstrate that LeakDiT improves SCA performance and reduces the number of required traces for key recovery.
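NICV is a standard side-channel metric, commonly defined per time sample as the variance of the class-conditional means divided by the total variance. The sketch below computes it with NumPy (assuming roughly balanced classes); it does not reproduce how LeakDiT folds NICV into its training loss.

```python
import numpy as np

def nicv(traces, labels):
    """traces: (n, samples) array; labels: (n,) class labels. Returns per-sample NICV in [0, 1]."""
    classes = np.unique(labels)
    class_means = np.stack([traces[labels == c].mean(axis=0) for c in classes])
    # Unweighted variance over class means: a common approximation for balanced classes.
    return class_means.var(axis=0) / (traces.var(axis=0) + 1e-12)

# Synthetic check: sample 40 leaks the class label, the rest is noise, so NICV peaks at index 40.
rng = np.random.default_rng(0)
labels = rng.integers(0, 4, size=2000)
traces = rng.normal(size=(2000, 100))
traces[:, 40] += labels                    # inject class-dependent leakage at one time sample
assert nicv(traces, labels).argmax() == 40
```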
Another Mirage of Breaking MIRAGE: Debunking Occupancy-Based Side-Channel Attacks on Fully Associative Randomized Caches
Pub Date: 2025-11-27 | DOI: 10.1109/LCA.2025.3638260 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 25-28
Chris Cao;Gururaj Saileshwar
A recent work presented at USENIX Security 2025, Systematic Evaluation of Randomized Cache Designs against Cache Occupancy (RCO), claims that cache-occupancy-based side-channel attacks can recover AES keys from the MIRAGE randomized cache. In this paper, we examine these claims and find that they arise from flawed modeling of randomized caches in RCO. Critically, we find that the security properties of randomized caches strongly depend on the seeding methodology used to initialize the random number generators (RNGs) in these caches. RCO's modeling uses a constant seed to initialize the cache RNGs for each simulated AES encryption, causing every simulated encryption to artificially evict the same sequence of cache lines. This departs from accurate modeling of such randomized caches, where eviction sequences vary randomly across program executions. We observe that with accurate modeling, where the RNG seed is randomized in each simulation, the correlations between AES T-table accesses and attacker observations disappear and the attack fails. These findings show that the previously claimed leakage is due to flawed modeling and that, with correct modeling, MIRAGE does not leak AES keys via occupancy-based side channels.
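The modeling flaw at issue can be shown in miniature: if the RNG that drives a randomized cache's eviction choices is seeded with the same constant before every simulated encryption, every run evicts an identical victim sequence, whereas re-seeding from fresh entropy does not. The cache model below is deliberately trivial and stands in for any randomized eviction policy.

```python
import random, secrets

def eviction_sequence(seed, n_evictions=5, n_skews=2, ways=8):
    """Toy model: the RNG picks a (skew, way) victim for each eviction."""
    rng = random.Random(seed)
    return [(rng.randrange(n_skews), rng.randrange(ways)) for _ in range(n_evictions)]

constant_seed_runs = [eviction_sequence(seed=42) for _ in range(3)]
random_seed_runs   = [eviction_sequence(seed=secrets.randbits(64)) for _ in range(3)]

assert all(run == constant_seed_runs[0] for run in constant_seed_runs)   # artificial repeatability
assert len({tuple(run) for run in random_seed_runs}) > 1                 # realistic: sequences differ
```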
Fusing Adds and Shifts for Efficient Dot Products
Pub Date: 2025-11-26 | DOI: 10.1109/LCA.2025.3637718 | IEEE Computer Architecture Letters, vol. 25, no. 1, pp. 33-36
Pavel Golikov;Karthik Ganesan;Gennady Pekhimenko;Mark C. Jeffrey
Dot products are heavily used in applications such as graphics, signal processing, navigation, and artificial intelligence (AI). AI models in particular impose significant computational demands on modern computers. Current accelerators typically implement dot product hardware as a row of multipliers followed by a tree of adders. However, treating the multiplication and summation operations separately leads to sub-optimal hardware. In contrast, we obtain significant area savings by considering the dot product operation as a whole. We propose FASED, which fuses components of a Booth multiplier with the adder tree to eliminate a significant fraction of the full adders in a baseline INT8×INT8,4,2 design. Compared with popular dot product hardware units, FASED reduces area by up to 1.9×.
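To make the fusion idea concrete, the sketch below shows the two ingredients in software: radix-4 Booth recoding turns each INT8 multiplier operand into at most four signed digits in {-2, -1, 0, 1, 2}, and a dot product can then be formed by summing all lanes' shifted partial products in one shared reduction rather than multiplying per lane and adding afterwards. This illustrates the arithmetic identity only, not FASED's gate-level design.

```python
import random

def booth_radix4_digits(b, bits=8):
    """Radix-4 Booth digits of a two's-complement `bits`-wide integer b, as [(digit, shift), ...]."""
    assert -(1 << (bits - 1)) <= b < (1 << (bits - 1))
    u = b & ((1 << bits) - 1)                        # raw two's-complement bit pattern
    bit = lambda i: (u >> i) & 1 if i >= 0 else 0    # b_{-1} is defined as 0
    digits = []
    for i in range(0, bits, 2):                      # examine bit triples (i+1, i, i-1)
        d = -2 * bit(i + 1) + bit(i) + bit(i - 1)    # digit in {-2,-1,0,1,2}, weight 2**i
        if d:
            digits.append((d, i))
    return digits

def fused_dot(a_vec, b_vec, bits=8):
    addends = [(d * a) << s                          # one shifted addend per non-zero Booth digit
               for a, b in zip(a_vec, b_vec)
               for d, s in booth_radix4_digits(b, bits)]
    return sum(addends)                              # conceptually: one shared adder/compressor tree

for _ in range(1000):                                # check against the plain dot product
    a = [random.randint(-128, 127) for _ in range(16)]
    b = [random.randint(-128, 127) for _ in range(16)]
    assert fused_dot(a, b) == sum(x * y for x, y in zip(a, b))
```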