Recently, processing-in-memory (PIM) units have been deployed to accelerate matrix-vector multiplications in large language models (LLMs). However, owing to their limited flexibility, PIMs require a strict data layout for storing matrices in memory. As LLM inference operates autoregressively, new elements are appended to the stored matrices during inference, necessitating costly data layout reorganization. Since the conventional workload allocation method assigns entire matrices solely to PIMs, it incurs data layout reorganization overhead (i.e., excessive memory writes). Furthermore, the significant variance in matrix sizes exacerbates PIM load imbalance. In this letter, we propose DAWN, a novel workload allocation method. DAWN divides matrices into equally sized chunks and uses a single chunk as the allocation unit. It assigns a portion of the chunks to traditional accelerators (e.g., neural processing units), which impose no data layout constraints on computation, to mitigate reorganization overhead, and it evenly distributes the remaining chunks across PIMs using a greedy approach to achieve load balancing. Our simulation results show that DAWN improves throughput by up to 44.2% (34.8% on average) over the conventional workload allocation method.
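The chunk-based greedy allocation the abstract describes can be sketched as a toy model. This is not the paper's implementation; the function name, parameters, and the assumption that each matrix is padded to whole chunks are ours.

```python
import heapq

def allocate_chunks(matrix_rows, chunk_size, num_pims, npu_fraction):
    """Toy DAWN-style allocation: split matrices into equal chunks, offload a
    fraction to a traditional accelerator (NPU), and greedily distribute the
    rest across PIMs, always placing the next chunk on the least-loaded PIM."""
    chunks = []
    for rows in matrix_rows:
        n = -(-rows // chunk_size)            # ceil: chunks per matrix (last padded)
        chunks.extend([chunk_size] * n)
    n_npu = int(len(chunks) * npu_fraction)   # these chunks avoid PIM layout reorg
    npu_chunks, pim_chunks = chunks[:n_npu], chunks[n_npu:]
    loads = [(0, p) for p in range(num_pims)] # min-heap of (current load, pim id)
    heapq.heapify(loads)
    placement = {p: [] for p in range(num_pims)}
    for c in pim_chunks:
        load, p = heapq.heappop(loads)        # least-loaded PIM first
        placement[p].append(c)
        heapq.heappush(loads, (load + c, p))
    return npu_chunks, placement
```

Because all chunks are equally sized, the greedy heap degenerates to round-robin here; it matters when per-chunk work differs.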
{"title":"DAWN: Efficient Distribution of Attention Workload in PIM-Enabled Systems for LLM Inference","authors":"Jaehoon Chung;Jinho Han;Young-Ho Gong;Sung Woo Chung","doi":"10.1109/LCA.2026.3665202","DOIUrl":"https://doi.org/10.1109/LCA.2026.3665202","url":null,"abstract":"Recently, processing-in-memory (PIM) units have been deployed to accelerate matrix-vector multiplications in large language models (LLMs). However, due to the limited flexibility of PIMs, PIMs require a strict data layout for storing matrices in memory. As LLM inference operates autoregressively, new elements are appended to the stored matrices during inference, necessitating costly data layout reorganization. Nevertheless, since the conventional workload allocation method assigns entire matrices solely to PIMs, it causes data layout reorganization overhead (i.e., excessive memory writes). Furthermore, the significant variance in matrix sizes exacerbates PIM load imbalance. In this letter, we propose DAWN, a novel workload allocation method. DAWN divides matrices into equally sized chunks and employs a single chunk as the allocation unit. DAWN assigns a portion of chunks to traditional accelerators (e.g., neural processing units), which have no constraints on data layout for computation, to mitigate reorganization overhead. DAWN evenly distributes the remaining chunks across PIMs using a greedy approach to achieve PIM load balancing. 
Our simulation results show that DAWN improves throughput by up to 44.2% (34.8% on average) over the conventional workload allocation method.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"65-68"},"PeriodicalIF":1.4,"publicationDate":"2026-02-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147299643","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-11, DOI: 10.1109/LCA.2026.3663702
Roman K. Brunner;Rakesh Kumar
The branch target buffer (BTB) is a central component of high-performance core front-ends: it not only steers instruction fetch by uncovering upcoming control flow but also enables highly effective fetch-directed instruction prefetching. However, the massive instruction footprints of modern server applications far exceed the capacities of moderately sized BTBs, resulting in frequent misses that inevitably hurt performance. While commercial CPUs deploy large BTBs to mitigate this problem, they incur high storage and area overheads. Prior efforts to reduce BTB storage have primarily targeted branch targets, and they have proven so effective that tag storage now dominates the BTB storage budget. We make a key observation that BTBs exhibit a large degree of tag redundancy, i.e., only a small fraction of entries contain unique tags, and this fraction falls sharply as BTB capacity grows. Leveraging this insight, we propose LiteBTB, which employs a dedicated hardware structure to store each unique tag only once and replaces per-entry tags in the BTB with compact tag pointers. To avoid latency overheads, LiteBTB accesses the tag storage and the BTB in parallel. Our evaluation shows that LiteBTB reduces storage by up to 13.1% compared to the state-of-the-art BTB design, BTB-X, while maintaining equivalent performance. Alternatively, with the same storage budget, LiteBTB accommodates up to 1.125× more branches, yielding up to 2.7% performance improvement.
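The tag-deduplication idea can be illustrated with a small functional model: unique tags live once in a shared store, and each BTB entry keeps only a pointer. This is a direct-mapped toy under assumed parameters (1024 sets, 16-bit tags), not the paper's hardware organization.

```python
class LiteBTB:
    """Toy model of tag deduplication: unique tags are stored once in a tag
    store; BTB entries hold a compact pointer into it plus the branch target."""

    SETS = 1024       # assumed set count
    TAG_BITS = 16     # assumed tag width

    def __init__(self):
        self.tag_store = []   # unique tags, each stored exactly once
        self.tag_index = {}   # tag value -> pointer into tag_store
        self.entries = {}     # set index -> (tag pointer, branch target)

    def insert(self, pc, target):
        idx = pc % self.SETS
        tag = (pc // self.SETS) & ((1 << self.TAG_BITS) - 1)
        ptr = self.tag_index.get(tag)
        if ptr is None:                    # first occurrence of this tag
            ptr = len(self.tag_store)
            self.tag_store.append(tag)
            self.tag_index[tag] = ptr
        self.entries[idx] = (ptr, target)

    def lookup(self, pc):
        idx = pc % self.SETS
        entry = self.entries.get(idx)
        if entry is None:
            return None
        tag = (pc // self.SETS) & ((1 << self.TAG_BITS) - 1)
        # In hardware, the tag store and BTB are probed in parallel.
        return entry[1] if self.tag_store[entry[0]] == tag else None
```

Nearby branches share high-order PC bits, so many entries point at the same stored tag, which is the redundancy the letter exploits.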
{"title":"Driving the Core Frontend With LiteBTB","authors":"Roman K. Brunner;Rakesh Kumar","doi":"10.1109/LCA.2026.3663702","DOIUrl":"https://doi.org/10.1109/LCA.2026.3663702","url":null,"abstract":"Branch target buffer (BTB) is a central component of high performance core front-ends as it not only steers instruction fetch by uncovering upcoming control flow but also enables highly effective fetch-directed instruction prefetching. However, the massive instruction footprints of modern server applications far exceed the capacities of moderately sized BTBs, resulting in frequent misses that inevitably hurt performance. While commercial CPUs deploy large BTBs to mitigate this problem, they incur high storage and area overheads. Prior efforts to reduce BTB storage have primarily targeted branch targets, which has proven highly effective — so much that the tag storage now dominates the BTB storage budget. We make a key observation that BTBs exhibit a large degree of tag redundancy, i.e. only a small fraction of entries contain unique tags, and this fraction falls sharply as BTB capacity grows. Leveraging this insight, we propose LiteBTB, which employs a dedicated hardware structure to store unique tags only once and replaces per-entry tags in BTB with compact tag pointers. To avoid latency overheads, LiteBTB accesses the tag storage and BTB in parallel. Our evaluation shows that LiteBTB reduces storage by up to 13.1% compared to the state-of-the-art BTB design, called BTB-X, while maintaining equivalent performance. 
Alternatively, with the same storage budget, LiteBTB accommodates up to 1.125× more branches, yielding up to 2.7% performance improvement.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"57-60"},"PeriodicalIF":1.4,"publicationDate":"2026-02-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147299691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-06, DOI: 10.1109/LCA.2026.3661563
Sangbeom Jeon;Sang-Hoon Kim
Compute Express Link (CXL) enables memory expansion via high-speed cache-coherent interconnects, yet it magnifies address translation overheads due to limited TLB reach. While hugepages alleviate translation costs, they introduce severe fragmentation and compaction overheads in long-running systems. Given this trade-off, we propose the CXL Translation Layer (CTL), a device-resident mechanism that provides the host with hugepages backed by fine-grained basepages in the device. CTL preserves hugepage-level translation efficiency while achieving flexible memory management, delivering near-native performance in ideal cases and up to 16% improvement under fragmented conditions.
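A device-resident translation layer of this kind can be sketched as a second-level page table inside the device. The class below is a toy under assumed sizes (2 MiB hugepages over 4 KiB basepages); the names and interface are ours, not CTL's.

```python
class CXLTranslationLayer:
    """Toy device-resident mapping: the host sees contiguous 2 MiB hugepages,
    while the device backs each one with scattered 4 KiB basepage frames."""

    HUGE = 2 * 1024 * 1024   # assumed host-visible hugepage size
    BASE = 4 * 1024          # assumed device basepage size

    def __init__(self, free_basepages):
        self.free = list(free_basepages)   # free device basepage frame numbers
        self.maps = {}                     # hugepage number -> list of frames

    def map_hugepage(self, huge_no):
        need = self.HUGE // self.BASE      # 512 basepages back one hugepage
        if len(self.free) < need:
            raise MemoryError("device fragmented beyond capacity")
        # Frames need not be contiguous: this is the flexibility hugepages lack.
        self.maps[huge_no] = [self.free.pop() for _ in range(need)]

    def translate(self, host_addr):
        huge_no, off = divmod(host_addr, self.HUGE)
        base_idx, base_off = divmod(off, self.BASE)
        frame = self.maps[huge_no][base_idx]
        return frame * self.BASE + base_off
```

The host TLB caches one hugepage-granularity entry; only the device performs the extra basepage-level step on access.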
{"title":"CTL: A Case for CXL Device-Managed Hugepages","authors":"Sangbeom Jeon;Sang-Hoon Kim","doi":"10.1109/LCA.2026.3661563","DOIUrl":"https://doi.org/10.1109/LCA.2026.3661563","url":null,"abstract":"Compute Express Link (CXL) enables memory expansion via high-speed cache-coherent interconnects, yet it magnifies address translation overheads due to limited TLB reach. While hugepages alleviate translation costs, they introduce severe fragmentation and compaction overheads in long-running systems. Given this trade-off, we propose the CXL Translation Layer (CTL), a device-resident mechanism that provides host with hugepages using fine-grained basepages in the device. CTL preserves the hugepage-level translation efficiency while achieving flexible memory management, delivering near-native performance in ideal cases and up to 16% improvement under fragmented conditions.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"69-72"},"PeriodicalIF":1.4,"publicationDate":"2026-02-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147362484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-02-04, DOI: 10.1109/LCA.2026.3660969
Minho Ha;Euiseok Kim;Hoshik Kim
Large language model (LLM) inference requires massive memory capacity to process long sequences, posing a challenge due to the capacity limitations of high bandwidth memory (HBM). High bandwidth flash (HBF) is an emerging memory device based on NAND flash that offers HBM-comparable bandwidth with much larger capacity, but it suffers from disadvantages such as longer access latency, lower write endurance, and higher power consumption. This paper proposes H3, a hybrid architecture designed to effectively utilize both HBM and HBF by leveraging their respective strengths. By storing read-only data in HBF and other data in HBM, H3-equipped systems can process more requests at once with the same number of GPUs than HBM-only systems, making H3 suitable for gigantic read-only use cases in LLM inference, particularly those employing a shared pre-computed key-value cache. Simulation results show that a GPU system with H3 achieves up to 2.69× higher throughput per unit of power compared to an HBM-only system. This result validates the cost-effectiveness of H3 for handling LLM inference with gigantic read-only data.
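The placement rule at the heart of H3 reduces to a simple policy: read-only tensors go to HBF, writable ones to HBM. A minimal sketch, with our own function and field names:

```python
def place(tensors, hbm_capacity):
    """Toy H3-style placement: route read-only tensors (e.g., a shared
    pre-computed KV cache) to HBF, where write endurance and latency matter
    less, and keep writable tensors in HBM, checked against its capacity."""
    hbm, hbf, used = [], [], 0
    for name, size, read_only in tensors:
        if read_only:
            hbf.append(name)            # large, read-mostly: HBF tolerates this
        else:
            if used + size > hbm_capacity:
                raise MemoryError(f"HBM overflow while placing {name}")
            hbm.append(name)
            used += size
    return hbm, hbf
```

Moving the read-only bulk off HBM is what frees capacity to batch more concurrent requests per GPU.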
{"title":"H3: Hybrid Architecture Using High Bandwidth Memory and High Bandwidth Flash for Cost-Efficient LLM Inference","authors":"Minho Ha;Euiseok Kim;Hoshik Kim","doi":"10.1109/LCA.2026.3660969","DOIUrl":"https://doi.org/10.1109/LCA.2026.3660969","url":null,"abstract":"Large language model (LLM) inference requires massive memory capacity to process long sequences, posing a challenge due to the capacity limitations of high bandwidth memory (HBM). High bandwidth flash (HBF) is an emerging memory device based on NAND flash that offers HBM-comparable bandwidth with much larger capacity, but suffers from disadvantages such as longer access latency, lower write endurance, and higher power consumption. This paper proposes H<sup>3</sup>, a hybrid architecture designed to effectively utilize both HBM and HBF by leveraging their respective strengths. By storing read-only data in HBF and other data in HBM, H<sup>3</sup>-equipped systems can process more requests at once with the same number of GPUs than HBM-only systems, making H<sup>3</sup> suitable for gigantic read-only use cases in LLM inference, particularly those employing a shared pre-computed key-value cache. Simulation results show that a GPU system with H<sup>3</sup> achieves up to 2.69x higher throughput per power compared to a system with HBM-only. 
This result validates the cost-effectiveness of H<sup>3</sup> for handling LLM inference with gigantic read-only data.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"49-52"},"PeriodicalIF":1.4,"publicationDate":"2026-02-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146223586","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-30, DOI: 10.1109/LCA.2026.3659512
Junaid Ahmad Khan
Post-training 4-bit quantization is often treated as the default path for running large language models (LLMs) on prosumer GPUs: if decoding is memory-bandwidth bound, shrinking weights from FP16 to 4-bit should cut memory traffic and improve latency and energy efficiency. We revisit this assumption on an RTX 3090 (Ampere), which lacks native INT4 tensor support. For 1–8 billion parameter models (TinyLlama, Qwen-2.5, Mistral, Llama 3.1, DeepSeek-R1-8B), we compare native FP16 inference against AutoGPTQ 4-bit models and GGUF kernels in llama.cpp. On a standard Transformers+AutoGPTQ stack, interactive batch-size $B=1$ decoding is still 1.3–2.2× slower than FP16 despite a 2.4× reduction in VRAM usage, and some INT4 configurations are up to 2.4× less energy-efficient. An optimized GGUF backend improves 4-bit TinyLlama throughput by 1.65× over GPTQ, indicating that the de-quantization penalty is dominated by kernel design rather than hardware limits. We conclude that on prosumer GPUs without native INT4 tensor cores, 4-bit quantization is only attractive when paired with mature low-bit kernels; otherwise, FP16 remains the more robust choice for interactive workloads.
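The two metrics the letter compares are easy to pin down exactly. A small helper, with hypothetical numbers chosen only to illustrate the reported direction of the trend (they are not measurements from the paper):

```python
def decode_efficiency(tokens, seconds, joules):
    """Decode throughput (tokens/s) and energy efficiency (tokens/J) for one
    interactive decoding run: the two axes on which FP16 and INT4 are compared."""
    return tokens / seconds, tokens / joules

# Hypothetical B=1 runs mimicking the reported pattern: the INT4 build halves
# VRAM but, with immature de-quantization kernels, decodes more slowly and
# burns more energy per token than FP16.
fp16_tps, fp16_tpj = decode_efficiency(256, 8.0, 1200.0)
int4_tps, int4_tpj = decode_efficiency(256, 13.0, 2100.0)
```

With these numbers FP16 wins on both axes, which is the counterintuitive outcome the letter measures on hardware without native INT4 tensor cores.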
{"title":"De-Quantization Penalties for Interactive LLM Inference on Prosumer GPUs","authors":"Junaid Ahmad Khan","doi":"10.1109/LCA.2026.3659512","DOIUrl":"https://doi.org/10.1109/LCA.2026.3659512","url":null,"abstract":"Post-training 4-bit quantization is often treated as the default path for running large language models (LLMs) on prosumer GPUs: if decoding is memory-bandwidth bound, shrinking weights from FP16 to 4-bit should cut memory traffic and improve latency and energy efficiency. We revisit this assumption on an RTX 3090 (Ampere) that lacks native INT4 tensor support. For 1–8 billion parameter models (TinyLlama, Qwen-2.5, Mistral, Llama 3.1, DeepSeek-R1-8B), we compare native FP16 inference against AutoGPTQ 4-bit models and GGUF kernels in <monospace>llama.cpp</monospace>. On a standard Transformers+AutoGPTQ stack, interactive batch size <inline-formula><tex-math>$B=1$</tex-math></inline-formula> decoding is still <i>1.3–2.2×</i> <roman>slower</roman> than FP16 despite a 2.4× reduction in VRAM usage, and some INT4 configurations are up to <i>2.4×</i> <roman>less energy-efficient</roman>. An optimized GGUF backend improves 4-bit TinyLlama throughput by 1.65× over GPTQ, indicating that the de-quantization penalty is dominated by kernel design rather than hardware limits. 
We conclude that on prosumer GPUs without native INT4 tensor cores, 4-bit quantization is only attractive when paired with mature low-bit kernels; otherwise, FP16 remains the more robust choice for interactive workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"45-48"},"PeriodicalIF":1.4,"publicationDate":"2026-01-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146223588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In multi-chiplet systems, inter-chiplet shared-data transfers pose a significant bottleneck, prolonging the critical paths of memory accesses. In inter-chiplet coherence traffic, since each chiplet often needs to wait reactively for data from remote chiplets, proactive data-fetching mechanisms such as prefetching are essential to anticipate inter-chiplet data accesses and mitigate latency. Nevertheless, traditional prefetchers are inadequate for explicitly handling inter-chiplet shared-data transfers, overlooking potential prefetching opportunities. To overcome this limitation, we propose SAP, a shared-aware prefetching mechanism that minimizes inter-chiplet data access latency. By transforming the IO chiplet into an active prefetching agent, SAP proactively fetches inter-chiplet shared data before demand requests arrive, utilizing a sharing table to track recent shared-data events and a prefetch agent to initiate inter-chiplet data transfers early. Our experiments on a chiplet-based system demonstrate that SAP improves system throughput by 13.44% and reduces execution time by 12.33% compared to the prior design.
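The sharing table plus prefetch agent can be modeled as a small recency-ordered map at the IO chiplet. This is a toy sketch with our own names and a capacity we made up, not the paper's microarchitecture:

```python
from collections import OrderedDict

class SharingTable:
    """Toy model of SAP's IO-chiplet sharing table: remember which chiplet
    last consumed each recently shared cache line, so a prefetch agent can
    push the line toward that chiplet before the next demand request."""

    def __init__(self, capacity=256):   # capacity is an assumption
        self.capacity = capacity
        self.table = OrderedDict()      # line address -> last consumer chiplet

    def record(self, line, consumer):
        """Log an inter-chiplet shared-data transfer event."""
        self.table[line] = consumer
        self.table.move_to_end(line)    # keep most-recent events at the back
        if len(self.table) > self.capacity:
            self.table.popitem(last=False)   # evict the oldest sharing event

    def prefetch_target(self, line):
        """Chiplet to proactively push this line to, or None (no prefetch)."""
        return self.table.get(line)
```

The point of placing this state in the IO chiplet is that it already sees all inter-chiplet coherence traffic, so no extra probe messages are needed to populate it.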
{"title":"SAP: Shared-Aware Prefetching for Reducing Inter-Chiplet Data Access Latency","authors":"Junpei Huang;Haobo Xu;Yiming Gan;Ying Li;Wenhao Sun;Mengdi Wang;Xiaotong Wei;Feng Min;Ying Wang;Yinhe Han","doi":"10.1109/LCA.2026.3658371","DOIUrl":"https://doi.org/10.1109/LCA.2026.3658371","url":null,"abstract":"In multi-chiplet systems, inter-chiplet shared-data transfers pose a significant bottleneck, prolonging the critical paths of memory accesses. In inter-chiplet coherence traffic, since each chiplet often needs to wait reactively for data from remote chiplets, proactive data-fetching mechanisms such as prefetching are essential to anticipate inter-chiplet data accesses and mitigate latency. Nevertheless, traditional prefetchers are inadequate for explicitly handling inter-chiplet shared-data transfers, overlooking potential prefetching opportunities. To overcome this limitation, we propose SAP, a shared-aware prefetching mechanism that minimizes inter-chiplet data access latency. By transforming the IO chiplet into an active prefetching agent, SAP proactively fetches inter-chiplet shared data before demand requests arrive, utilizing a sharing table to track recent shared-data events and a prefetch agent to initiate inter-chiplet data transfers early. 
Our experiments on a chiplet-based system demonstrate that SAP improves system throughput by 13.44% and reduces execution time by 12.33% compared to the prior design.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"53-56"},"PeriodicalIF":1.4,"publicationDate":"2026-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147299648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-21, DOI: 10.1109/LCA.2026.3656423
Siyuan Ma;Bagus Hanindhito;Anushka Subramanian;Lizy K. John
When the SPEC benchmark suite was first assembled in 1989, the matrix multiplication code matrix300 was one of the 10 programs in the suite, but it was discarded within 2–3 years due to the high sensitivity of matrix multiplication to compiler optimizations. However, with the advent of machine learning (ML), neural networks, and generative AI (GenAI), matrix multiplication is an integral part of the modern computing workload. While sensitive, general matrix multiplication (GEMM) can no longer be ignored, especially when hardware that runs ML workloads is being evaluated. In this paper, the sensitivity of GEMM workloads to libraries and compiler optimizations is studied. While it may be inevitable to use matmul kernels as a benchmark to understand the performance of accelerators for machine learning, understanding their sensitivity to compiler optimizations and software libraries can help to optimize and interpret the results appropriately. We observe more than 9000× variation in CPU runtimes and around 84× variation in GPU runtimes depending on the optimizations used.
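The sensitivity in question is easy to demonstrate: the same arithmetic, expressed with a different loop order, has very different memory behavior even before a compiler or library touches it. A minimal illustration (ours, not from the paper):

```python
def naive_matmul(A, B):
    """Textbook i-j-k triple loop: the unoptimized baseline whose runtime
    varies by orders of magnitude under compiler and library optimization."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            s = 0.0
            for t in range(k):
                s += A[i][t] * B[t][j]   # strided walk down B's columns
            C[i][j] = s
    return C

def reordered_matmul(A, B):
    """Same arithmetic after the classic i-k-j loop interchange, one of the
    transformations that makes GEMM timings so sensitive: the inner loop now
    streams along rows of B and C with unit stride."""
    n, k, m = len(A), len(B), len(B[0])
    C = [[0.0] * m for _ in range(n)]
    for i in range(n):
        Ai, Ci = A[i], C[i]
        for t in range(k):
            a, Bt = Ai[t], B[t]
            for j in range(m):
                Ci[j] += a * Bt[j]
    return C
```

Both produce identical results; tiling, vectorization, and tuned BLAS kernels stack further multiplicative speedups on top of interchanges like this, which is where the reported 9000× CPU spread comes from.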
{"title":"GEMM the New Gem: The Inevitable Kernel and its Sensitivity to Compiler Optimizations and Libraries","authors":"Siyuan Ma;Bagus Hanindhito;Anushka Subramanian;Lizy K. John","doi":"10.1109/LCA.2026.3656423","DOIUrl":"https://doi.org/10.1109/LCA.2026.3656423","url":null,"abstract":"When the SPEC benchmark suite was first assembled in 1989, matrix multiplication code <italic>matrix300</i> was one of the 10 programs in the suite, but it was discarded within 2-3 years due to the high sensitivity of matrix multiplication to compiler optimizations. However, with the advent of machine learning (ML), neural networks and generative AI (GenAI), matrix multiplication is an integral part of the modern computing workload. While sensitive, general matrix multiplication (GEMM) cannot be ignored anymore, especially if hardware that runs ML workloads is being evaluated. In this paper the sensitivity of GEMM workloads to libraries and compiler optimizations is studied. While it may be inevitable to use matmul kernels as a benchmark to understand the performance of accelerators for machine learning, understanding the sensitivity to compiler optimizations and software libraries can help to optimize and interpret the results appropriately. We observe more than 9000× variation in CPU runtimes and around 84× variation in GPU run times depending on the optimizations used.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"61-64"},"PeriodicalIF":1.4,"publicationDate":"2026-01-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"147299710","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2026-01-14, DOI: 10.1109/LCA.2026.3654119
Seonggyu Han;Sangwoong Kim;Minho Kim;Daehoon Kim
Modern tiered-memory architectures are increasingly adopted in cloud servers with extensive physical capacity. Realizing their full performance potential, however, requires effective page management. Existing systems, tuned for long-running workloads, primarily rely on access-count-based promotion. Yet this policy is ill-suited to the short-lived, event-driven model of Function-as-a-Service (FaaS) workloads. The resulting workload-architecture mismatch yields poor page placement and severely degrades architectural efficiency. We present Hisui, a tiered-memory management system tailored to FaaS workloads. It stages pages with high expected reuse using two mechanisms: an FMem admission filter and an invocation-frequency-weighted valuation that promotes pages in descending order of gain. Hisui delivers up to 1.57× higher throughput than access-count baselines and consistently lowers latency on real workloads.
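The frequency-weighted valuation can be sketched as a greedy knapsack-style pass. The scoring formula and field names below are our assumptions, chosen to illustrate "promote by descending gain" rather than to reproduce Hisui's exact valuation:

```python
def promotion_order(pages, fmem_budget):
    """Toy Hisui-style promotion: weight each page's expected reuse by the
    invocation frequency of its owning function, then promote pages in
    descending order of that gain until the fast-memory budget runs out."""
    scored = [(p["reuse"] * p["invoc_freq"], p["id"], p["size"]) for p in pages]
    scored.sort(reverse=True)                 # highest gain first
    promoted, used = [], 0
    for gain, page_id, size in scored:
        if used + size <= fmem_budget:        # greedy fill of fast memory
            promoted.append(page_id)
            used += size
    return promoted
```

Weighting by invocation frequency is what distinguishes this from an access-count policy: a page touched rarely per invocation but belonging to a hot function can still outrank a heavily accessed page of a cold function.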
{"title":"Hisui: Unlocking Tiered Memory Efficiency for FaaS Workloads","authors":"Seonggyu Han;Sangwoong Kim;Minho Kim;Daehoon Kim","doi":"10.1109/LCA.2026.3654119","DOIUrl":"https://doi.org/10.1109/LCA.2026.3654119","url":null,"abstract":"Modern tiered-memory architectures are increasingly adopted in cloud servers with extensive physical capacity. Realizing their full performance potential, however, requires effective page management. Existing systems, tuned for long-running workloads, primarily rely on access-count-based promotion. Yet this policy is ill-suited to the short-lived, event-driven model of Function-as-a-Service (FaaS) workloads. The resulting workload-architecture mismatch yields poor page placement and severely degrades architectural efficiency. We present Hisui, a FaaS-aware tiered-memory management system tailored to FaaS workloads. It stages pages with high expected reuse using two mechanisms: an FMem admission filter and an invocation-frequency–weighted valuation that promotes pages by descending gain. Hisui delivers up to 1.57× higher throughput than access-count baselines and consistently lowers latency on real workloads.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"41-44"},"PeriodicalIF":1.4,"publicationDate":"2026-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146175937","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Composable chiplet-based architectures adopt a two-stage design flow, first chiplet design and then modular integration, while still presenting a shared-memory, single-OS system view. However, the heterogeneous and modular nature of the resulting network may introduce performance inefficiencies and functional correctness issues, calling for an advanced simulation tool. In this paper, we introduce UniCNet, an open-source, unified, cycle-accurate network simulator designed for composable chiplet-based architectures. To achieve both accurate modeling and unified simulation, UniCNet employs a design-integration workflow that closely aligns with the composable chiplet design flow. It supports several key features oriented toward composable chiplet scenarios and introduces the first cycle-level chiplet protocol interface model. UniCNet further supports multi-threaded simulation, achieving up to 4× speedup with no loss of cycle accuracy. We validate UniCNet against RTL models and demonstrate its utility via several case studies.
{"title":"UniCNet: Unified Cycle-Accurate Simulation for Composable Chiplet Network With Modular Design-Integration Workflow","authors":"Peilin Wang;Mingyu Wang;Zhirong Ye;Tao Lu;Zhiyi Yu","doi":"10.1109/LCA.2026.3653809","DOIUrl":"https://doi.org/10.1109/LCA.2026.3653809","url":null,"abstract":"Composable chiplet-based architectures adopt a two-stage design flow: first chiplet design, then modular integration, while still presenting a shared-memory, single-OS system view. However, the heterogeneous and modular nature of the resulting network may introduce performance inefficiencies and functional correctness issues, calling for an advanced simulation tool. In this paper, we introduce UniCNet, a unified and cycle-accurate network simulator designed for composable chiplet-based architectures, which is open-sourced. To achieve both accurate modeling and unified simulation, UniCNet employs a <italic>design-integration</i> workflow that closely aligns with the composable chiplet design flow. It supports several key features oriented towards composable chiplet scenarios and introduces the first cycle-level chiplet protocol interface model. UniCNet further supports multi-threaded simulation, achieving up to 4× speedup with no cycle accuracy loss. We validate UniCNet against RTL models and demonstrate its utility via several case studies.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"25 1","pages":"37-40"},"PeriodicalIF":1.4,"publicationDate":"2026-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146082129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}