
Latest Publications in IEEE Computer Architecture Letters

WoperTM: Got Nacks? Use Them!
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-28 | DOI: 10.1109/LCA.2025.3565199
Víctor Nicolás-Conesa;Rubén Titos-Gil;Ricardo Fernández-Pascual;Manuel E. Acacio;Alberto Ros
The simplicity of requester-wins has made it the preferred choice for conflict resolution in commercial implementations of Hardware Transactional Memory (HTM), which have typically relied on conventional locking to escape from conflict-induced livelocks. Prior work advocates for combining requester-wins and requester-loses to ensure progress for higher-priority transactions, yet it fails to take full advantage of the available features, namely, protocol support for nacks. This paper introduces WoperTM, a dual-policy, best-effort HTM design that resolves conflicts using a requester-loses policy in the common case. Our key insight is that, since nacks are required to support priorities in HTM, performance can be improved at nearly no extra cost by allowing regular transactions to benefit from requester-loses, instead of only those involving a high-priority transaction. Experimental results using gem5 and STAMP show that WoperTM can significantly reduce squashed work and improve execution times by 12% with respect to power transactions, with negligible hardware overhead.
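To make the two policies concrete, here is a toy Python sketch, ours rather than the paper's gem5 implementation, contrasting the wasted work under requester-wins (the owner's in-flight work is squashed) with requester-loses (the requester is nacked and retries):

```python
import random

# Toy model of HTM conflict resolution; a sketch, not the paper's gem5 setup.
# On each conflict between a transaction that owns a cache line and one that
# requests it: requester-wins aborts the owner (its in-flight work is lost),
# while requester-loses nacks the requester, which stalls and retries.

def simulate(policy, conflicts=1000, seed=0):
    rng = random.Random(seed)
    squashed = 0  # work units discarded by aborts
    stalls = 0    # retry rounds spent waiting on nacks
    for _ in range(conflicts):
        owner_work = rng.randint(1, 100)  # progress the owner has made
        if policy == "requester-wins":
            squashed += owner_work        # owner aborts; work is re-executed
        else:                             # requester-loses
            stalls += 1                   # requester backs off for one round
    return squashed, stalls

for policy in ("requester-wins", "requester-loses"):
    squashed, stalls = simulate(policy)
    print(f"{policy:16} squashed work = {squashed:6}, nack stalls = {stalls}")
```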
Citations: 0
Cache and Near-Data Co-Design for Chiplets
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-25 | DOI: 10.1109/LCA.2025.3564535
Arteen Abrishami;Zhengrong Wang;Tony Nowatzki
Vendors are increasingly adopting chiplet-based designs to manage cost for large-scale multi-cores. While near-data computing, a paradigm involving offloading computation near where data is located in memory, has been studied in the context of monolithic chip designs, its applications to chiplets remain unexplored. In this letter, we explore how the paradigm extends to chiplets in a system where computation is offloaded to accelerators collocated within the last-level-cache structure. We explore both shared and private last-level-cache designs across a variety of different workloads, both large-scale graph computations and more regular-access workloads, in order to understand how to optimize the cache and topology design for near-data workloads. We find that with a mesh chiplet architecture with shared last-level-cache (LLC), near-data optimization can achieve an 8.70× speedup on graph workloads, providing an even greater benefit than in traditional systems.
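As a back-of-the-envelope illustration of the offload trade-off, the toy Python model below, with assumed byte counts rather than the letter's gem5 methodology, compares interposer traffic when operands stream to a core versus when a small command is sent to an accelerator beside the LLC slice:

```python
# Toy interposer-traffic model for near-data offload on chiplets; the byte
# counts are illustrative assumptions, not the letter's gem5 measurements.

def bytes_moved(num_elems, elem_bytes=8, offload=False):
    """Bytes crossing the interposer for a reduction over num_elems values."""
    if offload:
        return 64 + elem_bytes        # one command packet plus a scalar result
    return num_elems * elem_bytes     # stream every operand to the core

for n in (8, 1024, 1 << 20):
    print(f"n={n:>8}: core-side = {bytes_moved(n):>10} B, "
          f"near-LLC = {bytes_moved(n, offload=True)} B")
```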
Citations: 0
In-Memory Computing Accelerator for Iterative Linear Algebra Solvers
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-22 | DOI: 10.1109/LCA.2025.3563365
Rui Liu;Zerun Li;Xiaoyu Zhang;Xiaoming Chen;Yinhe Han;Minghua Tang
Iterative linear solvers are a crucial kernel in many numerical analysis problems. The performance and energy efficiency of iterative solvers based on traditional architectures are severely constrained by the memory wall bottleneck. Computing-in-memory (CIM) has the potential to enhance solving efficiency. Existing CIM architectures are mostly customized for specific algorithms and primarily focus on handling fixed-point operations, which makes them difficult to meet the demands of diverse and high-precision applications. In this work, we propose a CIM architecture that natively supports various iterative linear solvers based on floating-point operations. We develop a new instruction set for the accelerator, which can be flexibly combined to implement various iterative solvers. The evaluation results show that, compared with the GPU implementation, our accelerator achieves more than 10.1× speedup and 6.8× energy savings when executing different iterative solvers.
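For reference, the kind of floating-point iterative kernel such an instruction set must compose is a sweep like Jacobi's; the NumPy sketch below is our illustration, not the accelerator's code:

```python
import numpy as np

# Jacobi iteration for Ax = b (A diagonally dominant): the style of
# floating-point iterative solver the proposed CIM instructions target.
# This NumPy sketch is illustrative, not the accelerator implementation.

def jacobi(A, b, tol=1e-10, max_iters=10_000):
    D = np.diag(A)               # diagonal entries
    R = A - np.diagflat(D)       # off-diagonal remainder
    x = np.zeros_like(b)
    for _ in range(max_iters):
        x_new = (b - R @ x) / D  # one sweep: matrix-vector product + divide
        if np.linalg.norm(x_new - x, ord=np.inf) < tol:
            break
        x = x_new
    return x_new

A = np.array([[4.0, 1.0], [2.0, 5.0]])
b = np.array([1.0, 2.0])
x = jacobi(A, b)
print("x =", x, "residual =", np.linalg.norm(A @ x - b))
```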
Citations: 0
Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-21 | DOI: 10.1109/LCA.2025.3563105
Aalaa M.A. Babai;Koji Inoue
Low-power volatile FPGAs (VFPGAs) naturally meet the intertwined processing and flexibility demands of IoT devices. However, as IoT devices shift toward Energy Harvesting (EH) for self-sustained operation, VFPGAs are overlooked because they struggle under harvested power. Their volatile SRAM configuration memory cells frequently lose their data, causing high reconfiguration penalties. These penalties grow with FPGA resource usage, limiting how much of the fabric can be used under EH. Still, advances in low-power FPGAs and energy-buffering systems' efficiency motivate us to explore EH-powered FPGAs. Thus, we analyze the interplay of their resources, performance, and reconfiguration; simulate their operation under different EH conditions; and show how they can be utilized up to an application- and EH-dependent threshold.
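A minimal model of that penalty scaling, with assumed rather than measured numbers, is sketched below:

```python
# Minimal model of the reconfiguration penalty under energy harvesting: each
# power loss wipes the volatile configuration SRAM, and the reload time grows
# with the fraction of FPGA resources in use. All numbers are assumptions.

def useful_fraction(burst_s, full_reconfig_s, resource_usage):
    """Share of each harvested-energy burst left for real work."""
    reconfig_s = full_reconfig_s * resource_usage  # penalty scales with usage
    return max(0.0, (burst_s - reconfig_s) / burst_s)

for usage in (0.1, 0.25, 0.5, 0.75, 1.0):
    frac = useful_fraction(burst_s=0.5, full_reconfig_s=0.2,
                           resource_usage=usage)
    print(f"resource usage {usage:4.0%} -> useful time {frac:6.1%}")
```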
Citations: 0
Exploring the DIMM PIM Architecture for Accelerating Time Series Analysis
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-18 | DOI: 10.1109/LCA.2025.3562431
Shunchen Shi;Fan Yang;Zhichun Li;Xueqi Li;Ninghui Sun
Time series analysis (TSA) is an important technique for extracting information from domain data. TSA is memory-bound on conventional platforms due to excessive off-chip data movements between processing units and the main memory of the system. Processing in memory (PIM) is a paradigm that alleviates the bottleneck of memory access for data-intensive applications by enabling computation to be performed directly within memory. In this paper, we first perform profiling to characterize TSA on conventional CPUs. Then, we implement TSA on real-world commercial DRAM Dual-Inline Memory Module (DIMM) PIM hardware UPMEM and identify computation as the primary bottleneck on PIM. Finally, we evaluate the impact of enhancing the computational capability of current DIMM PIM hardware on accelerating TSA. Overall, our work provides insights for designing the optimized DIMM PIM architecture for high-performance and efficient time series analysis.
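A representative memory-bound TSA kernel is the distance profile used in matrix-profile-style analysis, in which every window of the series streams past one query; the NumPy sketch below illustrates the access pattern and is not the letter's UPMEM code:

```python
import numpy as np

# Distance profile: z-normalized distance from one query window to every
# window of a series -- a representative memory-bound TSA kernel. This is an
# illustrative sketch, not the letter's UPMEM implementation.

def znorm(x):
    s = x.std()
    return (x - x.mean()) / s if s > 0 else np.zeros_like(x)

def distance_profile(series, query):
    m = len(query)
    q = znorm(query)
    out = np.empty(len(series) - m + 1)
    for i in range(len(out)):            # streams the full series: each
        w = znorm(series[i:i + m])       # element is re-read for overlapping
        out[i] = np.linalg.norm(w - q)   # windows, so bandwidth dominates
    return out

rng = np.random.default_rng(0)
ts = rng.standard_normal(2_000)
print("best match at index", distance_profile(ts, ts[100:132]).argmin())
```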
Citations: 0
Segin: Synergistically Enabling Fine-Grained Multi-Tenant and Resource Optimized SpMV
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-17 | DOI: 10.1109/LCA.2025.3562120
Helya Hosseini;Ubaid Bakhtiar;Donghyeon Joo;Bahar Asgari
Sparse matrix-vector multiplication (SpMV) is a critical operation across numerous application domains. As a memory-bound kernel, SpMV does not require a complex compute engine but still needs efficient use of available compute units to achieve peak performance efficiently. However, sparsity causes resource underutilization. To efficiently run SpMV, we propose Segin, which leverages a novel fine-grained multi-tenancy, allowing multiple SpMV operations to be executed simultaneously on a single hardware with minimal modifications, which in turn improves throughput. To achieve this, Segin employs hierarchical bitmaps, and hence only a lightweight logic circuit, to quickly and efficiently identify optimal pairs of sparse matrices to overlap. Our evaluations demonstrate that Segin can improve throughput by 1.92×, while enhancing resource utilization.
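The bitmap idea can be sketched as follows: summarize each matrix's nonzero structure as a coarse tile-occupancy bitmap and co-schedule the pair whose bitmaps overlap least. The tile size and the exhaustive pairing below are our assumptions, not Segin's hierarchical circuit:

```python
from itertools import combinations

# Sketch of bitmap-guided tenant pairing: one bit per tile of the matrix,
# set if the tile holds any nonzero; the least-overlapping pair of matrices
# can share compute lanes. Illustrative only, not Segin's hardware.

def occupancy_bitmap(cols, nnz_coords, tile=4):
    """One bit per (tile x tile) block: set if the block holds any nonzero."""
    tiles_per_row = (cols + tile - 1) // tile
    bits = 0
    for r, c in nnz_coords:
        bits |= 1 << ((r // tile) * tiles_per_row + c // tile)
    return bits

matrices = {
    "A": [(0, 0), (1, 1), (2, 2)],  # nonzeros clustered top-left
    "B": [(0, 7), (1, 6), (2, 5)],  # nonzeros clustered top-right
    "C": [(0, 0), (0, 1), (1, 0)],  # clashes with A's tiles
}
bitmaps = {k: occupancy_bitmap(8, v) for k, v in matrices.items()}

best = min(combinations(bitmaps, 2),
           key=lambda p: bin(bitmaps[p[0]] & bitmaps[p[1]]).count("1"))
print("least-overlapping pair to co-schedule:", best)
```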
Citations: 0
MixDiT: Accelerating Image Diffusion Transformer Inference With Mixed-Precision MX Quantization
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-15 | DOI: 10.1109/LCA.2025.3560786
Daeun Kim;Jinwoo Hwang;Changhun Oh;Jongse Park
Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produce mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10–5.32× over RTX 3090, with no loss in FID.
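The core quantization step can be sketched with plain per-tensor int8 scaling standing in for the MX block formats; the threshold and formats below are illustrative assumptions, not the paper's design:

```python
import numpy as np

# Magnitude-based mixed-precision quantization sketch: quantize the bulk of
# an activation tensor to int8 and keep the largest-magnitude outliers in
# full precision. Per-tensor int8 scaling stands in for MX block formats;
# the outlier fraction is an illustrative assumption.

def mixed_quantize(x, outlier_frac=0.01):
    cut = np.quantile(np.abs(x), 1.0 - outlier_frac)  # magnitude threshold
    is_outlier = np.abs(x) >= cut
    scale = cut / 127.0                               # int8 range for inliers
    q = np.clip(np.round(x / scale), -127, 127)       # low-precision bulk
    dq = q * scale
    dq[is_outlier] = x[is_outlier]                    # outliers stay exact
    return dq, is_outlier.mean()

rng = np.random.default_rng(1)
acts = rng.standard_normal(100_000)
acts[rng.random(100_000) < 0.001] *= 50               # inject rare outliers
dq, frac = mixed_quantize(acts)
print(f"outliers kept: {frac:.2%}, max abs error: {np.abs(dq - acts).max():.4f}")
```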
Citations: 0
L-DTC: Load-based Dynamic Throughput Control for Guaranteed I/O Performance in Virtualized Environments
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-10 | DOI: 10.1109/LCA.2025.3559454
TaeHoon Kim;Jaechun No
In this letter, we identified an issue where the I/O performance of specific tasks could not be guaranteed during multi-process I/O operations, despite the use of the latest storage technologies in virtualized environments. To address this issue, we propose L-DTC, a novel Load-based Dynamic Throughput Control technique, designed to achieve the guaranteed I/O performance in virtualized environments. Operating at the hypervisor level, L-DTC provides fine-grained throughput control based on I/O queues and ensures the independent I/O performance for each process by allowing users to define maximum and minimum throughput levels for each queue. We conducted an evaluation of L-DTC and confirmed that it successfully guarantees the I/O performance requirements of specific processes in multi-process environments. Furthermore, L-DTC achieved more stable I/O performance compared to existing methods, with improvements in I/O performance of up to 2.1 times, regardless of the I/O scheduler.
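One way to realize per-queue minimum/maximum throughput shaping is a token bucket per I/O queue; the Python sketch below is an assumption-laden illustration, not L-DTC's hypervisor code:

```python
import time

# Per-queue token-bucket sketch of min/max throughput control: each queue is
# refilled at its guaranteed floor plus any spare bandwidth, capped at its
# ceiling. Names and the refill policy are our assumptions, not L-DTC's
# hypervisor-level implementation.

class QueueLimiter:
    def __init__(self, min_mbps, max_mbps):
        self.min_mbps, self.max_mbps = min_mbps, max_mbps
        self.tokens = 0.0                 # budget in megabits
        self.last = time.monotonic()

    def refill(self, spare_mbps=0.0):
        now = time.monotonic()
        rate = min(self.min_mbps + spare_mbps, self.max_mbps)
        self.tokens = min(self.tokens + rate * (now - self.last),
                          self.max_mbps)  # cap bursts at one second of ceiling
        self.last = now

    def admit(self, request_mb):
        """Admit an I/O only while the queue has budget left."""
        need = request_mb * 8             # MB -> megabits
        if self.tokens >= need:
            self.tokens -= need
            return True
        return False

q = QueueLimiter(min_mbps=100, max_mbps=400)
q.tokens = 200                            # pretend one refill interval passed
print([q.admit(4) for _ in range(8)])     # 32 Mb each: six admits, then False
```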
Citations: 0
Pyramid: Accelerating LLM Inference With Cross-Level Processing-in-Memory
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-10 | DOI: 10.1109/LCA.2025.3559738
Liang Yan;Xiaoyang Lu;Xiaoming Chen;Yinhe Han;Xian-He Sun
Integrating processing-in-memory (PIM) with GPUs accelerates large language model (LLM) inference, but existing GPU-PIM systems encounter several challenges. While GPUs excel in large general matrix-matrix multiplications (GEMM), they struggle with small-scale operations better suited for PIM, which currently cannot handle them independently. Additionally, the computational demands of activation operations exceed the capabilities of current PIM technologies, leading to excessive data movement between the GPU and memory. PIM's potential for general matrix-vector multiplications (GEMV) is also limited by insufficient support for fine-grained parallelism. To address these issues, we propose Pyramid, a novel GPU-PIM system that optimizes PIM for LLM inference by strategically allocating cross-level computational resources within PIM to meet diverse needs and leveraging the strengths of both technologies. Evaluation results demonstrate that Pyramid outperforms existing systems like NeuPIM, AiM, and AttAcc by factors of 2.31×, 1.91×, and 1.72×, respectively.
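The division of labor the letter argues for can be caricatured as a shape-based dispatch rule, with the width threshold below being our illustrative assumption rather than Pyramid's scheduler:

```python
# Shape-based dispatch caricature of the GPU/PIM split: wide GEMMs are
# compute-bound and suit the GPU, skinny GEMV-like products are bandwidth-
# bound and suit PIM. The threshold is an illustrative assumption, not
# Pyramid's actual cross-level allocation policy.

def dispatch(m, n, k, pim_max_width=8):
    """Choose a target for the (m x k) @ (k x n) product."""
    return "PIM" if n <= pim_max_width else "GPU"

for shape in [(4096, 1, 4096),        # decode-step GEMV
              (4096, 2048, 4096)]:    # prefill GEMM
    print(shape, "->", dispatch(*shape))
```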
Citations: 0
Memory-Centric MCM-GPU Architecture
IF 1.4 | CAS Tier 3, Computer Science | Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE | Pub Date: 2025-04-09 | DOI: 10.1109/LCA.2025.3553766
Hossein SeyyedAghaei;Mahmood Naderan-Tahan;Magnus Jahre;Lieven Eeckhout
The demand for powerful GPUs continues to grow, driven by modern-day applications that require ever increasing computational power and memory bandwidth. Multi-Chip Module (MCM) GPUs provide the scalability potential by integrating GPU chiplets on an interposer substrate; however, they are hindered by their GPU-centric design, i.e., off-chip GPU bandwidth is statically (at design time) allocated to local versus remote memory accesses. This paper presents the memory-centric MCM-GPU architecture. By connecting the HBM stacks on the interposer, rather than the GPUs, and by connecting the GPUs to bridges on the interposer network, the full off-chip GPU bandwidth can be dynamically allocated to local and remote memory accesses. Preliminary results demonstrate the potential of the memory-centric architecture offering an average 1.36× (and up to 1.90×) performance improvement over a GPU-centric architecture.
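A toy bandwidth model makes the argument concrete: under a static split each traffic class is capped by its own share of the links, whereas the memory-centric design serves any local/remote mix from one pool. All numbers below are illustrative, not the paper's simulation results:

```python
# Toy off-chip bandwidth model: GPU-centric statically splits each GPU's
# links between local and remote traffic; memory-centric serves any mix from
# the full link budget. The 50/50 static split and all numbers are assumptions.

def throughput(demand_gbps, local_frac, links_gbps=1000, static=True):
    if not static:                           # memory-centric: one shared pool
        return min(demand_gbps, links_gbps)
    half = links_gbps / 2                    # GPU-centric: dedicated halves
    t = links_gbps
    if local_frac > 0:
        t = min(t, half / local_frac)        # local share saturates first ...
    if local_frac < 1:
        t = min(t, half / (1 - local_frac))  # ... or the remote share does
    return min(demand_gbps, t)

for f in (0.5, 0.8, 0.95):
    print(f"local={f:.0%}: static={throughput(1200, f):6.1f} Gb/s, "
          f"dynamic={throughput(1200, f, static=False):6.1f} Gb/s")
```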
Citations: 0