
Latest Publications in IEEE Computer Architecture Letters

Stardust: Scalable and Transferable Workload Mapping for Large AI on Multi-Chiplet Systems
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-06-17 · DOI: 10.1109/LCA.2025.3580562 · Vol. 24, No. 2, pp. 201-204
Wencheng Zou;Feiyun Zhao;Nan Wu
Workload partitioning and mapping are critical to optimizing performance in multi-chiplet systems. However, existing approaches struggle with scalability in large search spaces and lack transferability across different workloads. To overcome these limitations, we propose Stardust, a scalable and transferable workload-mapping framework for multi-chiplet systems. Stardust combines learnable graph clustering to downscale computation graphs for efficient partitioning, topology-masked attention to capture structural information, and deep reinforcement learning (DRL) for optimized workload mapping. Evaluations on production-scale AI models show that (1) Stardust-generated mappings significantly outperform commonly used heuristics in throughput, and (2) fine-tuning a pre-trained Stardust model improves sample efficiency by up to 15× compared to training from scratch.
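
The letter includes no code, but the partition-then-map flow it describes can be sketched. Below is a minimal, illustrative Python sketch: a toy computation graph is coarsened by greedy edge clustering, and the resulting clusters are placed on chiplets so the heaviest traffic stays local. The clustering and placement heuristics are stand-ins for Stardust's learned components (graph clustering, topology-masked attention, DRL), which are not reproduced here; all names and numbers are illustrative.

```python
from collections import defaultdict

def cluster_graph(edges, num_clusters):
    """Coarsen the computation graph: greedily merge the heaviest edges."""
    parent = {}
    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x
    nodes = {u for e in edges for u in e[:2]}
    merges = len(nodes) - num_clusters
    for u, v, _ in sorted(edges, key=lambda e: -e[2]):
        if merges <= 0:
            break
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            merges -= 1
    return {n: find(n) for n in nodes}

def map_clusters(edges, assign, num_chiplets):
    """Place clusters so the heaviest inter-cluster flows are handled first."""
    traffic = defaultdict(float)
    for u, v, w in edges:
        if assign[u] != assign[v]:
            traffic[(assign[u], assign[v])] += w
    placement, next_chip = {}, 0
    for pair, _ in sorted(traffic.items(), key=lambda kv: -kv[1]):
        for c in pair:
            if c not in placement:
                placement[c] = next_chip % num_chiplets
                next_chip += 1
    for c in set(assign.values()):          # isolated clusters go anywhere
        placement.setdefault(c, next_chip % num_chiplets)
        next_chip += 1
    return placement

# Toy computation graph: (producer, consumer, bytes moved).
g = [("a", "b", 8.0), ("b", "c", 4.0), ("c", "d", 1.0), ("a", "d", 2.0)]
assign = cluster_graph(g, num_clusters=2)
print(assign, map_clusters(g, assign, num_chiplets=2))
```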
Citations: 0
pNet-gem5: Full-System Simulation With High-Performance Networking Enabled by Parallel Network Packet Processing
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-06-06 · DOI: 10.1109/LCA.2025.3577232 · Vol. 24, No. 2, pp. 193-196
Jongmin Shin;Seongtae Bang;Gyeongseo Park;Daehoon Kim
Modern server processors in data centers equipped with high-performance networking technologies (e.g., 100 Gigabit Ethernet) commonly support parallel packet processing via multi-queue NICs, enabling multiple cores to efficiently handle massive traffic loads. However, existing architectural simulators such as gem5 lack support for these techniques and suffer from limited bandwidth due to outdated networking models. Although a recent study introduced a simulation framework supporting userspace high-performance networking via the Data Plane Development Kit (DPDK), many applications still rely on kernel-based networking. To address these limitations, we present pNet-gem5, a full-system simulation framework designed to model server systems under high-performance network workloads, targeting data center architecture research. pNet-gem5 extends gem5 by supporting parallel packet processing on multi-core systems through the integration of multiple hardware queues and a more advanced interrupt mechanism—Message Signaled Interrupts (MSI)—which allows each NIC queue to be mapped to a dedicated core with its own IRQ. It also provides a high-performance network interface and device driver that support scalable and configurable packet distribution between hardware and software. Moreover, by decoupling packet distribution and scheduling from NIC core logic, pNet-gem5 enables flexible experimentation with custom policies. As a result, pNet-gem5 enables more realistic simulation of modern server environments by modeling multi-queue NICs and supporting bandwidths up to 46 Gbps—a significant improvement over the previous limit of only a few Gbps and more closely aligned with today’s tens-of-Gbps networks.
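
To make the modeled receive path concrete, here is a minimal Python sketch of multi-queue packet distribution: a flow hash spreads packets over NIC queues, and each queue would raise its own MSI vector toward its dedicated core. CRC32 stands in for the Toeplitz hash that real RSS uses, and the packet fields are illustrative; this is not pNet-gem5's actual code.

```python
import zlib
from collections import namedtuple

Packet = namedtuple("Packet", "src_ip dst_ip src_port dst_port")

NUM_QUEUES = 4  # one hardware queue (and MSI vector) per core

def rss_queue(pkt: Packet) -> int:
    """Flow hashing over the 4-tuple, approximated here with CRC32."""
    key = f"{pkt.src_ip}:{pkt.src_port}-{pkt.dst_ip}:{pkt.dst_port}".encode()
    return zlib.crc32(key) % NUM_QUEUES

queues = [[] for _ in range(NUM_QUEUES)]
for p in [Packet("10.0.0.1", "10.0.0.2", 1000 + i, 80) for i in range(8)]:
    q = rss_queue(p)
    queues[q].append(p)   # enqueue into the per-core ring
    # in the simulator, this is where queue q's MSI fires at its core

print([len(q) for q in queues])
```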
Citations: 0
The Architectural Sustainability Indicator
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-06-05 · DOI: 10.1109/LCA.2025.3576891 · Vol. 24, No. 2, pp. 205-208
Jaime Roelandts;Ajeya Naithani;Lieven Eeckhout
Computing devices are responsible for a significant fraction of the world’s total carbon footprint. Designing sustainable systems is a challenging endeavor because of the huge design space, the complex objective function, and the inherent data uncertainty. To make matters worse, a design that seems sustainable at first might turn out not to be once rebound effects are taken into account. In this paper, we propose the Architectural Sustainability Indicator (ASI), a novel metric that assesses the sustainability of a given design and determines whether it is strongly sustainable, weakly sustainable, or unsustainable. ASI provides insight and hints for turning unsustainable and weakly sustainable design points into strongly sustainable ones that are robust against potential rebound effects. A case study illustrates how ASI steers Scalar Vector Runahead, a weakly sustainable hardware prefetching technique, into a strongly sustainable one while offering a 3.2× performance boost.
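
The page does not reproduce ASI's definition, so the sketch below is purely hypothetical: it only illustrates the three-way verdict the abstract describes, using an assumed score (positive means net carbon saved versus a baseline) and an assumed margin that guards against rebound effects. The function name, threshold structure, and numbers are all illustrative.

```python
def classify(asi: float, rebound_margin: float = 0.2) -> str:
    """Hypothetical reading of an ASI score: > 0 means net carbon saved."""
    if asi > rebound_margin:
        return "strongly sustainable"   # savings survive plausible rebound
    if asi > 0.0:
        return "weakly sustainable"     # rebound effects could erase savings
    return "unsustainable"

# e.g. a design that cuts operational carbon by 30% of the baseline
# while adding 5% embodied carbon:
print(classify(0.30 - 0.05))            # -> "strongly sustainable"
```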
Citations: 0
WoperTM: Got Nacks? Use Them!
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-28 · DOI: 10.1109/LCA.2025.3565199 · Vol. 24, No. 1, pp. 157-160
Víctor Nicolás-Conesa;Rubén Titos-Gil;Ricardo Fernández-Pascual;Manuel E. Acacio;Alberto Ros
The simplicity of requester-wins has made it the preferred choice for conflict resolution in commercial implementations of Hardware Transactional Memory (HTM), which typically have relied on conventional locking to escape from conflict-induced livelocks. Prior work advocates for combining requester-wins and requester-loses to ensure progress for higher-priority transactions, yet it fails to take full advantage of the available features, namely, protocol support for nacks. This paper introduces WoperTM, a dual-policy, best-effort HTM design that resolves conflicts using requester-loses policy in the common case. Our key insight is that, since nacks are required to support priorities in HTM, performance can be improved at nearly no extra cost by allowing regular transactions to benefit from requester-loses, instead of only those involving a high-priority transaction. Experimental results using gem5 and STAMP show that WoperTM can significantly reduce squashed work and improve execution times by 12% with respect to power transactions, with negligible hardware overhead.
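
A minimal Python sketch of the dual-policy idea follows: conflicting requests are nacked (requester-loses) in the common case, priority requests still win immediately, and a bounded retry count falls back to requester-wins to avoid livelock. The transaction fields, retry bound, and return strings are illustrative assumptions, not the letter's hardware protocol.

```python
class Tx:
    def __init__(self, tid, priority=0):
        self.tid, self.priority, self.retries = tid, priority, 0

MAX_NACKS = 8  # assumed bound before falling back to requester-wins

def resolve(requester: Tx, holder: Tx) -> str:
    """Action taken on a coherence request that conflicts with `holder`."""
    if requester.priority > holder.priority:
        return "abort holder"            # priority transactions win at once
    if requester.retries < MAX_NACKS:
        requester.retries += 1
        return "nack requester"          # common case: requester-loses
    return "abort holder"                # escape hatch avoids livelock

r, h = Tx("T1"), Tx("T2")
for _ in range(10):
    print(resolve(r, h))                 # 8 nacks, then requester-wins
```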
Citations: 0
Cache and Near-Data Co-Design for Chiplets
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-25 · DOI: 10.1109/LCA.2025.3564535 · Vol. 24, No. 1, pp. 149-152
Arteen Abrishami;Zhengrong Wang;Tony Nowatzki
Vendors are increasingly adopting chiplet-based designs to manage cost for large-scale multi-cores. While near-data computing, a paradigm that offloads computation close to where data resides in memory, has been studied in the context of monolithic chip designs, its application to chiplets remains unexplored. In this letter, we explore how the paradigm extends to chiplets in a system where computation is offloaded to accelerators collocated within the last-level-cache structure. We explore both shared and private last-level-cache designs across a variety of workloads, from large-scale graph computations to more regular-access kernels, in order to understand how to optimize the cache and topology design for near-data workloads. We find that with a mesh chiplet architecture and a shared last-level cache (LLC), near-data optimization can achieve an 8.70× speedup on graph workloads, providing an even greater benefit than in traditional systems.
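
The shared-versus-private distinction can be made concrete with a small sketch: in the shared organization an address hashes to one LLC slice anywhere on the mesh, so a near-data operation is shipped to that slice's collocated accelerator, while in the private organization the data stays on the requesting chiplet. The hash function and geometry below are illustrative assumptions, not the letter's design.

```python
NUM_CHIPLETS = 4

def home_slice(addr: int, requester: int, shared: bool) -> int:
    """Chiplet whose LLC slice (and collocated accelerator) serves addr."""
    if shared:
        return (addr >> 6) % NUM_CHIPLETS   # cache-line-granular hashing
    return requester                        # private LLC: always local

for addr in (0x1000, 0x1040, 0x1080):
    print(hex(addr),
          "shared ->", home_slice(addr, 0, True),
          "| private ->", home_slice(addr, 0, False))
```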
Citations: 0
In-Memory Computing Accelerator for Iterative Linear Algebra Solvers
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-22 · DOI: 10.1109/LCA.2025.3563365 · Vol. 24, No. 1, pp. 161-164
Rui Liu;Zerun Li;Xiaoyu Zhang;Xiaoming Chen;Yinhe Han;Minghua Tang
Iterative linear solvers are a crucial kernel in many numerical analysis problems. The performance and energy efficiency of iterative solvers on traditional architectures are severely constrained by the memory-wall bottleneck. Computing-in-memory (CIM) has the potential to improve solving efficiency. Existing CIM architectures are mostly customized for specific algorithms and primarily focus on fixed-point operations, making it difficult for them to meet the demands of diverse, high-precision applications. In this work, we propose a CIM architecture that natively supports various iterative linear solvers based on floating-point operations. We develop a new instruction set for the accelerator, whose instructions can be flexibly combined to implement various iterative solvers. The evaluation results show that, compared with a GPU implementation, our accelerator achieves more than 10.1× speedup and 6.8× energy savings when executing different iterative solvers.
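
To show how a solver can be composed from coarse-grained instructions, here is a minimal NumPy sketch of Jacobi iteration built on an assumed `mvm` primitive standing in for the in-memory floating-point matrix-vector instruction. The primitive's name and granularity are assumptions, not the accelerator's published ISA.

```python
import numpy as np

def mvm(A, x):
    """Matrix-vector product; would execute inside the CIM arrays."""
    return A @ x

def jacobi(A, b, iters=50):
    """Jacobi iteration x <- D^-1 (b - R x), composed from primitives."""
    D = np.diag(A)            # diagonal entries
    R = A - np.diag(D)        # off-diagonal remainder
    x = np.zeros_like(b)
    for _ in range(iters):
        x = (b - mvm(R, x)) / D
    return x

A = np.array([[4.0, 1.0], [2.0, 5.0]])   # diagonally dominant -> converges
b = np.array([1.0, 2.0])
print(jacobi(A, b), np.linalg.solve(A, b))
```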
Citations: 0
Exploring Volatile FPGAs Potential for Accelerating Energy-Harvesting IoT Applications
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-21 · DOI: 10.1109/LCA.2025.3563105 · Vol. 24, No. 1, pp. 137-140
Aalaa M.A. Babai;Koji Inoue
Low-power volatile FPGAs (VFPGAs) naturally meet the intertwined processing and flexibility demands of IoT devices. However, as IoT devices shift toward Energy Harvesting (EH) for self-sustained operation, VFPGAs are overlooked because they struggle under harvested power. Their volatile SRAM configuration memory cells frequently lose their data, causing high reconfiguration penalties. These penalties grow with the FPGA’s resource usage, limiting how much of the fabric can be used under EH. Still, advances in low-power FPGAs and in the efficiency of energy-buffering systems motivate us to explore EH-powered FPGAs. Thus, we analyze the interplay of their resources, performance, and reconfiguration; simulate their operation under different EH conditions; and show how they can be utilized up to an application- and EH-dependent threshold.
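
A small, purely illustrative Python model of that threshold behavior follows: each power outage forces a reconfiguration whose energy cost grows with the configured resources, so adding resources helps only while the reconfiguration tax stays below the harvested budget. All constants and the linear cost model are assumptions, not measured values from the letter.

```python
def useful_fraction(usage, outages_per_hour, recfg_j_per_unit=0.5,
                    harvest_j_per_hour=3600.0):
    """Fraction of harvested energy left for computation after reconfigs."""
    recfg_cost = outages_per_hour * usage * recfg_j_per_unit
    return max(0.0, 1.0 - recfg_cost / harvest_j_per_hour)

for usage in (100, 1000, 5000):   # configured resource "units"
    print(usage, round(useful_fraction(usage, outages_per_hour=4), 3))
```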
Citations: 0
Exploring the DIMM PIM Architecture for Accelerating Time Series Analysis
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-18 · DOI: 10.1109/LCA.2025.3562431 · Vol. 24, No. 1, pp. 169-172
Shunchen Shi;Fan Yang;Zhichun Li;Xueqi Li;Ninghui Sun
Time series analysis (TSA) is an important technique for extracting information from domain data. TSA is memory-bound on conventional platforms due to excessive off-chip data movement between processing units and main memory. Processing in memory (PIM) is a paradigm that alleviates the memory-access bottleneck for data-intensive applications by enabling computation to be performed directly within memory. In this paper, we first perform profiling to characterize TSA on conventional CPUs. Then, we implement TSA on UPMEM, a real-world commercial DRAM Dual-Inline Memory Module (DIMM) PIM platform, and identify computation as the primary bottleneck on PIM. Finally, we evaluate the impact of enhancing the computational capability of current DIMM PIM hardware on accelerating TSA. Overall, our work provides insights for designing an optimized DIMM PIM architecture for high-performance, efficient time series analysis.
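
As a concrete illustration of how a TSA kernel maps onto DIMM PIM, the sketch below partitions a naive distance-profile computation (the core of matrix-profile-style TSA) across notional DPUs, each owning an overlapping slice of the series, with a host-side gather at the end. The slicing scheme is illustrative; UPMEM's actual programming interface is not shown.

```python
import numpy as np

def distance_profile(series, query):
    """Naive Euclidean distance of `query` against every subsequence."""
    m = len(query)
    return np.array([np.linalg.norm(series[i:i + m] - query)
                     for i in range(len(series) - m + 1)])

def pim_distance_profile(series, query, num_dpus=4):
    """Split the series across DPUs, overlapping slices by m-1 samples."""
    m, n = len(query), len(series)
    chunk = (n - m + 1 + num_dpus - 1) // num_dpus
    parts = []
    for d in range(num_dpus):                # one slice per DPU
        lo = d * chunk
        hi = min(lo + chunk + m - 1, n)      # overlap so no window is lost
        if lo < n - m + 1:
            parts.append(distance_profile(series[lo:hi], query))
    return np.concatenate(parts)             # host-side gather

s = np.sin(np.linspace(0, 20, 256))
q = s[10:18]
assert np.allclose(pim_distance_profile(s, q), distance_profile(s, q))
print("per-DPU slices agree with the serial profile")
```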
Citations: 0
Segin: Synergistically Enabling Fine-Grained Multi-Tenant and Resource Optimized SpMV
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-17 · DOI: 10.1109/LCA.2025.3562120 · Vol. 24, No. 1, pp. 181-184
Helya Hosseini;Ubaid Bakhtiar;Donghyeon Joo;Bahar Asgari
Sparse matrix-vector multiplication (SpMV) is a critical operation across numerous application domains. As a memory-bound kernel, SpMV does not require a complex compute engine but still requires efficient use of the available compute units to reach peak performance. However, sparsity causes resource underutilization. To run SpMV efficiently, we propose Segin, which leverages a novel fine-grained multi-tenancy, allowing multiple SpMV operations to execute simultaneously on a single piece of hardware with minimal modifications, which in turn improves throughput. To achieve this, Segin employs hierarchical bitmaps, and hence a lightweight logic circuit, to quickly and efficiently identify optimal pairs of sparse matrices to overlap. Our evaluations demonstrate that Segin can improve throughput by 1.92× while enhancing resource utilization.
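
The bitmap-driven pairing can be sketched as follows: a coarse occupancy bitmap records which row blocks of each matrix contain nonzeros, and the pair of matrices whose occupied blocks collide least is chosen to share the hardware. The block granularity and the overlap score are illustrative assumptions, simplified from the hierarchical bitmaps the letter describes.

```python
import numpy as np

def row_bitmap(mat, block=4):
    """1 bit per block of rows: set if any nonzero falls in the block."""
    nnz_rows = np.flatnonzero((mat != 0).any(axis=1))
    bits = np.zeros(-(-mat.shape[0] // block), dtype=bool)
    bits[nnz_rows // block] = True
    return bits

def best_pair(mats):
    """Pick the pair of matrices whose occupied blocks collide the least."""
    best, score = None, None
    for i in range(len(mats)):
        for j in range(i + 1, len(mats)):
            overlap = np.count_nonzero(row_bitmap(mats[i]) &
                                       row_bitmap(mats[j]))
            if score is None or overlap < score:
                best, score = (i, j), overlap
    return best, score

rng = np.random.default_rng(0)
ms = [np.where(rng.random((16, 16)) < 0.1, 1.0, 0.0) for _ in range(3)]
print(best_pair(ms))
```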
Citations: 0
MixDiT: Accelerating Image Diffusion Transformer Inference With Mixed-Precision MX Quantization
IF 1.4 · CAS Tier 3 (Computer Science) · Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE · Pub Date: 2025-04-15 · DOI: 10.1109/LCA.2025.3560786 · Vol. 24, No. 1, pp. 141-144
Daeun Kim;Jinwoo Hwang;Changhun Oh;Jongse Park
Diffusion Transformer (DiT) has driven significant progress in image generation tasks. However, DiT inferencing is notoriously compute-intensive and incurs long latency even on datacenter-scale GPUs, primarily due to its iterative nature and heavy reliance on the GEMM operations inherent to its encoder-based structure. To address the challenge, prior work has explored quantization, but achieving low-precision quantization for DiT inferencing with both high accuracy and substantial speedup remains an open problem. To this end, this paper proposes MixDiT, an algorithm-hardware co-designed acceleration solution that exploits mixed Microscaling (MX) formats to quantize DiT activation values. MixDiT quantizes the DiT activation tensors by selectively applying higher precision to magnitude-based outliers, which produces mixed-precision GEMM operations. To achieve tangible speedup from the mixed-precision arithmetic, we design a MixDiT accelerator that enables precision-flexible multiplications and efficient MX precision conversions. Our experimental results show that MixDiT delivers a speedup of 2.10–5.32× over an RTX 3090 with no loss in FID.
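
A minimal NumPy sketch of the magnitude-based outlier split described above: the few largest-magnitude activations are quantized at higher precision while the bulk is quantized coarsely. Plain symmetric per-tensor rounding stands in for the MX block formats (which actually share a scale per small block of values); the outlier fraction and bit-widths are illustrative.

```python
import numpy as np

def mixed_quantize(x, outlier_frac=0.02, lo_bits=4, hi_bits=8):
    """Return (dequantized tensor, outlier mask) under per-bucket scales."""
    k = max(1, int(outlier_frac * x.size))
    thresh = np.partition(np.abs(x).ravel(), -k)[-k]
    outlier = np.abs(x) >= thresh
    def q(vals, bits):
        qmax = 2 ** (bits - 1) - 1
        scale = np.max(np.abs(vals)) / qmax if vals.size else 1.0
        return np.round(vals / scale) * scale     # symmetric rounding
    out = np.empty_like(x)
    out[~outlier] = q(x[~outlier], lo_bits)       # bulk at low precision
    out[outlier] = q(x[outlier], hi_bits)         # outliers kept finer
    return out, outlier

a = np.random.default_rng(1).normal(size=(8, 8)).astype(np.float32)
deq, mask = mixed_quantize(a)
print("outliers:", int(mask.sum()), "mse:", float(np.mean((a - deq) ** 2)))
```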
Citations: 0