LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure
Pub Date: 2025-11-03 | DOI: 10.1109/LCA.2025.3628325 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 361-364
Jaehong Cho;Hyunmin Choi;Jongse Park
This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5× fewer LoC and outperforms the predecessor's hardware-simulator integration, demonstrating LLMServingSim2.0's low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.
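The central idea here is trace-driven performance modeling: instead of simulating an accelerator cycle by cycle, the simulator replays operator-level latencies that a profiler measured once on the target hardware. The sketch below illustrates that lookup-and-sum approach; the trace format, operator names, and the estimate_decode_step helper are assumptions for illustration, not the LLMServingSim2.0 API:

# Illustrative trace-driven latency model (not LLMServingSim2.0's actual interface).
# A profiler records each (operator, shape) pair once on the target accelerator;
# the simulator then sums trace entries instead of modeling the hardware directly.

op_latency_us = {
    # (operator, batch, seq_len) -> measured latency in microseconds (made-up numbers)
    ("qkv_proj",  8, 1):  42.0,
    ("attention", 8, 1): 118.0,
    ("mlp",       8, 1):  95.0,
}

def estimate_decode_step(batch, seq_len, layers):
    """Estimate one decode iteration by replaying profiled operator latencies."""
    per_layer = sum(op_latency_us[(op, batch, seq_len)]
                    for op in ("qkv_proj", "attention", "mlp"))
    return per_layer * layers  # microseconds

print(estimate_decode_step(batch=8, seq_len=1, layers=32))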
{"title":"LLMServingSim2.0: A Unified Simulator for Heterogeneous Hardware and Serving Techniques in LLM Infrastructure","authors":"Jaehong Cho;Hyunmin Choi;Jongse Park","doi":"10.1109/LCA.2025.3628325","DOIUrl":"https://doi.org/10.1109/LCA.2025.3628325","url":null,"abstract":"This letter introduces LLMServingSim2.0, a system simulator designed for exploring heterogeneous hardware in large-scale LLM serving systems. LLMServingSim2.0 addresses two key limitations of its predecessor: (1) integrating hardware models into system-level simulators is non-trivial due to the lack of a clear abstraction, and (2) existing simulators support only a narrow subset of serving techniques, leaving no infrastructure that captures the breadth of approaches in modern LLM serving. To overcome these issues, LLMServingSim2.0 adopts trace-driven performance modeling, accompanied by an operator-level latency profiler, enabling the integration of new accelerators with a single command. It further embeds up-to-date serving techniques while exposing flexible interfaces for request routing, cache management, and scheduling policies. In a TPU case study, our profiler requires 18.5 × fewer LoC and outperforms the predecessor’s hardware-simulator integration, demonstrating LLMServingSim2.0’s low-effort hardware extensibility. Our experiments further show that LLMServingSim2.0 reproduces GPU-based LLM serving with 1.9% error, while maintaining practical simulation time, making it a comprehensive platform for both hardware developers and LLM service providers.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"361-364"},"PeriodicalIF":1.4,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560767","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models
Pub Date: 2025-10-31 | DOI: 10.1109/LCA.2025.3626929 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 373-376
Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim
As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15× improvement in tokens-per-second throughput, with only 0.013 mm² area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.
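With weight-only quantization, every GEMM must first expand the packed low-bit weights back to the compute datatype; this expansion is the step StreamDQ moves off the CUDA cores and into the HBM base die. A minimal numpy sketch of that dequantize-then-multiply pattern, assuming a per-channel scale/zero-point INT4 scheme (an illustration, not StreamDQ's hardware datapath):

import numpy as np

# Assumed INT4 weight-only scheme: w ~= scale * (q - zero_point).
rng = np.random.default_rng(0)
out_features, in_features = 128, 256
q = rng.integers(0, 16, size=(out_features, in_features), dtype=np.uint8)  # 4-bit codes
scale = rng.random((out_features, 1), dtype=np.float32) * 0.02             # per-output-channel scale
zero_point = np.full((out_features, 1), 8, dtype=np.float32)

x = rng.standard_normal((in_features,), dtype=np.float32)                  # activation vector

# The step a GPU normally runs on CUDA cores before the tensor-core GEMM:
w = scale * (q.astype(np.float32) - zero_point)   # dequantization (extra instructions + traffic)
y = w @ x                                         # GEMM/GEMV on the dequantized weights

# StreamDQ's premise: if the memory subsystem returns `w` directly on load,
# this intermediate expansion never occupies the CUDA cores at all.
print(y.shape)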
{"title":"StreamDQ: HBM-Integrated On-the-Fly DeQuantization via Memory Load for Large Language Models","authors":"Minki Jeong;Daegun Yoon;Soohong Ahn;Seungyong Lee;Jooyoung Kim;Jinuk Jeon;Joonseop Sim;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3626929","DOIUrl":"https://doi.org/10.1109/LCA.2025.3626929","url":null,"abstract":"As large language models (LLMs) scale, their memory and computation demands increase, making weight-only quantization a widely adopted technique to reduce memory footprint with minimal accuracy loss. However, CUDA core–based dequantization introduces significant instruction overhead, memory traffic, and pipeline stalls across all batch sizes, and critically remains a persistent bottleneck even in large-batch, cloud-scale LLM serving. To address these challenges, we propose StreamDQ, a lightweight architectural enhancement for cloud-scale LLM inference that enables on-the-fly dequantization within the memory subsystem by integrating compact DeQuantization Blocks (DQBs) into the base-die of high-bandwidth memory (HBM). StreamDQ leverages reserved address bits for control signaling, requiring no modifications to the ISA or compiler. Our design minimizes data movement, offloads computation from CUDA cores, and delivers dequantized weights directly to tensor cores for general matrix multiplication (GEMM) execution. StreamDQ achieves up to 83.57% reduction in inference latency and 5.15 × improvement in tokens-per-second throughput, with only 0.013 <inline-formula><tex-math>${mathrm{mm}}^{2}$</tex-math></inline-formula> area and 0.17 W power overhead per DQB. The design is scalable, software-transparent, and well-suited for high-throughput LLM inference on modern HBM-enabled GPU platforms.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"373-376"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145612132","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference
Pub Date: 2025-10-31 | DOI: 10.1109/LCA.2025.3627539 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 365-368
Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim
The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces Mixture of Shared KV Attention (MoSKA), an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of MoSKA is a novel Shared KV Attention mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an MoE-inspired sparse attention strategy that prunes the search space and a tailored Disaggregated Infrastructure that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7× over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference.
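The key transformation is that attention over a shared prefix, computed per request, is a memory-bound GEMV (one query vector against the shared K matrix), whereas stacking the concurrent queries turns it into a single compute-bound GEMM that streams the shared KV cache once. A small numpy sketch of that batching step (shapes and the softmax details are illustrative, not the paper's kernel):

import numpy as np

rng = np.random.default_rng(0)
d, shared_len, n_requests = 128, 4096, 32
K_shared = rng.standard_normal((shared_len, d)).astype(np.float32)  # KV of the reused shared context
V_shared = rng.standard_normal((shared_len, d)).astype(np.float32)
queries = rng.standard_normal((n_requests, d)).astype(np.float32)   # one decode query per request

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

# Baseline: one memory-bound GEMV per request; K_shared/V_shared are re-read n_requests times.
outs_gemv = np.stack([softmax(q @ K_shared.T / np.sqrt(d)) @ V_shared for q in queries])

# MoSKA-style batching: one GEMM over the stacked queries; the shared KV is read once.
scores = queries @ K_shared.T / np.sqrt(d)   # (n_requests, shared_len) GEMM
outs_gemm = softmax(scores) @ V_shared       # second GEMM

assert np.allclose(outs_gemv, outs_gemm, atol=1e-4)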
{"title":"MoSKA: Mixture of Shared KV Attention for Efficient Long-Sequence LLM Inference","authors":"Myunghyun Rhee;Sookyung Choi;Euiseok Kim;Joonseop Sim;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3627539","DOIUrl":"https://doi.org/10.1109/LCA.2025.3627539","url":null,"abstract":"The escalating context length in Large Language Models (LLMs) creates a severe performance bottleneck around the Key-Value (KV) cache, whose memory-bound nature leads to significant GPU under-utilization. This paper introduces <italic>Mixture of Shared KV Attention (MoSKA)</i>, an architecture that addresses this challenge by exploiting the heterogeneity of context data. It differentiates between per-request unique and massively reused shared sequences. The core of <italic>MoSKA</i> is a novel <italic>Shared KV Attention</i> mechanism that transforms the attention on shared data from a series of memory-bound GEMV operations into a single, compute-bound GEMM by batching concurrent requests. This is supported by an <italic>MoE-inspired sparse attention</i> strategy that prunes the search space and a tailored <italic>Disaggregated Infrastructure</i> that specializes hardware for unique and shared data. This comprehensive approach demonstrates a throughput increase of up to 538.7 × over baselines in workloads with high context sharing, offering a clear architectural path toward scalable LLM inference","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"365-368"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560768","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy
Pub Date: 2025-10-31 | DOI: 10.1109/LCA.2025.3627101 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 345-348
SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng
Differential privacy (DP) has become a de facto standard for restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and have strongly advised using an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that noise generation is significantly slower (by 187–296×) with the integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.
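The gap being measured comes from replacing vectorized floating-point samplers with exact, integer-only ones. The sketch below contrasts numpy's float Laplace noise with a pure-Python, integer-only discrete Laplace sampler in the style of Canonne, Kamath, and Steinke (a building block of discrete Gaussian sampling); it is an illustration of why exact rejection-style sampling is slow, not code from Google's library, and the timings will differ by machine:

import random, time
import numpy as np

def bernoulli_exp_minus(num, den, rng):
    """Return 1 with probability exp(-num/den), using integer arithmetic only."""
    while num > den:                        # reduce gamma > 1 via exp(-g) = exp(-1)^k * exp(-frac)
        if not bernoulli_exp_minus(den, den, rng):
            return 0
        num -= den
    k = 1                                   # series construction for gamma in [0, 1]
    while rng.randrange(den * k) < num:     # Bernoulli(gamma / k)
        k += 1
    return k % 2                            # 1 iff the first failure index is odd

def discrete_laplace(t, rng):
    """Exact sample with P(x) proportional to exp(-|x|/t), for a positive integer scale t."""
    while True:
        u = rng.randrange(t)
        if not bernoulli_exp_minus(u, t, rng):
            continue
        v = 0
        while bernoulli_exp_minus(1, 1, rng):   # geometric count of exp(-1) successes
            v += 1
        mag = u + t * v
        if rng.randrange(2) == 0:
            return mag
        if mag != 0:
            return -mag                         # resample instead of double-counting zero

rng, n = random.Random(0), 10_000
t0 = time.perf_counter(); _ = np.random.default_rng(0).laplace(scale=10.0, size=n)
t1 = time.perf_counter(); _ = [discrete_laplace(10, rng) for _ in range(n)]
t2 = time.perf_counter()
print(f"float laplace: {t1 - t0:.4f}s   integer discrete laplace: {t2 - t1:.4f}s")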
{"title":"Characterizing the System Overhead of Discrete Noise Generation for Differential Privacy","authors":"SeokHyeon Kong;Donghwan Kim;Euiseong Seo;Kiwan Maeng","doi":"10.1109/LCA.2025.3627101","DOIUrl":"https://doi.org/10.1109/LCA.2025.3627101","url":null,"abstract":"Differential privacy (DP) has become a de facto standard in restricting data leakage when releasing statistical information. It is widely used in data-centric applications, ranging from the US Census Bureau's population data collection to modern deep learning training. Recent studies have shown that using floating-point in implementing DP can significantly degrade its mathematical guarantees and strongly advised to use an integer-based implementation instead. However, nearly all popular DP libraries currently use floating-point. In this paper, we characterize the performance of a recent integer-based DP library from Google. Our study reveals that the noise generation is significantly slower (by 187–296×) when using an integer-based implementation, and that noise sampling can become a non-negligible overhead in real applications. Our findings highlight an overlooked but important overhead in realizing high-privacy DP and call for greater focus from the community.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"345-348"},"PeriodicalIF":1.4,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library
Pub Date: 2025-10-23 | DOI: 10.1109/LCA.2025.3624787 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 341-344
Jinyu Liu;Kiwan Maeng
Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.
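For readers unfamiliar with where the "fundamental MPC-related overheads" come from, the sketch below shows 2-party additive secret sharing with a dealer-supplied Beaver triple for a matrix product: the only values the parties exchange are the masked openings of E and F, and it is these open rounds (plus fixed-point truncation, omitted here) that grow with model size, batch size, and sequence length. This is a textbook construction over a 2^32 ring, not CrypTen's implementation:

import numpy as np

MOD = 1 << 32                                   # additive sharing over the ring Z_{2^32}
rng = np.random.default_rng(0)

def share(x):
    r = rng.integers(0, MOD, size=x.shape, dtype=np.uint64)
    return r, (x - r) % MOD                     # party 0 holds r, party 1 holds x - r

def reconstruct(a, b):
    return (a + b) % MOD

def beaver_matmul(x0, x1, y0, y1):
    """Secret-shared X @ Y using a dealer-generated triple (A, B, C = A @ B)."""
    A = rng.integers(0, MOD, size=x0.shape, dtype=np.uint64)
    B = rng.integers(0, MOD, size=y0.shape, dtype=np.uint64)
    C = (A @ B) % MOD
    A0, A1 = share(A); B0, B1 = share(B); C0, C1 = share(C)
    # Communication happens here: both parties open E = X - A and F = Y - B.
    E = reconstruct((x0 - A0) % MOD, (x1 - A1) % MOD)
    F = reconstruct((y0 - B0) % MOD, (y1 - B1) % MOD)
    z0 = (C0 + E @ B0 + A0 @ F + E @ F) % MOD   # local arithmetic only
    z1 = (C1 + E @ B1 + A1 @ F) % MOD
    return z0, z1

X = rng.integers(0, 100, size=(4, 8), dtype=np.uint64)   # small integer stand-ins for weights/activations
Y = rng.integers(0, 100, size=(8, 3), dtype=np.uint64)
z0, z1 = beaver_matmul(*share(X), *share(Y))
assert np.array_equal(reconstruct(z0, z1), (X @ Y) % MOD)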
{"title":"In-Depth Characterization of Machine Learning on an Optimized Multi-Party Computing Library","authors":"Jinyu Liu;Kiwan Maeng","doi":"10.1109/LCA.2025.3624787","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624787","url":null,"abstract":"Secure multi-party computation (MPC) allows multiple parties to collaboratively run machine learning (ML) training and inference without each party revealing its secret data or model weights. Prior works characterized popular MPC-based ML libraries, such as Meta’s CrypTen, to reveal their system overheads and built optimizations guided by the observations. However, we found potential concerns in this process. Through a careful inspection of the CrypTen library, we discovered several inefficient implementations that could overshadow fundamental MPC-related overheads. Furthermore, we observed that the characteristics can vary significantly depending on several factors, such as the model type, batch size, sequence length, and network conditions, many of which prior works do not vary during their evaluation. Our results indicate that focusing solely on a narrow experimental setup and/or relying on characterization without a deep understanding can misguide researchers and call for a more mature framework and standardized evaluation methodology.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"341-344"},"PeriodicalIF":1.4,"publicationDate":"2025-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560749","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency
Pub Date: 2025-10-22 | DOI: 10.1109/LCA.2025.3624004 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 349-352
Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong
Customizing accelerators for low-bit and mixed-bit convolutional neural networks (CNNs) has been a promising approach to enhancing their computing efficiency. However, current low-bit and mixed-bit accelerators sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs such as MobileNets that contain depth-wise convolution (DW-CONV). These accelerators tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and struggle to deliver good experimental results in all of them at once. In this work, we propose an accelerator that performs well across all of these aspects: performance, power efficiency, and Top1 accuracy. First, an arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy, and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive dataflow is presented that supports standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently without increasing the hardware consumption of the ABQ-based PE. Implemented on the Zynq ZC706 platform and compared with other works, the proposed accelerator is the first to achieve good experimental results in all aspects, reaching 1.28×–5.76× the power efficiency and 1.11×–5.81× the performance of prior designs while delivering the best Top1 accuracy.
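Depth-wise convolution is singled out because of its low data reuse: each input activation feeds only K*K multiply-accumulates instead of K*K*C_out, so a PE array sized for channel reduction in standard convolution sits mostly idle. A back-of-the-envelope comparison (the layer shape is assumed for illustration, not taken from the paper):

# MAC counts and per-activation reuse for a 3x3 layer at 56x56 resolution.
H = W = 56
K = 3
C_in = C_out = 128

std_macs = H * W * K * K * C_in * C_out   # standard convolution
dw_macs  = H * W * K * K * C_in           # depth-wise convolution (one filter per channel)

std_reuse_per_activation = K * K * C_out  # times each input activation is consumed
dw_reuse_per_activation  = K * K

print(f"standard conv  : {std_macs/1e6:8.1f} MMACs, reuse per activation = {std_reuse_per_activation}")
print(f"depth-wise conv: {dw_macs/1e6:8.1f} MMACs, reuse per activation = {dw_reuse_per_activation}")
# Depth-wise convolution performs C_out times fewer MACs per byte of activation traffic,
# which is why a dataflow tuned only for standard convolution loses utilization on MobileNet-style layers.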
{"title":"A Multiple-Aspect Optimal CNN Accelerator in Top1 Accuracy, Performance, and Power Efficiency","authors":"Xianghong Hu;Yuanmiao Lin;Xueming Li;Ruidian Zhan;Jie Cao;Dayong Zhu;Shuting Cai;Xin Zheng;Xiaoming Xiong","doi":"10.1109/LCA.2025.3624004","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624004","url":null,"abstract":"The customization of accelerators for low-bit and mixed-bit convolutional neural works (CNNs) has been a promising approach to enhance computing efficiency of CNNs. However, current low-bit and mix-bit accelerator sacrifice some network accuracy to achieve higher performance and power efficiency, especially for lightweight CNNs like MobileNets containing depth-wise convolution (DW-CONV). These accelerators for low-bit or mixed-bit CNNS tend to achieve good results in only one aspect, such as performance, power efficiency, or Top1 accuracy, and it is difficult to achieve good experimental results in all aspects. In this work, we propose an accelerator that perform well in multifaceted aspects including performance, power efficiency, and Top1 accuracy. First, arbitrary-basis quantization (ABQ) method is used to enhance Top1 accuracy and a dedicated ABQ-based processing element (PE) is proposed to improve performance. Then, an adaptive data flow is presented to support standard convolution (SD-CONV) and depth-wise convolution (DW-CONV) efficiently in the primise of without increasing hardware consumption of the ABQ-based PE. Implemented on Zynp ZC706 platform, compared with other works, the proposed accelerator first achieve good experimental results in all aspects, achieving 1.28 × –5.76 × power efficiency and 1.11 × –5.81 × performance in the premise of the best Top1 accuracy.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"349-352"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560751","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
PNM Meets Sparse Attention: Enabling Multi-Million Tokens Inference at Scale
Pub Date: 2025-10-22 | DOI: 10.1109/LCA.2025.3624272 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 353-356
Sookyung Choi;Myunghyun Rhee;Euiseok Kim;Kwangsik Shin;Youngpyo Joo;Hoshik Kim
Processing multi-million tokens for advanced Large Language Models (LLMs) poses a significant memory bottleneck for existing AI systems. This bottleneck stems from a fundamental resource imbalance, where enormous memory capacity and bandwidth are required, yet the computational load is minimal. We propose NELSSA (Processing Near Memory for Extremely Long Sequences with Sparse Attention), an architectural platform that synergistically combines the high-capacity Processing Near Memory (PNM) with the principles of dynamic sparse attention to address this issue. This approach enables capacity scaling without performance degradation, and our evaluation shows that NELSSA can process up to 20M-token sequences on a single node (Llama-2-70B), achieving an 11× to 40× speedup over a representative DIMM-based PNM system. The proposed architecture radically resolves existing inefficiencies, enabling previously impractical multi-million-token processing and thus laying the foundation for next-generation AI applications.
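Dynamic sparse attention of the kind NELSSA builds on typically scores coarse blocks of the KV cache with a cheap proxy (for example, the query against a per-block mean key) and then runs exact attention only over the selected blocks, which keeps the compute near memory modest even for multi-million-token contexts. A small numpy sketch of that select-then-attend pattern (block size, top-k, and the proxy score are assumptions, not the paper's exact policy):

import numpy as np

rng = np.random.default_rng(0)
d, block, n_blocks, topk = 64, 256, 64, 4
seq_len = block * n_blocks                                 # 16K-token toy context

K = rng.standard_normal((seq_len, d)).astype(np.float32)
V = rng.standard_normal((seq_len, d)).astype(np.float32)
q = rng.standard_normal((d,)).astype(np.float32)

# Cheap proxy: score each block by the query against the block's mean key.
block_means = K.reshape(n_blocks, block, d).mean(axis=1)   # (n_blocks, d)
selected = np.argsort(block_means @ q)[-topk:]             # indices of the top-k blocks

# Exact attention restricted to the selected blocks only.
idx = np.concatenate([np.arange(b * block, (b + 1) * block) for b in selected])
scores = (K[idx] @ q) / np.sqrt(d)
weights = np.exp(scores - scores.max()); weights /= weights.sum()
out = weights @ V[idx]                                     # (d,) attention output

print(f"attended to {len(idx)} of {seq_len} cached tokens")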
{"title":"PNM Meets Sparse Attention: Enabling Multi-Million Tokens Inference at Scale","authors":"Sookyung Choi;Myunghyun Rhee;Euiseok Kim;Kwangsik Shin;Youngpyo Joo;Hoshik Kim","doi":"10.1109/LCA.2025.3624272","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624272","url":null,"abstract":"Processing multi-million tokens for advanced Large Language Models (LLMs) poses a significant memory bottleneck for existing AI systems. This bottleneck stems from a fundamental resource imbalance, where enormous memory capacity and bandwidth are required, yet the computational load is minimal. We propose <monospace>NELSSA</monospace> (Processing <underline>N</u>ear Memory for <underline>E</u>xtremely <underline>L</u>ong <underline>S</u>equences with <underline>S</u>parse <underline>A</u>ttention), an architectural platform that synergistically combines the high-capacity Processing Near Memory (PNM) with the principles of dynamic sparse attention to address this issue. This approach enables capacity scaling without performance degradation, and our evaluation shows that <monospace>NELSSA</monospace> can process up to 20M-token sequences on a single node (Llama-2-70B), achieving an 11× to 40× speedup over a representative DIMM-based PNM system. The proposed architecture radically resolves existing inefficiencies, enabling previously impractical multi-million-token processing and thus laying the foundation for next-generation AI applications.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"353-356"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145560750","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reimagining RDMA Through the Lens of ML
Pub Date: 2025-10-22 | DOI: 10.1109/LCA.2025.3624158 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 393-396
Ertza Warraich;Ali Imran;Annus Zulfiqar;Shay Vargaftik;Sonia Fahmy;Muhammad Shahbaz
As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly, where even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML's tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3×, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults, delivering a resilient, scalable transport tailored for ML at cluster scale.
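The reason a best-effort transport can hand loss recovery to the ML pipeline is that a Hadamard transform applied before transmission spreads each gradient value across all transmitted coefficients, so dropping a few packets perturbs every coordinate slightly instead of erasing some of them entirely. A minimal numpy demonstration of that effect (the chunking and loss model are illustrative assumptions, not the Celeris wire format):

import numpy as np

def hadamard(n):
    """Dense n x n Hadamard matrix for n a power of two (Sylvester construction)."""
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

rng = np.random.default_rng(0)
n = 1024
g = rng.standard_normal(n)                           # a gradient chunk
H = hadamard(n) / np.sqrt(n)                         # orthonormal, so H @ H.T = I

coded = H @ g                                        # transform before sending
lost = rng.choice(n, size=n // 10, replace=False)    # ~10% of coefficients never arrive
coded[lost] = 0.0                                    # receiver fills missing packets with zeros
recovered = H.T @ coded                              # inverse transform (H is orthogonal)

# Without the transform, 10% of gradient entries would simply be zeroed out;
# with it, the same loss becomes small, spread-out noise on every entry.
print("max per-entry error :", np.abs(recovered - g).max())
print("relative L2 error   :", np.linalg.norm(recovered - g) / np.linalg.norm(g))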
{"title":"Reimagining RDMA Through the Lens of ML","authors":"Ertza Warraich;Ali Imran;Annus Zulfiqar;Shay Vargaftik;Sonia Fahmy;Muhammad Shahbaz","doi":"10.1109/LCA.2025.3624158","DOIUrl":"https://doi.org/10.1109/LCA.2025.3624158","url":null,"abstract":"As distributed machine learning (ML) workloads scale to thousands of GPUs connected by ultra-high-speed interconnects, tail latency in collective communication has emerged as a primary bottleneck. Prior RDMA designs, like RoCE, IRN, and SRNIC, enforce strict reliability and in-order delivery, relying on retransmissions and packet sequencing to ensure correctness. While effective for general-purpose workloads, these mechanisms introduce complexity and latency that scale poorly, where even rare packet losses or delays can consistently degrade system performance. We introduce Celeris, a domain-specific RDMA transport that revisits traditional reliability guarantees based on ML’s tolerance for lost or partial data. Celeris removes retransmissions and in-order delivery from the RDMA NIC, enabling best-effort transport that exploits the robustness of ML workloads. It retains congestion control (e.g., DCQCN) and manages communication with software-level mechanisms such as adaptive timeouts and data prioritization, while shifting loss recovery to the ML pipeline (e.g., using the Hadamard Transform). Early results show that Celeris reduces 99th-percentile latency by up to 2.3×, cuts BRAM usage by 67%, and nearly doubles NIC resilience to faults—delivering a resilient, scalable transport tailored for ML at cluster scale.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"393-396"},"PeriodicalIF":1.4,"publicationDate":"2025-10-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145778198","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization
Pub Date: 2025-10-20 | DOI: 10.1109/LCA.2025.3623137 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 333-336
Honghui Liu;Xian Lin;Xin Zheng;Qiancheng Liu;Huaien Gao;Shuting Cai;Xiaoming Xiong
Modern processors rely on the last-level cache to bridge the growing latency gap between the CPU core and main memory. However, the memory access patterns of contemporary applications exhibit increasing complexity, characterized by significant temporal locality, irregular reuse, and high conflict rates. We propose a partial tag-data decoupling architecture that leverages temporal locality without modifying the main cache structure or replacement policy. A lightweight auxiliary tag path is introduced, where data is allocated only upon reuse confirmation, thus minimizing resource waste caused by low-reuse blocks. The experimental results show that the proposed design achieves an average IPC improvement of 1.55% and a 5.33% reduction in MPKI without prefetching. With prefetching enabled, IPC improves by 1.96% and MPKI is further reduced by 10.91%, while overall storage overhead is decreased by approximately 2.59%.
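A behavioral sketch of the policy described above: a miss first allocates only a tag in the lightweight auxiliary array, and a data entry in the unmodified main cache is allocated only when that tag is re-referenced, so single-use blocks never consume data storage. The structure sizes and the simple LRU/FIFO victim choices here are assumptions for illustration, not the paper's exact microarchitecture:

from collections import OrderedDict

class DecoupledLLC:
    """Toy model: main cache (tag + data) plus an auxiliary tag-only path."""
    def __init__(self, main_entries=4, aux_entries=8):
        self.main = OrderedDict()   # block -> data present (LRU order)
        self.aux = OrderedDict()    # block -> tag only, no data allocated yet
        self.main_entries, self.aux_entries = main_entries, aux_entries

    def access(self, block):
        if block in self.main:      # ordinary hit in the unmodified main cache
            self.main.move_to_end(block)
            return "hit"
        if block in self.aux:       # reuse confirmed: promote and allocate data now
            del self.aux[block]
            self._insert_main(block)
            return "promote"
        # First-touch miss: remember the tag only; low-reuse blocks stop here.
        self.aux[block] = True
        if len(self.aux) > self.aux_entries:
            self.aux.popitem(last=False)
        return "miss"

    def _insert_main(self, block):
        self.main[block] = True
        if len(self.main) > self.main_entries:
            self.main.popitem(last=False)   # evict LRU victim

llc = DecoupledLLC()
for b in [1, 2, 3, 1, 4, 1, 2, 5, 6, 7]:    # only re-referenced blocks (1 and 2) earn data entries
    print(b, llc.access(b))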
{"title":"A Partial Tag–Data Decoupled Architecture for Last-Level Cache Optimization","authors":"Honghui Liu;Xian Lin;Xin Zheng;Qiancheng Liu;Huaien Gao;Shuting Cai;Xiaoming Xiong","doi":"10.1109/LCA.2025.3623137","DOIUrl":"https://doi.org/10.1109/LCA.2025.3623137","url":null,"abstract":"Modern processors rely on the last-level cache to bridge the growing latency gap between the CPU core and main memory. However, the memory access patterns of contemporary applications exhibit increasing complexity, characterized by significant temporal locality, irregular reuse, and high conflict rates. We propose a partial tag-data decoupling architecture that leverages temporal locality without modifying the main cache structure or replacement policy. A lightweight auxiliary tag path is introduced, where data is allocated only upon reuse confirmation, thus minimizing resource waste caused by low-reuse blocks. The experimental results show that the proposed design achieves an average IPC improvement of 1.55% and a 5.33% reduction in MPKI without prefetching. With prefetching enabled, IPC improves by 1.96% and MPKI is further reduced by 10.91%, while overall storage overhead is decreased by approximately 2.59%.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"333-336"},"PeriodicalIF":1.4,"publicationDate":"2025-10-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System
Pub Date: 2025-10-17 | DOI: 10.1109/LCA.2025.3622724 | IEEE Computer Architecture Letters, vol. 24, no. 2, pp. 337-340
Yunhua Fang;Rui Xie;Asad Ul Haq;Linsen Ma;Kaoutar El Maghraoui;Naigang Wang;Meng Wang;Liu Liu;Tong Zhang
Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.
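Under the simplest model consistent with this abstract, HBM and off-package DRAM can be read concurrently, so a KV cache of S bytes split with fraction x in HBM drains in max(x*S/B_hbm, (1-x)*S/B_dram); the bound is reached at x = B_hbm/(B_hbm + B_dram) unless HBM capacity caps x first. The numbers and the two-tier model below are illustrative assumptions used to show the shape of that upper bound, not results from the letter:

def best_placement(S_gb, hbm_bw, dram_bw, hbm_cap_gb):
    """Fraction of KV bytes kept in HBM that maximizes aggregate read bandwidth."""
    x_balanced = hbm_bw / (hbm_bw + dram_bw)                 # equalizes the two drain times
    x = min(x_balanced, hbm_cap_gb / S_gb)                   # the capacity constraint may bind
    t = max(x * S_gb / hbm_bw, (1 - x) * S_gb / dram_bw)     # seconds to stream the whole cache
    return x, S_gb / t                                       # placement and effective bandwidth (GB/s)

# Example: 3 TB/s HBM with 80 GB usable for KV, plus 0.5 TB/s off-package DRAM.
for S in (40, 200, 800):                                     # KV cache footprint in GB
    x, eff_bw = best_placement(S, hbm_bw=3000, dram_bw=500, hbm_cap_gb=80)
    print(f"KV={S:4d} GB  HBM fraction={x:.2f}  effective bandwidth={eff_bw:6.0f} GB/s")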
{"title":"Accelerating LLM Inference via Dynamic KV Cache Placement in Heterogeneous Memory System","authors":"Yunhua Fang;Rui Xie;Asad Ul Haq;Linsen Ma;Kaoutar El Maghraoui;Naigang Wang;Meng Wang;Liu Liu;Tong Zhang","doi":"10.1109/LCA.2025.3622724","DOIUrl":"https://doi.org/10.1109/LCA.2025.3622724","url":null,"abstract":"Large Language Model (LLM) inference is increasingly constrained by memory bandwidth, with frequent access to the key-value (KV) cache dominating data movement. While attention sparsity reduces some memory traffic, the relevance of past tokens varies over time, requiring the full KV cache to remain accessible and sustaining pressure on both bandwidth and capacity. With advances in interconnects such as NVLink and LPDDR5X, modern AI hardware now integrates high-bandwidth memory (HBM) with high-speed off-package DRAM, making heterogeneous memory systems a practical solution. This work investigates dynamic KV cache placement across such systems to maximize aggregated bandwidth utilization under capacity constraints. Rather than proposing a specific scheduling policy, we formulate the placement problem mathematically and derive a theoretical upper bound, revealing substantial headroom for runtime optimization. To our knowledge, this is the first formal treatment of dynamic KV cache scheduling in heterogeneous memory systems for LLM inference.","PeriodicalId":51248,"journal":{"name":"IEEE Computer Architecture Letters","volume":"24 2","pages":"337-340"},"PeriodicalIF":1.4,"publicationDate":"2025-10-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145455932","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}