
Latest publications in IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems

GOURD: Tensorizing Streaming Applications to Generate Multi-Instance Compute Platforms
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3445810
Patrick Schmid;Paul Palomero Bernardo;Christoph Gerum;Oliver Bringmann
In this article, we rethink the dataflow processing paradigm at a higher level of abstraction to automate the generation of multi-instance compute and memory platforms with interfaces to I/O devices (sensors and actuators). Since the different compute instances (NPUs, CPUs, DSPs, etc.) and I/O devices do not necessarily have compatible interfaces at the dataflow level, an automated translation is required. However, in multidimensional dataflow scenarios, it is inherently difficult to reason about buffer sizes and iteration order without knowing the shape of the data access pattern (DAP) that the dataflow follows. To capture this shape and the platform composition, we define a domain-specific representation (DSR) and devise a toolchain to generate a synthesizable platform, including appropriate streaming buffers for platform-specific tensorization of the data between incompatible interfaces. This allows platforms such as sensor edge AI devices to be specified simply by focusing on the shape of the data provided by the sensors and transmitted among compute units, enabling the evaluation and generation of different dataflow design alternatives with significantly reduced design time.
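The core buffering task GOURD automates, regrouping a producer's flat sample stream into the block shape a consumer expects, can be illustrated with a small sketch. The function name and shapes below are illustrative assumptions, not GOURD's API or its DSR:

```python
from itertools import islice

def tensorize(stream, shape):
    """Regroup a flat sample stream into tensors of the given shape.

    A minimal stand-in for a streaming buffer that adapts a producer's
    flat sample order to a consumer expecting `shape`-sized blocks;
    trailing samples that do not fill a whole block are dropped.
    """
    block = 1
    for d in shape:
        block *= d
    it = iter(stream)
    while True:
        flat = list(islice(it, block))
        if len(flat) < block:
            return
        # nest the flat block into the requested shape, innermost-first
        rows = flat
        for d in reversed(shape[1:]):
            rows = [rows[i:i + d] for i in range(0, len(rows), d)]
        yield rows

# e.g. a 12-sample sensor stream consumed as 2x3 tiles
tiles = list(tensorize(range(12), (2, 3)))
```

A real streaming buffer would additionally reorder samples when producer and consumer iterate the same data in different orders; this sketch only handles regrouping.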
Citations: 0
DREAMx: A Data-Driven Error Estimation Methodology for Adders Composed of Cascaded Approximate Units
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3447209
Muhammad Abdullah Hanif;Ayoub Arous;Muhammad Shafique
Due to the significance and broad utilization of adders in computing systems, the design of low-power approximate adders (LPAAs) has received significant attention from the system design community. However, selecting and deploying appropriate approximate modules requires a thorough design space exploration, which is in general an extremely time-consuming process. To reduce the exploration time, different error estimation techniques have been proposed in the literature for evaluating the quality metrics of approximate adders. However, most of them are based on certain assumptions that limit their usability in real-world settings. In this work, we highlight the impact of those assumptions on the quality of error estimates provided by state-of-the-art techniques and how they limit the use of such techniques in real-world settings. Moreover, we highlight the significance of considering input data characteristics to improve the quality of error estimation. Based on our analysis, we propose a systematic data-driven error estimation methodology, DREAMx, for adders composed of cascaded approximate units, which covers a predominant set of LPAAs. DREAMx factors in the dependence between input bits based on the given input distribution to compute the probability mass function (PMF) of the error value at the output of an approximate adder. It achieves improved results compared to state-of-the-art techniques while offering a substantial decrease in overall execution (exploration) time compared to exhaustive simulations. Our results further show that there is a delicate tradeoff between the achievable quality of error estimates and the overall execution time.
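The quantity DREAMx estimates, the PMF of the error value at an approximate adder's output under a given input distribution, can be computed exactly by brute force on a tiny adder. The lower-part OR adder (LOA) and the uniform i.i.d. input distribution below are illustrative assumptions, not the paper's methodology, which specifically avoids this kind of exhaustive enumeration:

```python
from collections import Counter
from itertools import product

def loa_add(a, b, k, width):
    """Lower-part OR adder: approximate the low k bits with OR, exact high bits."""
    mask = (1 << k) - 1
    low = (a & mask) | (b & mask)        # approximate lower part (no carry out)
    high = ((a >> k) + (b >> k)) << k    # exact upper part
    return high | low

def error_pmf(k, width=4):
    """Exact PMF of (approx - exact) over i.i.d. uniform inputs (illustrative)."""
    counts = Counter()
    n = 1 << width
    for a, b in product(range(n), repeat=2):
        counts[loa_add(a, b, k, width) - (a + b)] += 1
    total = n * n
    return {e: c / total for e, c in sorted(counts.items())}

pmf = error_pmf(k=2)   # error distribution of a 4-bit LOA with 2 approximate bits
```

For an LOA the OR of the low parts never exceeds their sum, so every error value is nonpositive; a data-driven estimator replaces the uniform distribution here with the measured input statistics.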
Citations: 0
Large Data Transfer Optimization for Improved Robustness in Real-Time V2X-Communication
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3436548
Alex Bendrick;Nora Sperling;Rolf Ernst
Vehicle-to-everything (V2X) roadmaps envision future applications that require the reliable exchange of large sensor data over a wireless network in real time. Applications include sensor fusion for cooperative perception or remote vehicle control, which are subject to stringent real-time and safety constraints. Real-time requirements result from end-to-end latency constraints, while reliability refers to the quest for loss-free sensor data transfer to reach maximum application quality. In wireless networks, the two requirements conflict because of the need for error correction. Notably, the established video coding standards are not suitable for this task, as demonstrated in experiments. This article shows that middleware-based backward error correction (BEC) in combination with application-controlled selective data transmission is far more effective for this purpose. The mechanisms proposed in this article use application and context knowledge to dynamically adapt the data object volume at high error rates while sustaining application resilience. We evaluate popular camera datasets and perception pipelines from the automotive domain and apply two complementary strategies. The results and comparisons show that this approach has great benefits, far beyond the state of the art. It also shows that no single strategy outperforms the other in all use cases.
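A minimal sketch of application-controlled selective transmission: as the channel error rate grows, BEC retransmissions shrink the usable budget, so lower-priority parts of a data object are dropped first. The linear budget model, field names, and priority scheme are assumptions for illustration only, not the paper's mechanisms:

```python
def select_payload(chunks, capacity_bytes, error_rate):
    """Choose which parts of a large data object to transmit.

    Retransmissions under backward error correction consume link capacity,
    modeled here (crudely) by scaling the budget with (1 - error_rate).
    Higher-priority chunks (e.g. regions of interest) are kept first.
    """
    budget = capacity_bytes * (1.0 - error_rate)
    sent = []
    for chunk in sorted(chunks, key=lambda c: c["priority"], reverse=True):
        if chunk["size"] <= budget:
            sent.append(chunk["name"])
            budget -= chunk["size"]
    return sent

# hypothetical camera-frame chunks: a region of interest, background, metadata
chunks = [{"name": "roi",  "size": 40, "priority": 3},
          {"name": "map",  "size": 80, "priority": 1},
          {"name": "meta", "size": 5,  "priority": 2}]
sent = select_payload(chunks, capacity_bytes=100, error_rate=0.2)
```

The point of the sketch is the control direction: the application, not the transport layer, decides what to shed when the channel degrades.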
Citations: 0
iFKVS: Lightweight Key-Value Store for Flash-Based Intermittently Computing Devices
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3443698
Yen-Hsun Chen;Ting-En Liao;Li-Pin Chang
Energy harvesting enables long-running sensing applications on tiny Internet of Things (IoT) devices without a battery installed. To overcome the intermittency of ambient energy sources, system software creates intermittent computation using checkpoints. While the scope of intermittent computation is quickly expanding, there is a strong demand for data storage and local data processing in such IoT devices. Among data storage options, flash memory is more compelling than other types of nonvolatile memory due to its affordability and availability. We introduce iFKVS, a flash-based key-value store for multisensor IoT devices. In this study, we aim to support efficient key-value operations while guaranteeing the correctness of program execution across power interruptions. For indexing multidimensional sensor data, we propose a quadtree-based structure that minimizes the extra writes caused by splitting and rebalancing; for checkpointing in flash storage, we propose a rollback-based algorithm that exploits the byte-level writing and one-way bit-flipping capabilities of flash memory. Experimental results based on a real energy-driven testbed demonstrate that, with the same index structure design, our rollback-based approach reduces the total execution time by 45% and 84% compared with checkpointing using write-ahead logging (WAL) and copy-on-write (COW), respectively.
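A plain point quadtree, the kind of structure iFKVS builds its multidimensional index on, can be sketched as follows; the flash-aware write minimization and rollback-based checkpointing that are the paper's actual contributions are not modeled here:

```python
class QuadNode:
    """Minimal point quadtree for 2-D sensor samples (illustrative sketch)."""

    def __init__(self, x0, y0, x1, y1, capacity=4):
        self.box = (x0, y0, x1, y1)   # half-open region [x0, x1) x [y0, y1)
        self.points = []
        self.children = None
        self.capacity = capacity

    def insert(self, x, y):
        x0, y0, x1, y1 = self.box
        if not (x0 <= x < x1 and y0 <= y < y1):
            return False              # point outside this node's region
        if self.children is None:
            if len(self.points) < self.capacity:
                self.points.append((x, y))
                return True
            self._split()             # leaf is full: split into quadrants
        return any(c.insert(x, y) for c in self.children)

    def _split(self):
        x0, y0, x1, y1 = self.box
        mx, my = (x0 + x1) / 2, (y0 + y1) / 2
        self.children = [QuadNode(x0, y0, mx, my, self.capacity),
                         QuadNode(mx, y0, x1, my, self.capacity),
                         QuadNode(x0, my, mx, y1, self.capacity),
                         QuadNode(mx, my, x1, y1, self.capacity)]
        for p in self.points:         # redistribute resident points
            any(c.insert(*p) for c in self.children)
        self.points = []

root = QuadNode(0, 0, 16, 16, capacity=2)
ok = [root.insert(x, y) for x, y in [(1, 1), (2, 2), (3, 3), (10, 10), (12, 4)]]
```

Note that a split rewrites every resident point, which is exactly the flash write amplification the paper's quadtree variant is designed to minimize.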
Citations: 0
BERN-NN-IBF: Enhancing Neural Network Bound Propagation Through Implicit Bernstein Form and Optimized Tensor Operations
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3447577
Wael Fatnassi;Arthur Feeney;Valen Yamamoto;Aparna Chandramowlishwaran;Yasser Shoukry
Neural networks have emerged as powerful tools across various domains, exhibiting remarkable empirical performance that has motivated their widespread adoption in safety-critical applications. This, in turn, necessitates rigorous formal verification techniques to ensure their reliability and robustness. Tight bound propagation plays a crucial role in the formal verification process by providing precise bounds that can be used to formulate and verify properties, such as safety, robustness, and fairness. While state-of-the-art tools use linear and convex approximations to compute upper/lower bounds for each neuron’s outputs, recent advances have shown that nonlinear approximations based on Bernstein polynomials lead to tighter bounds but suffer from scalability issues. To that end, this article introduces BERN-NN-IBF, a significant enhancement of the Bernstein-polynomial-based bound propagation algorithms. BERN-NN-IBF offers three main contributions: 1) a memory-efficient encoding of Bernstein polynomials to scale the bound propagation algorithms; 2) optimized tensor operations for the new polynomial encoding to maintain the integrity of the bounds while enhancing computational efficiency; and 3) tighter under-approximations of the ReLU activation function using quadratic polynomials tailored to minimize approximation errors. Through comprehensive testing, we demonstrate that BERN-NN-IBF achieves tighter bounds and higher computational efficiency compared to the original BERN-NN and state-of-the-art methods, including the linear and convex programming used within the winner of the VNN-COMPETITION.
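The property underlying Bernstein-based bound propagation is that a polynomial's Bernstein coefficients on [0, 1] enclose its range there: the smallest and largest coefficients are valid lower and upper bounds. A dense univariate version of this enclosure (unlike the paper's implicit, tensorized multivariate encoding) looks like this:

```python
from math import comb

def bernstein_coeffs(poly):
    """Bernstein coefficients on [0, 1] of a power-basis polynomial.

    poly[k] is the coefficient of x**k. Uses the standard conversion
    b_i = sum_{k<=i} C(i, k) / C(n, k) * poly[k].
    """
    n = len(poly) - 1
    return [sum(comb(i, k) / comb(n, k) * poly[k] for k in range(i + 1))
            for i in range(n + 1)]

def enclosure(poly):
    """Sound (not necessarily tight) bounds on the polynomial over [0, 1]."""
    b = bernstein_coeffs(poly)
    return min(b), max(b)

lo, hi = enclosure([0, 1, -1])   # p(x) = x - x**2, true range [0, 0.25]
```

Here `enclosure([0, 1, -1])` returns (0.0, 0.5): sound but loose, which is why degree elevation and subdivision are used in practice to tighten such bounds.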
Citations: 0
Near-Free Lifetime Extension for 3-D NAND Flash via Opportunistic Self-Healing
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3447225
Tianyu Ren;Qiao Li;Yina Lv;Min Ye;Nan Guan;Chun Jason Xue
3-D NAND flash memories are the dominant storage media in modern data centers due to their high performance, large storage capacity, and low power consumption. However, the lifetime of flash memory has decreased as technology scaling advances. Recent work has revealed that the number of achievable program/erase (P/E) cycles of flash blocks is related to the dwell time (DT) between two adjacent erase operations. A longer DT can lead to higher achievable P/E cycles and, therefore, a longer lifetime for flash memories. This article finds that the achievable P/E cycles increase when flash blocks endure an uneven DT distribution. Based on this observation, this article presents an opportunistic self-healing method to extend the lifetime of flash memory. By maintaining two groups with unequal block counts, namely, the Active Group and the Healing Group, the proposed method creates an imbalance in the erase operation distribution. The Active Group undergoes more frequent erase operations, resulting in shorter DT, while the Healing Group experiences longer DT. Periodically, the roles of the two groups are switched based on the Active Group’s partitioning ratio. This role switching ensures that each block experiences both short and long DT periods, leading to an uneven DT distribution that magnifies the self-healing effect. The evaluation shows that the proposed method can improve the flash lifetime by 19.3% and 13.2% on average with near-free overheads, compared with the baseline and the related work, respectively.
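The grouping policy can be sketched with a toy simulation: erases land only on the Active Group, so its blocks see short dwell times while Healing Group blocks rest, and the roles swap periodically. All parameters, and the uniform choice of which active block to erase, are illustrative assumptions rather than the paper's configuration:

```python
import random

def simulate_dwell(num_blocks=8, active_ratio=0.25, erases=200, switch_period=50):
    """Track per-block dwell times (time since last erase) under role switching."""
    split = max(1, int(num_blocks * active_ratio))   # unequal group sizes
    active = list(range(split))
    healing = list(range(split, num_blocks))
    last_erase = [0] * num_blocks
    dwell = {b: [] for b in range(num_blocks)}
    for t in range(1, erases + 1):
        b = random.choice(active)                 # erases hit only the Active Group
        dwell[b].append(t - last_erase[b])        # DT since this block's last erase
        last_erase[b] = t
        if t % switch_period == 0:                # periodic role switch
            active, healing = healing, active
    return dwell

random.seed(7)
dwell = simulate_dwell()
```

A block's dwell-time list alternates between runs of short values (while it is active) and a few long values (the first erase after a healing phase), which is the uneven distribution the paper exploits.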
Citations: 0
AttentionRC: A Novel Approach to Improve Locality Sensitive Hashing Attention on Dual-Addressing Memory
IF 2.7 | CAS Tier 3 (Computer Science) | Q2 Computer Science, Hardware & Architecture | Pub Date: 2024-11-06 | DOI: 10.1109/TCAD.2024.3447217
Chun-Lin Chu;Yun-Chih Chen;Wei Cheng;Ing-Chao Lin;Yuan-Hao Chang
Attention is a crucial component of the Transformer architecture and a key factor in its success. However, it suffers from quadratic growth in time and space complexity as the input sequence length increases. One popular approach to address this issue is the Reformer model, which uses locality-sensitive hashing (LSH) attention to reduce computational complexity. LSH attention hashes similar tokens in the input sequence to the same bucket and attends to tokens only within the same bucket. Meanwhile, a new emerging nonvolatile memory (NVM) architecture, row column NVM (RC-NVM), has been proposed to support row- and column-oriented addressing (i.e., dual addressing). In this work, we present AttentionRC, which takes advantage of RC-NVM to further improve the efficiency of LSH attention. We first propose an LSH-friendly data mapping strategy that improves memory write and read cycles by 60.9% and 4.9%, respectively. Then, we propose a sort-free RC-aware bucket access and a swap strategy that utilizes dual addressing to reduce the data access cycles in attention by 38%. Finally, by taking advantage of dual addressing, we propose transpose-free attention to eliminate the transpose operations previously required by attention, resulting in a 51% reduction in matrix multiplication time.
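The bucketing step of LSH attention can be sketched with a random-hyperplane hash: tokens whose vectors produce the same sign pattern against a set of random planes land in one bucket and attend only to each other. This is a sketch only; Reformer's actual hash uses random rotations, and AttentionRC's RC-NVM data mapping is not modeled here:

```python
import random

def lsh_buckets(vectors, num_planes=3, seed=0):
    """Group token vectors by a random-hyperplane LSH code.

    Each bucket is a list of token indices; attention would then be
    computed only among tokens sharing a bucket, avoiding the full
    quadratic token-to-token comparison.
    """
    rng = random.Random(seed)
    dim = len(vectors[0])
    planes = [[rng.gauss(0, 1) for _ in range(dim)] for _ in range(num_planes)]
    buckets = {}
    for idx, v in enumerate(vectors):
        # one sign bit per plane: which side of the hyperplane the vector lies on
        code = tuple(int(sum(p * x for p, x in zip(plane, v)) >= 0)
                     for plane in planes)
        buckets.setdefault(code, []).append(idx)
    return buckets

# identical vectors always share a bucket; opposite vectors land apart
buckets = lsh_buckets([[1.0, 0.0], [1.0, 0.0], [-1.0, 0.0]])
```

Similar queries and keys thus collide with high probability, which is what lets attention be restricted to within-bucket pairs.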
Citations: 0
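The core trick in the AttentionRC record above — LSH attention — is easy to sketch: random-hyperplane hashing sends similar token vectors to the same bucket, and attention is then computed only within each bucket. Below is a minimal single-round sketch in plain Python (all names are hypothetical; the actual Reformer uses multiple hash rounds and chunked, sorted buckets):

```python
import random

def lsh_bucket(vec, planes):
    # One hash bit per random hyperplane: the sign of the dot product.
    bits = 0
    for plane in planes:
        dot = sum(v * w for v, w in zip(vec, plane))
        bits = (bits << 1) | (1 if dot >= 0 else 0)
    return bits

def lsh_buckets(tokens, n_bits=2, seed=0):
    # Group token indices by hash bucket; attention is then restricted
    # to pairs inside a bucket, cutting the O(n^2) all-pairs cost.
    rng = random.Random(seed)
    dim = len(tokens[0])
    planes = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(n_bits)]
    buckets = {}
    for i, tok in enumerate(tokens):
        buckets.setdefault(lsh_bucket(tok, planes), []).append(i)
    return buckets
```

Identical or nearly parallel vectors always share a bucket, while opposite vectors are separated with high probability; in practice quality is recovered by hashing several times and attending over the union of matches.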
GPU Performance Optimization via Intergroup Cache Cooperation
IF 2.7 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-06 DOI: 10.1109/TCAD.2024.3443707
Guosheng Wang;Yajuan Du;Weiming Huang
Modern GPUs integrate a multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the on-chip cache falls far short of its optimal performance. In this article, we investigate the existing cache architecture and find that cache utilization is imbalanced and that there is serious data duplication among L1 cache groups. To exploit this duplicate data, we propose an intergroup cache cooperation (ICC) method that establishes cooperation across L1 cache groups. According to the cooperation scope, we design two schemes: adjacent cache cooperation (ICC-AGC) and multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to detect duplicate data and integrate a lightweight network for communication. In ICC-MGC, a ring bi-directional network is designed to connect multiple groups, and we present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency of request probing and sending. Evaluation results show that the two proposed ICC methods reduce the average traffic to the L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively, compared with existing work.
{"title":"GPU Performance Optimization via Intergroup Cache Cooperation","authors":"Guosheng Wang;Yajuan Du;Weiming Huang","doi":"10.1109/TCAD.2024.3443707","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443707","url":null,"abstract":"Modern GPUs have integrated multilevel cache hierarchy to provide high bandwidth and mitigate the memory wall problem. However, the benefit of on-chip cache is far from achieving optimal performance. In this article, we investigate existing cache architecture and find that the cache utilization is imbalanced and there exists serious data duplication among L1 cache groups. In order to exploit the duplicate data, we propose an intergroup cache cooperation (ICC) method to establish the cooperation across L1 cache groups. According to the cooperation scope, we design two schemes of the adjacent cache cooperation (ICC-AGC) and the multiple cache cooperation (ICC-MGC). In ICC-AGC, we design an adjacent cooperative directory table to realize the perception of duplicate data and integrate a lightweight network for communication. In ICC-MGC, a ring bi-directional network is designed to realize the connection among multiple groups. And we present a two-way sending mechanism and a dynamic sending mechanism to balance the overhead and efficiency involved in request probing and sending. Evaluation results show that the proposed two ICC methods can reduce the average traffic to L2 cache by 10% and 20%, respectively, and improve overall GPU performance by 19% and 49% on average, respectively, compared with the existing work.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"4142-4153"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595096","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
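The adjacent-group scheme (ICC-AGC) described above can be illustrated behaviorally: on a local L1 miss, a group first probes its neighbor's directory for a duplicate copy of the line and only falls back to L2 when that probe also misses. A toy model, not the hardware protocol — the class and field names here are invented:

```python
class CacheGroup:
    def __init__(self, name):
        self.name = name
        self.lines = {}          # tag -> data currently held in this L1 group
        self.neighbor = None     # adjacent group reachable over the light network

    def access(self, tag):
        # Hit in the local group: no extra traffic.
        if tag in self.lines:
            return ("local", self.lines[tag])
        # Probe the adjacent group's directory before going to L2; a hit
        # there turns an L2 request into a cheaper peer-to-peer transfer.
        if self.neighbor is not None and tag in self.neighbor.lines:
            data = self.neighbor.lines[tag]
            self.lines[tag] = data
            return ("peer", data)
        # Both local and neighbor miss: fill from L2 (stand-in value).
        data = f"L2[{tag:#x}]"
        self.lines[tag] = data
        return ("l2", data)
```

Counting how many accesses resolve as `"peer"` rather than `"l2"` in such a model is one way to estimate the L2-traffic reduction the abstract reports.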
CHEF: A Framework for Deploying Heterogeneous Models on Clusters With Heterogeneous FPGAs
IF 2.7 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-06 DOI: 10.1109/TCAD.2024.3438994
Yue Tang;Yukai Song;Naveena Elango;Sheena Ratnam Priya;Alex K. Jones;Jinjun Xiong;Peipei Zhou;Jingtong Hu
Deep neural networks (DNNs) are rapidly evolving from streamlined single-modality single-task (SMST) models to multimodality multitask (MMMT) models with large variations across layers and complex data dependencies among layers. To support such models, hardware systems have also evolved to be heterogeneous, following the prevailing trend of integrating diverse accelerators into the system for lower latency. FPGAs have high computation density and communication bandwidth and are configurable to be deployed with different accelerator designs, which are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging since MMMT has much larger layer variations, a massive number of layers, and complex data dependencies among different backbones. Previous mapping algorithms are either inefficient or over-simplified, which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models on realistic heterogeneous FPGA clusters, i.e., deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs onto the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerators-to-FPGAs deployment approach that co-optimizes hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which supports general and practical cases that previous mapping algorithms cannot. To the best of our knowledge, this is the first attempt to implement MMMT models on real heterogeneous FPGA clusters. Experimental results show that the latency obtained with CHEF is near-optimal while the search time is 10,000× less than exhaustively searching for the optimal solution.
{"title":"CHEF: A Framework for Deploying Heterogeneous Models on Clusters With Heterogeneous FPGAs","authors":"Yue Tang;Yukai Song;Naveena Elango;Sheena Ratnam Priya;Alex K. Jones;Jinjun Xiong;Peipei Zhou;Jingtong Hu","doi":"10.1109/TCAD.2024.3438994","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3438994","url":null,"abstract":"Deep neural networks (DNNs) are rapidly evolving from streamlined single-modality single-task (SMST) to multimodality multitask (MMMT) with large variations for different layers and complex data dependencies among layers. To support such models, hardware systems also evolved to be heterogeneous. The heterogeneous system comes from the prevailing trend to integrate diverse accelerators into the system for lower latency. FPGAs have high-computation density and communication bandwidth and are configurable to be deployed with different designs of accelerators, which are widely used for various machine-learning applications. However, scaling from SMST to MMMT on heterogeneous FPGAs is challenging since MMMT has much larger layer variations, a massive number of layers, and complex data dependency among different backbones. Previous mapping algorithms are either inefficient or over-simplified which makes them impractical in general scenarios. In this work, we propose CHEF to enable efficient implementation of MMMT models in realistic heterogeneous FPGA clusters, i.e., deploying heterogeneous accelerators on heterogeneous FPGAs (A2F) and mapping the heterogeneous DNNs on the deployed heterogeneous accelerators (M2A). We propose CHEF-A2F, a two-stage accelerators-to-FPGAs deployment approach to co-optimize hardware deployment and accelerator mapping. In addition, we propose CHEF-M2A, which can support general and practical cases compared to previous mapping algorithms. To the best of our knowledge, this is the first attempt to implement MMMT models in real heterogeneous FPGA clusters. 
Experimental results show that the latency obtained with CHEF is near-optimal while the search time is 10,000× less than exhaustively searching the optimal solution.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3937-3948"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595050","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
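The M2A half of CHEF assigns heterogeneous layers to heterogeneous accelerators. As a point of contrast with the paper's approach, here is a deliberately over-simplified greedy baseline (it ignores the inter-layer data dependencies and the A2F co-optimization that CHEF actually handles; all names and costs are invented):

```python
def greedy_map(layers, accels):
    # layers: list of (name, kind); accels: {accel_name: {kind: cost}}.
    # Greedily place each layer on the accelerator that would finish it
    # earliest, tracking per-accelerator accumulated busy time.
    busy = {a: 0.0 for a in accels}
    plan = {}
    for name, kind in layers:
        best = min(accels, key=lambda a: busy[a] + accels[a].get(kind, float("inf")))
        busy[best] += accels[best][kind]
        plan[name] = best
    return plan, max(busy.values())
```

Even this toy version shows why per-layer heterogeneity matters: conv-heavy layers gravitate to the conv-fast device and fully connected layers to the other, which is the effect a real mapper must balance against communication cost.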
NeRF-PIM: PIM Hardware-Software Co-Design of Neural Rendering Networks
IF 2.7 CAS Tier 3 (Computer Science) Q2 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Pub Date : 2024-11-06 DOI: 10.1109/TCAD.2024.3443712
Jaeyoung Heo;Sungjoo Yoo
Neural radiance field (NeRF) has emerged as a state-of-the-art technique, offering unprecedented realism in rendering. Despite these advancements, the adoption of NeRF is constrained by high computational cost, leading to slow rendering. Voxel-based optimization of NeRF addresses this by reducing the computational cost, but it introduces substantial memory overheads. To address this problem, we propose NeRF-PIM, a hardware-software co-design approach. To handle the accesses to the large voxel-grid model, which have poor locality and low compute density, we propose exploiting processing-in-memory (PIM) together with PIM-aware software optimizations in terms of data layout, redundancy removal, and computation reuse. Our PIM hardware accelerates the trilinear interpolation and dot product operations. Specifically, to address the low utilization of internal bandwidth caused by random accesses to the voxels, we propose a data layout that judiciously exploits the characteristics of the interpolation operation on the voxel grid, which removes bank conflicts in voxel accesses and also improves the efficiency of PIM command issue by exploiting the all-bank mode in the existing PIM device. As PIM-aware software optimizations, we also propose occupancy-grid-aware pruning and one-voxel two-sampling (1V2S) methods, which improve compute efficiency (by avoiding redundant computation on empty space) and reduce memory traffic (by reusing the per-voxel dot product results). We conduct experiments using an actual baseline HBM-PIM device. Our NeRF-PIM demonstrates speedups of 7.4× and 5.0× over the baseline on the two datasets, Synthetic-NeRF and Tanks and Temples, respectively.
{"title":"NeRF-PIM: PIM Hardware-Software Co-Design of Neural Rendering Networks","authors":"Jaeyoung Heo;Sungjoo Yoo","doi":"10.1109/TCAD.2024.3443712","DOIUrl":"https://doi.org/10.1109/TCAD.2024.3443712","url":null,"abstract":"Neural radiance field (NeRF) has emerged as a state-of-the-art technique, offering unprecedented realism in rendering. Despite its advancements, the adoption of NeRF is constrained by high computational cost, leading to slow rendering speed. Voxel-based optimization of NeRF addresses this by reducing the computational cost, but it introduces substantial memory overheads. To address this problem, we propose NeRF-PIM, a hardware-software co-design approach. In order to address the problem of the memory accesses to the large model (of the voxel grid) with poor locality and low compute density, we propose exploiting processing-in-memory (PIM) together with PIM-aware software optimizations in terms of the data layout, redundancy removal, and computation reuse. Our PIM hardware aims to accelerate the trilinear interpolation and dot product operations. Specifically, to address the low utilization of internal bandwidth due to the random accesses to the voxels, we propose a data layout that judiciously exploits the characteristics of the interpolation operation on the voxel grid, which helps remove bank conflicts in voxel accesses and also improves the efficiency of PIM command issue by exploiting the all-bank mode in the existing PIM device. As PIM-aware software optimizations, we also propose occupancy-grid-aware pruning and one-voxel two-sampling (1V2S) methods, which contribute to compute the efficiency improvement (by avoiding the redundant computation on the empty space) and memory traffic reduction (by reusing the per-voxel dot product results). We conduct experiments using an actual baseline HBM-PIM device. 
Our NeRF-PIM demonstrates speedups of 7.4× and 5.0× compared to the baseline on the two datasets, Synthetic-NeRF and Tanks and Temples, respectively.","PeriodicalId":13251,"journal":{"name":"IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems","volume":"43 11","pages":"3900-3912"},"PeriodicalIF":2.7,"publicationDate":"2024-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142595054","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
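The per-sample operation NeRF-PIM offloads — trilinear interpolation — is an 8-neighbor weighted sum over the voxel grid. A plain-Python reference version (assuming a dense scalar grid and coordinates strictly inside it, so all eight corner indices are valid; the paper's PIM version additionally optimizes the bank mapping of these accesses):

```python
def trilerp(grid, x, y, z):
    # grid[i][j][k] is a scalar voxel value; (x, y, z) are continuous
    # coordinates with 0 <= x < len(grid) - 1, and likewise for y and z.
    x0, y0, z0 = int(x), int(y), int(z)
    dx, dy, dz = x - x0, y - y0, z - z0
    out = 0.0
    # Weighted sum over the 8 corners of the enclosing voxel cell.
    for i, wx in ((x0, 1 - dx), (x0 + 1, dx)):
        for j, wy in ((y0, 1 - dy), (y0 + 1, dy)):
            for k, wz in ((z0, 1 - dz), (z0 + 1, dz)):
                out += wx * wy * wz * grid[i][j][k]
    return out
```

Each sample touches 8 voxels at pseudo-random positions along a ray, which is exactly the poor-locality access pattern the abstract's PIM-aware data layout is designed to tame.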
IEEE Transactions on Computer-Aided Design of Integrated Circuits and Systems