ACM Transactions on Design Automation of Electronic Systems: Latest Articles (Page 3)

CuPBoP: Making CUDA a Portable Language

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-23 DOI: 10.1145/3659949

Ruobing Han, Jun Chen, Bhanu Garg, Xule Zhou, John Lu, Jeffrey Young, Jaewoong Sim, Hyesoon Kim

CUDA is designed specifically for NVIDIA GPUs and is not compatible with non-NVIDIA devices. Enabling CUDA execution on alternative backends could greatly benefit the hardware community by fostering a more diverse software ecosystem. To address the need for portability, our objective is to develop a framework that meets key requirements, such as extensive coverage, comprehensive end-to-end support, superior performance, and hardware scalability. Existing solutions that translate CUDA source code into other high-level languages, however, fall short of these goals. In contrast to these source-to-source approaches, we present a novel framework, CuPBoP, which treats CUDA as a portable language in its own right. Compared to two commercial source-to-source solutions, CuPBoP offers broader coverage and superior performance for CUDA-to-CPU migration. Additionally, we evaluate the performance of CuPBoP against manually optimized CPU programs, highlighting the differences between CPU programs derived from CUDA and those that are manually optimized. Furthermore, we demonstrate the hardware scalability of CuPBoP by showcasing its successful migration of CUDA to AMD GPUs. To promote further research in this field, we have released CuPBoP as an open-source resource.



Citations: 0

A Scenario-Based DVFS-Aware Hybrid Application Mapping Methodology for MPSoCs

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-23 DOI: 10.1145/3660633

J. Spieck, Stefan Wildermann, Jürgen Teich

Sound techniques for mapping soft real-time applications to resources are indispensable for meeting application deadlines and minimizing objectives such as energy consumption, particularly on heterogeneous MPSoC architectures. For applications with input-dependent workload variations, static mappings cannot adequately cope with the run-time variation, which can lead to deadline misses or unnecessary energy consumption. As a remedy, hybrid application mapping (HAM) techniques combine a design-time optimization with run-time management that adapts the mappings dynamically to changes in the arriving input. This paper focuses on scenario-based HAM techniques. Here, the application input space is systematically clustered such that data inside the same scenario exhibit similar workload characteristics when processed under the same operating points. This static clustering of the input space into data scenarios has proven to be a good abstraction layer for simplifying the design and employment of high-quality run-time managers. However, existing state-of-the-art scenario-based HAM approaches neglect or underutilize the synergistic interplay between mapping selection and the usage of dynamic voltage/frequency scaling (DVFS) when adapting to workload variation. By combining mapping and DVFS selection, variations in the input can be compensated either by a complete re-mapping of the application, which may incur a high reconfiguration overhead, or by merely changing the DVFS settings of the resources, which offers a low-overhead adaptation alternative and thus significantly reduces the necessary overhead compared to DVFS-agnostic HAM. Furthermore, DVFS enables a fine-grained adaptation of a mapped application to the input data variation, e.g., by using low-frequency DVFS settings to slow down tasks that have no impact on the end-to-end latency for the current input. It is shown that this combined approach can save even more energy than a pure mapping adaptation scheme, especially in the presence of data scenarios. In particular, scenario-based design operates as a catalyst for eliciting the synergies between a combined DVFS and mapping optimization and the peculiarities inside a data scenario, i.e., exploiting the commonalities inside a data scenario through perfectly tailored DVFS settings and task mapping. In this scope, this paper proposes two supplementary scenario-based DVFS-aware HAM approaches that consistently outperform existing state-of-the-art mapping approaches in terms of the number of deadline misses and energy consumption, as we demonstrate in an empirical study based on four different applications and three different architectures. It is also shown that these benefits still apply to target architectures with increasing mapping migration overheads, which thwart frequent mapping reconfigurations.
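
The choice between a full re-mapping and a DVFS-only adaptation can be illustrated with a minimal Python sketch. This is not the authors' method: the `OperatingPoint` fields, the overhead constants, and the selection rule (lowest total energy among points that still meet the deadline) are all hypothetical, chosen only to show why a cheap DVFS change can win over a remap with slightly lower steady-state energy.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class OperatingPoint:
    mapping: str       # hypothetical mapping identifier
    dvfs_level: str    # hypothetical DVFS setting identifier
    latency_ms: float  # predicted latency for the current data scenario
    energy_mj: float   # predicted energy for the current data scenario


# Hypothetical switching overheads as (extra latency in ms, extra energy in mJ).
REMAP_OVERHEAD = (8.0, 5.0)   # task migration: expensive
DVFS_OVERHEAD = (0.5, 0.2)    # frequency/voltage change only: cheap


def switch_cost(current, candidate):
    """Overhead of moving from the current operating point to a candidate."""
    if candidate.mapping != current.mapping:
        return REMAP_OVERHEAD
    if candidate.dvfs_level != current.dvfs_level:
        return DVFS_OVERHEAD
    return (0.0, 0.0)


def select_point(current, candidates, deadline_ms):
    """Return the deadline-feasible candidate with the lowest total energy,
    counting the switching overhead, or None if nothing is feasible."""
    best, best_energy = None, float("inf")
    for cand in candidates:
        extra_latency, extra_energy = switch_cost(current, cand)
        if cand.latency_ms + extra_latency > deadline_ms:
            continue  # switching to this point would miss the deadline
        total_energy = cand.energy_mj + extra_energy
        if total_energy < best_energy:
            best, best_energy = cand, total_energy
    return best


current = OperatingPoint("A", "high", 30.0, 50.0)
candidates = [
    current,
    OperatingPoint("A", "low", 38.0, 35.0),  # same mapping, lower frequency
    OperatingPoint("B", "low", 33.0, 32.0),  # different mapping
]
choice = select_point(current, candidates, deadline_ms=40.0)
print(choice.mapping, choice.dvfs_level)
```

In this made-up instance the remap to mapping B has the lowest steady-state energy, but its migration overhead pushes it past the 40 ms deadline, so the DVFS-only change is selected.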



Citations: 0

Enhanced Compiler Technology for Software-based Hardware Fault Detection

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-22 DOI: 10.1145/3660524

Davide Baroffio, Federico Reghenzani, William Fornaciari

Software-Implemented Hardware Fault Tolerance (SIHFT) is a modern approach for tackling random hardware faults in dependable systems using software-only solutions. This work extends ASPIS, an automatic compiler-based SIHFT hardening tool, with novel protection mechanisms and overhead-reduction techniques, and provides an extensive analysis of its compliance with the non-trivial workload of the open-source real-time operating system FreeRTOS. A thorough experimental fault-injection campaign on an STM32 board shows that the system achieves remarkably high tolerance to single-event upsets, and a comparison of the implemented SIHFT mechanisms summarises the trade-off between the overhead introduced and the detection capabilities of the various solutions.


Citations: 0

Load Balanced PIM-Based Graph Processing

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-18 DOI: 10.1145/3659951

Xiang Zhao, Song Chen, Yi Kang

Graph processing is widely used for many modern applications, such as social networks, recommendation systems, and knowledge graphs. However, processing large-scale graphs on traditional Von Neumann architectures is challenging due to the irregular graph data and memory-bound graph algorithms. Processing-in-memory (PIM) architecture has emerged as a promising approach for accelerating graph processing by enabling computation to be performed directly on memory. Despite having many processing units and high local memory bandwidth, PIM often suffers from insufficient global communication bandwidth and high synchronization overhead due to load imbalance.

This paper proposes GraphB, a novel PIM-based graph processing system, to address these issues. From the algorithm perspective, we propose a degree-aware graph partitioning algorithm that can generate balanced partitioning at a low cost. From the architecture perspective, we introduce tile buffers incorporated with an on-chip 2D mesh, which provides high bandwidth for inter-node data transfer. Dataflow in GraphB is designed to enable computation-communication overlap and dynamic load balancing. In a PyMTL3-based cycle-accurate simulator with five real-world graphs and three common algorithms, GraphB achieves an average speedup of 2.2× and a maximum speedup of 2.8× over the state-of-the-art PIM-based graph processing system GraphQ.
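
To make the idea of degree-aware balancing concrete, here is a small Python sketch. The abstract does not describe GraphB's actual partitioning algorithm, so this swaps in a generic greedy heuristic: vertices are visited in order of decreasing degree and each is assigned to the partition with the smallest accumulated degree. The function name `degree_aware_partition` and the use of vertex degree as the load estimate are assumptions made for illustration.

```python
import heapq


def degree_aware_partition(degrees, num_parts):
    """Greedy balancing sketch (not GraphB's algorithm).

    degrees: dict mapping integer vertex id -> degree.
    Returns (assignment, loads), where assignment maps vertex -> partition
    and loads[p] is the total degree assigned to partition p.
    """
    # Min-heap of (accumulated degree, partition id).
    heap = [(0, p) for p in range(num_parts)]
    heapq.heapify(heap)
    assignment = {}
    # High-degree vertices first, ties broken by vertex id for determinism.
    for v in sorted(degrees, key=lambda v: (-degrees[v], v)):
        load, p = heapq.heappop(heap)       # currently lightest partition
        assignment[v] = p
        heapq.heappush(heap, (load + degrees[v], p))
    loads = [0] * num_parts
    for v, p in assignment.items():
        loads[p] += degrees[v]
    return assignment, loads


degrees = {0: 10, 1: 1, 2: 1, 3: 8, 4: 2, 5: 2}
assignment, loads = degree_aware_partition(degrees, num_parts=2)
print(loads)
```

Placing the heaviest vertices first keeps the few high-degree vertices of a skewed graph from landing on the same processing unit, which is the kind of imbalance that would otherwise inflate synchronization time.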



Citations: 0

Wages: The Worst Transistor Aging Analysis for Large-scale Analog Integrated Circuits via Domain Generalization

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-17 DOI: 10.1145/3659950

Tinghuan Chen, Hao Geng, Qi Sun, Sanping Wan, Yongsheng Sun, Huatao Yu, Bei Yu

Transistor aging degrades analog circuit performance over time. The worst aging degradation is used to evaluate circuit reliability, but it is extremely expensive to obtain because several circuit stimuli need to be simulated. Reducing the cost of collecting the worst degradation yields an inaccurate training dataset when a machine learning (ML) model is used to perform the estimation quickly. Motivated by the fact that large-scale analog circuits contain many similar subcircuits, we propose Wages, which trains an ML model on an inaccurate dataset for worst aging degradation estimation via a domain generalization technique. A sampling-based method on the feature space of the transistor and its neighborhood subcircuit is developed to replace inaccurate labels. A consistent estimation of the worst degradation is enforced to update model parameters. Label updating and model updating are performed alternately to train the ML model on the inaccurate dataset. Experimental results on an advanced 5 nm technology node show that Wages significantly reduces the label collection cost, with a negligible estimation error for the worst aging degradations, compared to traditional methods.



Citations: 0

Capacity-Aware Wash Optimization with Dynamic Fluid Scheduling and Channel Storage for Continuous-Flow Microfluidic Biochips

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-17 DOI: 10.1145/3659952

Zhisheng Chen, Xu Hu, Wenzhong Guo, Genggeng Liu, Jiaxuan Wang, Tsung-Yi Ho, Xing Huang

Continuous-flow microfluidic biochips are gaining increasing attention, with promising applications in automatically executing various laboratory procedures in biology and biochemistry. Biochips with distributed channel-storage architectures enable each channel to switch between the roles of transportation and storage. Consequently, fluid transportation, caching, and fetching can occur concurrently through different flow paths. When two dissimilar types of fluid flow through the same channels in a time-interleaved manner, the latter may be contaminated, because residues of the former flow may stick to the channel wall during transportation. To remove the residues, wash operations are introduced as an essential step to avoid incorrect assay outcomes. However, existing work has assumed that the washing capacity of a buffer fluid is unlimited. In practice, a fixed-volume buffer fluid has a limited washing capacity, which is successively consumed while washing away residues from the channels. Hence, a capacity-aware wash scheme is a basic requirement for supporting dynamic fluid scheduling and channel storage. In this paper, we formulate a practical wash optimization problem for microfluidic biochips, which simultaneously considers the requirements of dynamic fluid scheduling and channel storage as well as the washing capacity constraints of buffer fluids, and present an efficient design flow to solve this problem systematically. Given the high-level synthesis result of a biochemical application and the corresponding component placement solution, our goal is to complete a contamination-aware flow-path planning with short flow-channel length. Meanwhile, the biochemical application can be executed efficiently and correctly with an optimized capacity-aware wash scheme. Experimental results show that, compared to a state-of-the-art washing method, the proposed method achieves average reductions of 26.1%, 43.1%, and 34.1% across all benchmarks in total channel length, total wash time, and execution time of bioassays, respectively.
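
The effect of a limited washing capacity can be shown with a short Python sketch. It is only an illustration under stated assumptions, not the design flow proposed in the paper: each contaminated channel segment is given a hypothetical "residue demand", each fixed-volume buffer plug has a hypothetical capacity in the same unit, segments are washed in path order, and one segment must be cleaned entirely by a single plug.

```python
def plan_wash(residue_demands, plug_capacity):
    """Group contaminated segments, in path order, into buffer-fluid plugs.

    residue_demands: capacity consumed by washing each segment (hypothetical units).
    plug_capacity: washing capacity of one fixed-volume buffer plug.
    Returns a list of plugs; each plug is the list of segment indices it washes.
    """
    if plug_capacity <= 0:
        raise ValueError("plug_capacity must be positive")
    plugs, current, remaining = [], [], plug_capacity
    for idx, demand in enumerate(residue_demands):
        if demand > plug_capacity:
            raise ValueError("segment needs more capacity than one plug provides")
        if demand > remaining:
            # The current plug is exhausted: inject a fresh one.
            plugs.append(current)
            current, remaining = [], plug_capacity
        current.append(idx)
        remaining -= demand
    if current:
        plugs.append(current)
    return plugs


demands = [3, 4, 2, 5, 1]
limited = plan_wash(demands, plug_capacity=6)
unlimited = plan_wash(demands, plug_capacity=100)
print(len(limited), len(unlimited))
```

With the capacity treated as effectively unlimited, a single plug appears to clean the whole path; once the capacity is bounded, the same path needs several plugs, which is why a wash scheme that ignores capacity can underestimate wash time or leave residues behind.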



Citations: 0

Enhancing Lifetime and Performance of MLC NVM Caches using Embedded Trace buffers

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-16 DOI: 10.1145/3659102

S. Sivakumar, John Jose, Vijaykrishnan Narayanan

Contemporary applications require large volumes of on-chip and off-chip memory. Emerging non-volatile memory technologies, including STT-RAM, PCM, and ReRAM, are becoming popular for on-chip and off-chip memories because of their desirable properties. Compared to traditional memory technologies like SRAM and DRAM, they have minimal leakage current and high packing density. Non-volatile memories (NVMs), however, have low write endurance, high write latency, and high write energy. Non-volatile Single Level Cell (SLC) memories store a single bit of data in each memory cell, whereas Multi Level Cells (MLC) store two or more bits in each memory cell. Although MLC NVMs have substantially higher packing density than SLCs, their lifetime and access speed are key concerns. For a given cache size, MLC caches consume 1.84x less space and 2.62x less leakage power than SLC caches. We propose Trace buffer Assisted Non-volatile Memory Cache (TANC), an approach that increases the lifespan and performance of MLC-based last-level caches (LLCs) using the underutilised Embedded Trace Buffers (ETB). TANC improves the lifetime of MLC LLCs by up to 4.36x and decreases average memory access time by 4% compared to SLC NVM LLCs, and by 6.41x and 11%, respectively, compared to baseline MLC LLCs.



Citations: 0

Modeling Retention Errors of 3D NAND Flash for Optimizing Data Placement

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-16 DOI: 10.1145/3659101

Huanhuan Tian, Jiewen Tang, Jun Li, Zhibing Sha, Fan Yang, Zhigang Cai, Jianwei Liao

3D NAND flash exhibits process variation (PV), which causes raw bit error rates (RBER) to differ among the layers of a flash block. This paper builds a mathematical model for estimating the retention errors of flash cells that accounts for layer-to-layer PV in 3D NAND flash memory, as well as the program/erase (P/E) cycle count and the retention time of data. It then classifies the layers of a flash block into profitable and unprofitable categories according to their error correction overhead. Building on this understanding of how retention errors vary across layers, we design a data placement mechanism that maps write data onto a suitable layer of the flash block, according to data hotness and the error correction overhead of the layers, to boost the read performance of 3D NAND flash. The experimental results show that the proposed retention error estimation model yields an R² value of 0.966 on average, supporting the accuracy of the model. Based on the estimated retention error rates of the layers, the proposed data placement mechanism reduces read latency by 29.8% on average compared with state-of-the-art methods against retention errors for 3D NAND flash memory.
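The abstract gives no formulas or interfaces, so the following is only a rough sketch of the two ideas it names: scoring a fitted model with R² and routing data to layers by hotness. The `Layer` type, the single `rber_threshold` cut-off, and the rule that hot data goes to low-error ("profitable") layers are all assumptions made for illustration, not the authors' method.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Layer:
    index: int        # position of the layer inside the flash block
    est_rber: float   # estimated retention error rate (hypothetical value)


def r_squared(observed, predicted):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((o - p) ** 2 for o, p in zip(observed, predicted))
    ss_tot = sum((o - mean_obs) ** 2 for o in observed)
    return 1.0 - ss_res / ss_tot


def classify_layers(layers, rber_threshold):
    """Split layers into (profitable, unprofitable).

    Assumption: a layer counts as profitable when its estimated error
    rate, used here as a stand-in for error correction overhead, does
    not exceed the threshold.
    """
    profitable = [l for l in layers if l.est_rber <= rber_threshold]
    unprofitable = [l for l in layers if l.est_rber > rber_threshold]
    return profitable, unprofitable


def choose_layer(is_hot, profitable, unprofitable):
    """Pick a target layer for a write.

    Assumption: hot data prefers profitable layers and cold data takes
    the unprofitable ones, falling back to the other group if needed.
    """
    preferred, fallback = (
        (profitable, unprofitable) if is_hot else (unprofitable, profitable)
    )
    pool = preferred or fallback
    if not pool:
        raise ValueError("no layers available")
    return min(pool, key=lambda layer: layer.est_rber)
```

With three layers whose estimated rates are 1e-4, 5e-3, and 2e-4 and a threshold of 1e-3, this sketch sends hot writes to layer 0 and cold writes to layer 1. A real design would also need to track free space per layer, which is omitted here.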



Citations: 0

WCPNet: Jointly Predicting Wirelength, Congestion and Power for FPGA Using Multi-task Learning

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-08 DOI: 10.1145/3656170

Juming Xian, Yan Xing, Shuting Cai, Weijun Li, Xiaoming Xiong, Zhengfa Hu

To speed up design closure and improve the QoR of FPGAs, supervised single-task machine learning techniques have been used to predict individual design metrics based on placement results. However, the design objective is to achieve optimal performance while considering multiple conflicting metrics. Single-task approaches predict each metric in isolation and neglect the potential correlations or dependencies among them. To address these limitations, this paper proposes a multi-task learning approach to jointly predict wirelength, congestion and power. By sharing common feature representations and adopting a joint optimization strategy, the novel WCPNet models (WCPNet-HS and WCPNet-SS) not only predict the three metrics of different scales simultaneously, but also outperform the majority of single-task models in both prediction performance and time cost, as demonstrated by the results of the cross-design experiment. By adopting the cross-stitch structure in the encoder, WCPNet-SS outperforms WCPNet-HS in prediction performance, but WCPNet-HS is faster because of its simpler parameter-sharing structure. The ablation experiment demonstrates the significance of the feature image_{pinUtilization} for predicting power and wirelength.
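To make the "cross-stitch structure" and "joint optimization" terms concrete, here is a small sketch of a generic cross-stitch unit from the multi-task learning literature, plus a weighted joint loss. It is not WCPNet's architecture: the three-task setup, the mixing matrix `alpha`, and the loss weights are hypothetical, and in a real network `alpha` would be learned rather than fixed.

```python
def cross_stitch(features, alpha):
    """Mix per-task feature vectors with a T x T weight matrix.

    features: list of T equal-length vectors, one per task
              (e.g. wirelength, congestion, power).
    alpha:    T x T matrix; alpha[i][j] is how much task i's output
              draws from task j's features.
    """
    n_tasks = len(features)
    dim = len(features[0])
    return [
        [
            sum(alpha[i][j] * features[j][k] for j in range(n_tasks))
            for k in range(dim)
        ]
        for i in range(n_tasks)
    ]


def joint_loss(task_losses, weights):
    """Weighted sum of per-task losses, optimized as one objective."""
    return sum(w * loss for w, loss in zip(weights, task_losses))
```

An identity `alpha` keeps the tasks fully separate, while off-diagonal entries let one task borrow features from another. Hard parameter sharing, by contrast, simply reuses one encoder for all tasks, which is cheaper to evaluate.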



Citations: 0

ARM-CO-UP: ARM COoperative Utilization of Processors

IF 1.4 Zone 4 (Computer Science) Q2 Computer Science

ACM Transactions on Design Automation of Electronic Systems

Pub Date : 2024-04-08 DOI: 10.1145/3656472

Ehsan Aghapour, Dolly Sapra, Andy Pimentel, Anuj Pathania

HMPSoCs combine different processors on a single chip. They enable powerful embedded devices, which increasingly perform ML inference tasks at the edge. State-of-the-art HMPSoCs can perform on-chip embedded inference using different processors, such as CPUs, GPUs, and NPUs. HMPSoCs can potentially overcome the limitation of low single-processor CNN inference performance and efficiency by cooperative use of multiple processors. However, standard inference frameworks for edge devices typically utilize only a single processor.

We present the ARM-CO-UP framework built on the ARM-CL library. The ARM-CO-UP framework supports two modes of operation – Pipeline and Switch. In the Pipeline mode, it optimizes inference throughput using pipelined execution of network partitions for consecutive input frames. In the Switch mode, it improves inference latency through layer-switched inference for a single input frame. Furthermore, it supports layer-wise CPU/GPU DVFS in both modes to improve power efficiency and reduce energy consumption. ARM-CO-UP is a comprehensive framework for multi-processor CNN inference that automates CNN partitioning and mapping, pipeline synchronization, processor type switching, layer-wise DVFS, and closed-source NPU integration.
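The difference between the two modes can be illustrated with a toy cost model. This is not the ARM-CO-UP API: the function names, the per-layer timing tables, and the fixed `switch_cost_ms` penalty are assumptions, and data transfer between pipeline stages is ignored.

```python
def pipeline_throughput(stage_times_ms):
    """Steady-state frames per second of a pipeline.

    Each stage is one network partition on its own processor; once the
    pipeline is full, the slowest stage bounds the frame rate.
    """
    return 1000.0 / max(stage_times_ms)


def switch_latency(layer_times_ms, mapping, switch_cost_ms=0.0):
    """Latency of one frame when each layer runs on a chosen processor.

    layer_times_ms: one dict per layer, processor name -> time in ms.
    mapping:        processor chosen for each layer, in order.
    switch_cost_ms: hypothetical penalty added whenever consecutive
                    layers run on different processors.
    """
    total = 0.0
    for i, proc in enumerate(mapping):
        total += layer_times_ms[i][proc]
        if i > 0 and proc != mapping[i - 1]:
            total += switch_cost_ms
    return total
```

In this model, three stages taking 10, 20, and 5 ms sustain 50 frames per second, whereas a two-layer network mapped as GPU then CPU pays both layer times plus one switch penalty. The sketch shows why Pipeline mode targets throughput over many frames and Switch mode targets the latency of a single frame.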



Citations: 0