
Latest Articles in ACM Trans. Design Autom. Electr. Syst.

High-Level Synthesis Implementation of an Embedded Real-Time HEVC Intra Encoder on FPGA for Media Applications
Pub Date : 2022-07-31 DOI: 10.1145/3491215
Panu Sjövall, Ari Lemmetti, Jarno Vanne, Sakari Lahti, Timo D. Hämäläinen
High Efficiency Video Coding (HEVC) is the key enabling technology for numerous modern media applications. Overcoming its computational complexity and customizing its rich features for real-time HEVC encoder implementations calls for automated design methodologies. This article introduces the first complete High-Level Synthesis (HLS) implementation of an HEVC intra encoder on FPGA. The C source code of our open-source Kvazaar HEVC encoder is used as a design entry point for HLS, which is applied throughout the whole encoder design process, from data-intensive coding tools like intra prediction and discrete transforms to more control-oriented tools such as context-adaptive binary arithmetic coding (CABAC). Our prototype runs on a Nokia AirFrame Cloud Server equipped with 2.4 GHz dual 14-core Intel Xeon processors and two Intel Arria 10 PCIe FPGA accelerator cards with 40 Gigabit Ethernet. This proof-of-concept system is designed for hardware-accelerated HEVC encoding and achieves real-time 4K coding at up to 120 fps. The coding performance can easily be scaled up by adding practically any number of network-connected FPGA cards to the system. These results indicate that our HLS proposal not only shortens development time but also provides previously unseen design scalability with performance competitive with existing FPGA and ASIC encoder implementations.
Citations: 4
Achieving High In Situ Training Accuracy and Energy Efficiency with Analog Non-Volatile Synaptic Devices
Pub Date : 2022-07-31 DOI: 10.1145/3500929
Shanshi Huang, Xiaoyu Sun, Xiaochen Peng, Hongwu Jiang, Shimeng Yu
On-device embedded artificial intelligence prefers the adaptive learning capability when deployed in the field, and thus in situ training is required. The compute-in-memory approach, which exploits the analog computation within the memory array, is a promising solution for deep neural network (DNN) on-chip acceleration. Emerging non-volatile memories are of great interest, serving as analog synapses due to their multilevel programmability. However, the asymmetry and nonlinearity in the conductance tuning remain grand challenges for achieving high in situ training accuracy. In addition, analog-to-digital converters at the edge of the memory array introduce quantization errors. In this work, we present an algorithm-hardware co-optimization to overcome these challenges. We incorporate the device/circuit non-ideal effects into the DNN propagation and weight update steps. By introducing the adaptive “momentum” in the weight update rule, in situ training accuracy on CIFAR-10 could approach its software baseline even under severe asymmetry/nonlinearity and analog-to-digital converter quantization error. The hardware performance of the on-chip training architecture and the overhead for adding “momentum” are also evaluated. By optimizing the backpropagation dataflow, 23.59 TOPS/W training energy efficiency (12× improvement compared to naïve dataflow) is achieved. The circuits that handle “momentum” introduce only 4.2% energy overhead. Our results show great potential and more relaxed requirements that enable emerging non-volatile memories for DNN acceleration on the embedded artificial intelligence platforms.
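To make the role of the "momentum" term concrete, the sketch below shows a momentum-style weight update applied on top of coarse, asymmetric conductance steps, as one might model an analog non-volatile synapse in Python. The step sizes, momentum factor, and device model are illustrative assumptions, not the article's exact update rule or device characteristics.

```python
import numpy as np

def device_step(weight, target_delta, up_step=0.01, down_step=0.02):
    """Quantize the requested weight change to asymmetric device step sizes."""
    if target_delta >= 0:
        return weight + up_step * np.floor(target_delta / up_step)
    return weight - down_step * np.floor(-target_delta / down_step)

def momentum_update(weight, velocity, grad, lr=0.1, beta=0.9):
    """Accumulate small gradients in 'velocity' so they are not lost to the
    coarse device step; program the device from the accumulated value."""
    velocity = beta * velocity + grad
    new_weight = device_step(weight, -lr * velocity)
    return new_weight, velocity

w, v = 0.5, 0.0
for g in [0.05, 0.08, 0.06, 0.07]:  # each single update would round to zero on the device
    w, v = momentum_update(w, v, g)
    print(round(float(w), 4), round(v, 4))
```

In this toy run, each individual gradient step is smaller than one conductance step and would be lost without the accumulator; after a few iterations the accumulated value exceeds one device step and the weight finally moves.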
Citations: 0
An Efficient Execution Framework of Two-Part Execution Scenario Analysis
Pub Date : 2022-01-31 DOI: 10.1145/3465474
Ding Han, Guohui Li, Quan Zhou, Jianjun Li, Yong Yang, Xiaofei Hu
Response Time Analysis (RTA) is an important and promising technique for analyzing the schedulability of real-time tasks under both Global Fixed-Priority (G-FP) scheduling and Global Earliest Deadline First (G-EDF) scheduling. Most existing RTA methods for tasks under global scheduling are dominated by partitioned scheduling, due to the pessimism of the m-based interference calculation, where m is the number of processors. Two-part execution scenario is an effective technique that addresses this pessimism at the cost of efficiency. The major idea of two-part execution scenario is to calculate a more accurate upper bound on the interference by dividing the execution of the target job into two parts and calculating the interference on the target job in each part. This article proposes a novel RTA execution framework that improves two-part execution scenario analysis by eliminating unnecessary calculations, without sacrificing the accuracy of the schedulability test. The key observation is that, after the division of the execution of the target job, two-part execution scenario enumerates all possible execution times of the target job in the first part to calculate the final Worst-Case Response Time (WCRT), yet only a few special execution times can produce the final result. A set of experiments is conducted to test the performance of the proposed execution framework, and the results show that it can improve the efficiency of two-part execution scenario analysis in terms of execution time.
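For context, the sketch below shows the generic fixed-point response-time iteration that such analyses refine for global fixed-priority scheduling, with the interference of higher-priority tasks divided among the m processors. The workload bound and the task parameters are illustrative; this is not the two-part execution scenario analysis itself.

```python
import math

def workload_bound(C, T, window):
    """Coarse upper bound on a higher-priority task's execution inside 'window'."""
    return (math.floor(window / T) + 1) * C

def response_time(task, higher_prio, m, max_iter=1000):
    C, D = task["C"], task["D"]
    R = C
    for _ in range(max_iter):
        interference = sum(workload_bound(hp["C"], hp["T"], R) for hp in higher_prio)
        R_next = C + math.floor(interference / m)   # interference shared by m processors
        if R_next == R:
            return R                                # fixed point reached
        if R_next > D:
            return None                             # deemed unschedulable by this test
        R = R_next
    return None

tasks = [{"C": 2, "T": 10, "D": 10}, {"C": 3, "T": 15, "D": 15}, {"C": 4, "T": 20, "D": 20}]
print(response_time(tasks[2], tasks[:2], m=2))      # WCRT bound for the lowest-priority task
```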
Citations: 0
Improving LDPC Decoding Performance for 3D TLC NAND Flash by LLR Optimization Scheme for Hard and Soft Decision
Pub Date : 2022-01-31 DOI: 10.1145/3473305
Lanlan Cui, Fei Wu, Xiaojian Liu, Meng Zhang, Renzhi Xiao, Changsheng Xie
Low-density parity-check (LDPC) codes have been widely adopted in NAND flash in recent years to enhance data reliability. There are two types of decoding: hard-decision and soft-decision decoding. For both types, the error correction capability degrades due to inaccurate log-likelihood ratios (LLRs). To improve the LLR accuracy of LDPC decoding, this article proposes LLR optimization schemes that can be utilized for both hard-decision and soft-decision decoding. First, we build a threshold voltage distribution model for 3D floating gate (FG) triple-level cell (TLC) NAND flash. Then, by exploiting the model, we introduce a scheme to quantize LLRs during hard-decision and soft-decision decoding. By amplifying a portion of the small LLRs, which is essential in the layered min-sum decoder, more precise LLRs can be obtained. For hard-decision decoding, the proposed new modes can significantly improve the decoder's error correction capability compared with traditional solutions. Soft-decision decoding starts when hard-decision decoding fails. For this part, we study the influence of the reference voltage arrangement on the LLR calculation and apply the quantization scheme. Simulations show that the proposed approach can reduce the frame error rate (FER) by several orders of magnitude.
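As a point of reference, the following sketch shows how an LLR can be computed from a Gaussian threshold-voltage model and then quantized to a small number of soft values. The voltage distributions, the quantizer, and the bit-to-state mapping are illustrative assumptions rather than the model or scheme proposed in the article.

```python
import math

MU_ER, SIG_ER = 1.0, 0.12   # assumed Vth distribution of erased cells (storing bit '1')
MU_PR, SIG_PR = 2.0, 0.15   # assumed Vth distribution of programmed cells (storing bit '0')

def gaussian_pdf(x, mu, sigma):
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def llr(read_voltage):
    """LLR = ln[ P(bit = 0 | v) / P(bit = 1 | v) ], assuming equal priors."""
    return math.log(gaussian_pdf(read_voltage, MU_PR, SIG_PR) /
                    gaussian_pdf(read_voltage, MU_ER, SIG_ER))

def quantize_llr(value, step=2.0, levels=7):
    """Uniform quantizer clipped to 'levels' soft values around zero."""
    bound = (levels - 1) // 2
    q = max(-bound, min(bound, round(value / step)))
    return q * step

for v in (1.2, 1.45, 1.55, 1.8):
    print(v, round(llr(v), 2), quantize_llr(llr(v)))
```

Reads far from the crossover point map to saturated soft values, while reads near it yield small LLRs; how finely those small LLRs are represented is exactly what an LLR quantization scheme controls.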
Citations: 2
A Comprehensive Survey of Attacks without Physical Access Targeting Hardware Vulnerabilities in IoT/IIoT Devices, and Their Detection Mechanisms
Pub Date : 2022-01-31 DOI: 10.1145/3471936
Nikolaos Foivos Polychronou, Pierre-Henri Thevenon, Maxime Puys, V. Beroulle
With the advances in the field of the Internet of Things (IoT) and Industrial IoT (IIoT), these devices are increasingly used in daily life or industry. To reduce costs related to the time required to develop these devices, security features are usually not considered. This situation creates a major security concern. Many solutions have been proposed to protect IoT/IIoT against various attacks, most of which are based on attacks involving physical access. However, a new class of attacks has emerged targeting hardware vulnerabilities in the micro-architecture that do not require physical access. We present attacks based on micro-architectural hardware vulnerabilities and the side effects they produce in the system. In addition, we present security mechanisms that can be implemented to address some of these attacks. Most of the security mechanisms target a small set of attack vectors or a single specific attack vector. As many attack vectors exist, solutions must be found to protect against a wide variety of threats. This survey aims to inform designers about the side effects related to attacks and detection mechanisms that have been described in the literature. For this purpose, we present two tables listing and classifying the side effects and detection mechanisms based on the given criteria.
Citations: 11
Demand-Driven Multi-Target Sample Preparation on Resource-Constrained Digital Microfluidic Biochips
Pub Date : 2022-01-31 DOI: 10.1145/3474392
Sudip Poddar, Sukanta Bhattacharjee, Shao-Yun Fang, Tsung-Yi Ho, B. B. Bhattacharya
Microfluidic lab-on-chips offer promising technology for the automation of various biochemical laboratory protocols on a minuscule chip. Sample preparation (SP) is an essential part of any biochemical experiments, which aims to produce dilution of a sample or a mixture of multiple reagents in a certain ratio. One major objective in this area is to prepare dilutions of a given fluid with different concentration factors, each with certain volume, which is referred to as the demand-driven multiple-target (DDMT) generation problem. SP with microfluidic biochips requires proper sequencing of mix-split steps on fluid volumes and needs storage units to save intermediate fluids while producing the desired target ratio. The performance of SP depends on the underlying mixing algorithm and the availability of on-chip storage, and the latter is often limited by the constraints imposed during physical design. Since DDMT involves several target ratios, solving it under storage constraints becomes even harder. Furthermore, reduction of mix-split steps is desirable from the viewpoint of accuracy of SP, as every such step is a potential source of volumetric split error. In this article, we propose a storage-aware DDMT algorithm that reduces the number of mix-split operations on a digital microfluidic lab-on-chip. We also present the layout of the biochip with -storage cells and their allocation technique for . Simulation results reveal the superiority of the proposed method compared to the state-of-the-art multi-target SP algorithms.
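For background, here is a minimal sketch of the classic (1:1) mix-split dilution idea that such sample-preparation algorithms build on: a target concentration k/2^d is reached by a sequence of 1:1 mix-split operations driven by the bits of k. This is the textbook two-way-mix scheme under illustrative parameters, not the storage-aware DDMT algorithm proposed in the article.

```python
def twoway_mix_sequence(k, d):
    """Return the (fluid added, resulting concentration) sequence that produces
    concentration k / 2**d via 1:1 mixes, processing the d bits of k LSB-first."""
    assert 0 < k < 2 ** d
    bits = [(k >> i) & 1 for i in range(d)]     # least significant bit first
    steps, conc = [], 0.0
    for b in bits:
        fluid = "sample" if b else "buffer"     # mix the current droplet 1:1 with this fluid
        conc = (conc + (1.0 if b else 0.0)) / 2
        steps.append((fluid, conc))
    return steps

for fluid, conc in twoway_mix_sequence(k=11, d=4):  # target 11/16 = 0.6875
    print(fluid, conc)
```

Each 1:1 mix halves the current concentration and adds either sample or buffer, so after d steps the droplet holds exactly k/2^d; multi-target methods such as DDMT then reorder and share these intermediate droplets under storage constraints.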
Citations: 1
A Novel Hybrid Cache Coherence with Global Snooping for Many-core Architectures
Pub Date : 2022-01-31 DOI: 10.1145/3462775
G. Harsha, Sujay Deb
Cache coherence ensures correctness of cached data in multi-core processors. Traditional implementations of existing protocols make them unscalable for many core architectures. While snoopy coherence requires unscalable ordered networks, directory coherence is weighed down by high area and energy overheads. In this work, we propose Wireless-enabled Share-aware Hybrid (WiSH) to provide scalable coherence in many core processors. WiSH implements a novel Snoopy over Directory protocol using on-chip wireless links and hierarchical, clustered Network-on-Chip to achieve low-overhead and highly efficient coherence. A local directory protocol maintains coherence within a cluster of cores, while coherence among such clusters is achieved through global snoopy protocol. The ordered network for global snooping is provided through low-latency and low-energy broadcast wireless links. The overheads are further reduced through share-aware cache segmentation to eliminate coherence for private blocks. Evaluations show that WiSH reduces traffic by and runtime by , while requiring smaller storage and lower energy as compared to existing hierarchical and hybrid coherence protocols. Owing to its modularity, WiSH provides highly efficient and scalable coherence for many core processors.
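To illustrate the division of labour between the local directory and the global snoop described above, here is a small behavioural sketch. The cluster structure, data layout, and request handling are illustrative assumptions; this is not WiSH's actual protocol, its state machine, or its wireless ordered broadcast.

```python
class Cluster:
    def __init__(self, cid):
        self.cid = cid
        self.directory = {}          # block address -> set of local sharer core ids

    def local_sharers(self, addr):
        return self.directory.get(addr, set())

def read_miss(addr, requester_cluster, clusters):
    # 1. The local directory resolves the miss inside the cluster when possible.
    sharers = requester_cluster.local_sharers(addr)
    if sharers:
        return f"served by local sharer {min(sharers)} in cluster {requester_cluster.cid}"
    # 2. Otherwise a global snoop is broadcast; every other cluster checks its directory.
    for c in clusters:
        if c is not requester_cluster and c.local_sharers(addr):
            return f"served by remote cluster {c.cid} after global snoop"
    return "served from memory"

clusters = [Cluster(0), Cluster(1)]
clusters[1].directory[0x40] = {5}
print(read_miss(0x40, clusters[0], clusters))   # -> served by remote cluster 1 after global snoop
```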
Citations: 2
Energy Efficient Error Resilient Multiplier Using Low-power Compressors
Pub Date : 2022-01-01 DOI: 10.1145/3488837
S. Deepsita, M. Dhayalakumar, S. Mahammad
{"title":"Energy Efficient Error Resilient Multiplier Using Low-power Compressors","authors":"S. Deepsita, M. Dhayalakumar, S. Mahammad","doi":"10.1145/3488837","DOIUrl":"https://doi.org/10.1145/3488837","url":null,"abstract":"","PeriodicalId":7063,"journal":{"name":"ACM Trans. Design Autom. Electr. Syst.","volume":"117 1","pages":"21:1-21:26"},"PeriodicalIF":0.0,"publicationDate":"2022-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79308250","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
An Energy-Efficient Inference Method in Convolutional Neural Networks Based on Dynamic Adjustment of the Pruning Level
Pub Date : 2021-08-01 DOI: 10.1145/3460972
M. A. Maleki, Alireza Nabipour-Meybodi, M. Kamal, A. Afzali-Kusha, M. Pedram
In this article, we present a low-energy inference method for convolutional neural networks in image classification applications. The lower energy consumption is achieved by using a highly pruned (lower-energy) network whenever that network can provide a correct output. More specifically, the proposed inference method makes use of two pruned neural networks (NNs), namely mildly and aggressively pruned networks, which are both designed offline. In the system, a third NN uses the input data for the online selection of the appropriate pruned network. For its feature extraction, the third network employs the same convolutional layers as the aggressively pruned NN, thereby reducing the overhead of the online management. The proposed method induces some accuracy loss, but for a given level of accuracy its energy gain is considerably larger than that of employing any single pruning level. The proposed method is independent of both the pruning method and the network architecture. The efficacy of the proposed inference method is assessed on the Eyeriss hardware accelerator platform for some state-of-the-art NN architectures. Our studies show that this method may provide, on average, 70% energy reduction compared to the original NN at the cost of about 3% accuracy loss on the CIFAR-10 dataset.
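A minimal PyTorch sketch of the selection mechanism described above follows: the selector reuses the aggressively pruned network's convolutional features, and the mildly pruned network is invoked only when the selector is not confident. The layer sizes, the stand-in networks, and the confidence threshold are illustrative assumptions, not the article's trained models.

```python
import torch
import torch.nn as nn

class AggressivelyPruned(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(8, 16, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(1))
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, x):
        f = self.features(x).flatten(1)
        return self.classifier(f), f           # expose features for the selector

mild = nn.Sequential(                          # stand-in for the mildly pruned network
    nn.Flatten(), nn.Linear(3 * 32 * 32, 64), nn.ReLU(), nn.Linear(64, 10))
aggressive = AggressivelyPruned()
selector = nn.Linear(16, 1)                    # predicts whether the aggressive output is trustworthy

def infer(x, threshold=0.5):
    logits, feats = aggressive(x)
    if torch.sigmoid(selector(feats)).item() >= threshold:
        return logits                          # low-energy path
    return mild(x)                             # fall back to the larger pruned network

x = torch.randn(1, 3, 32, 32)
print(infer(x).shape)                          # torch.Size([1, 10])
```

Because the selector consumes features that the aggressively pruned network computes anyway, the online decision adds little extra work on top of the low-energy path.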
Citations: 3
Machine Learning for Electronic Design Automation: A Survey
Pub Date : 2021-01-10 DOI: 10.1145/3451179
Guyue Huang, Jingbo Hu, Yifan He, Jialong Liu, Mingyuan Ma, Zhaoyang Shen, Juejian Wu, Yuanfan Xu, Hengrui Zhang, Kai Zhong, Xuefei Ning, Yuzhe Ma, Haoyu Yang, Bei Yu, Huazhong Yang, Yu Wang
With the down-scaling of CMOS technology, the design complexity of very large-scale integrated circuits is increasing. Although the application of machine learning (ML) techniques in electronic design automation (EDA) can trace its history back to the 1990s, the recent breakthroughs in ML and the increasing complexity of EDA tasks have aroused more interest in incorporating ML to solve EDA tasks. In this article, we present a comprehensive review of existing ML for EDA studies, organized following the EDA hierarchy.
Citations: 114