
2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD): Latest Publications

Squeezing Accumulators in Binary Neural Networks for Extremely Resource-Constrained Applications
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549418
Azat Azamat, Jaewoo Park, Jongeun Lee
The cost and power consumption of BNN (Binarized Neural Network) hardware is dominated by additions. In particular, accumulators account for a large fraction of hardware overhead, which could be effectively reduced by using reduced-width accumulators. However, it is not straightforward to find the optimal accumulator width due to the complex interplay between width, scale, and the effect of training. In this paper, we present algorithmic and hardware-level methods to find the optimal accumulator size for BNN hardware with minimal impact on the quality of results. First, we present partial sum scaling, a top-down approach to minimizing the BNN accumulator size based on advanced quantization techniques. We also present an efficient, zero-overhead hardware design for partial sum scaling. Second, we evaluate a bottom-up approach that uses a saturating accumulator, which is more robust against overflows. Our experimental results using the CIFAR-10 dataset demonstrate that our partial sum scaling, together with our optimized accumulator architecture, can reduce the area and power consumption of the datapath by 15.50% and 27.03%, respectively, with little impact on inference performance (less than 2%), compared to using a 16-bit accumulator.
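The two accumulator-narrowing ideas above can be sketched in a few lines. The widths, the shift-based scaling rule, and all names here are illustrative assumptions, not the paper's actual formulation:

```python
# Toy sketch of the two accumulator-narrowing ideas: clamping on overflow
# (bottom-up) and scaling partial sums before accumulation (top-down).
# Widths, names, and the shift rule are illustrative assumptions.

def saturating_accumulate(values, width):
    """Bottom-up idea: clamp the running sum to a signed `width`-bit range
    instead of letting it wrap on overflow."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    acc = 0
    for v in values:
        acc = max(lo, min(hi, acc + v))
    return acc

def scaled_accumulate(values, width, shift):
    """Top-down idea (in the spirit of partial sum scaling): right-shift
    each partial sum so the total fits a narrower accumulator."""
    lo, hi = -(1 << (width - 1)), (1 << (width - 1)) - 1
    acc = 0
    for v in values:
        acc = max(lo, min(hi, acc + (v >> shift)))
    return acc

# With a 4-bit accumulator, a stream of +3s saturates at +7 instead of wrapping:
print(saturating_accumulate([3] * 10, width=4))       # 7
print(scaled_accumulate([3] * 10, width=8, shift=1))  # 10 (each 3>>1 == 1)
```

Saturation trades graceful degradation near the clipping point for a narrower datapath; scaling trades a small quantization error for the same.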
Citations: 0
All-in-One: A Highly Representative DNN Pruning Framework for Edge Devices with Dynamic Power Management
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549379
Yifan Gong, Zheng Zhan, Pu Zhao, Yushu Wu, Chaoan Wu, Caiwen Ding, Weiwen Jiang, Minghai Qin, Yanzhi Wang
During the deployment of deep neural networks (DNNs) on edge devices, many research efforts have been devoted to coping with limited hardware resources. However, little attention is paid to the influence of dynamic power management. As edge devices typically run on a limited battery energy budget (rather than the almost unlimited energy supply of servers or workstations), their dynamic power management often changes the execution frequency, as in the widely used dynamic voltage and frequency scaling (DVFS) technique. This leads to highly unstable inference speed, especially for computation-intensive DNN models, which can harm user experience and waste hardware resources. We first identify this problem and then propose All-in-One, a highly representative pruning framework that works with dynamic power management using DVFS. The framework can use only one set of model weights and soft masks (together with other auxiliary parameters of negligible storage) to represent multiple models of various pruning ratios. By re-configuring the model to the pruning ratio corresponding to a specific execution frequency (and voltage), we are able to achieve stable inference speed, i.e., keeping the difference in speed performance under various execution frequencies as small as possible. Our experiments demonstrate that our method not only achieves high accuracy for multiple models of different pruning ratios, but also reduces the variance of their inference latency across frequencies, with the minimal memory consumption of only one model and one soft mask.
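The one-set-of-weights idea above can be caricatured with magnitude-based masks: a single shared weight tensor plus one cheap mask per frequency level. The mask rule and all names are illustrative assumptions, not the All-in-One training procedure:

```python
import numpy as np

# Caricature of "one set of weights, many pruning ratios": a single weight
# tensor is shared, and a per-ratio mask derived from weight magnitude
# selects the active subset at the current DVFS level. The magnitude-based
# mask rule and all names are illustrative assumptions.

rng = np.random.default_rng(0)
weights = rng.standard_normal((4, 4))   # one shared weight tensor

def magnitude_mask(w, prune_ratio):
    """Zero out the `prune_ratio` fraction of smallest-magnitude entries."""
    k = int(round(w.size * prune_ratio))
    if k == 0:
        return np.ones_like(w)
    thresh = np.sort(np.abs(w), axis=None)[k - 1]
    return (np.abs(w) > thresh).astype(w.dtype)

# One mask per supported frequency level; picking a level reconfigures the
# effective model without duplicating the shared weights.
masks = {freq: magnitude_mask(weights, ratio)
         for freq, ratio in [("high", 0.0), ("mid", 0.5), ("low", 0.75)]}

def effective_weights(freq):
    return weights * masks[freq]
```

Storage stays near one dense model plus near-binary masks, while each frequency level sees a model sized to its compute budget.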
Citations: 2
Superfast Full-Scale GPU-Accelerated Global Routing
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549474
Shiju Lin, Martin D. F. Wong
Global routing is an essential step in physical design. Recently, there have been efforts to accelerate global routing using GPUs. However, they focus only on certain stages of global routing and achieve limited overall speedup. In this paper, we present a superfast full-scale GPU-accelerated global router and introduce useful parallelization techniques for routing. Experiments show that our 3D router achieves both good quality and short runtime compared to other state-of-the-art academic global routers.
Citations: 0
Sound Source Localization using Stochastic Computing
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549373
Peter Schober, Seyedeh Newsha Estiri, Sercan Aygün, N. Taherinejad, M. Najafi
Stochastic computing (SC) is an alternative computing paradigm that processes data in the form of long uniform bit-streams rather than conventional compact weighted binary numbers. SC is fault-tolerant and can compute on small, efficient circuits, which promises advantages over conventional arithmetic for small computer chips. SC has been primarily used in scientific research, not in practical applications. Digital sound source localization (SSL) is a useful signal processing technique that locates speakers using the multiple microphones found in cell phones, laptops, and other voice-controlled devices. SC has not been integrated into SSL in practice or theory. In this work, for the first time to the best of our knowledge, we implement an SSL algorithm in the stochastic domain and develop a functional SC-based sound source localizer. The developed design can replace the conventional design of the algorithm. The practical part of this work shows that the proposed stochastic circuit does not rely on conventional analog-to-digital conversion and can process data in the form of pulse-width-modulated (PWM) signals. The proposed SC design consumes up to 39% less area than the conventional baseline design. The SC-based design can consume less power depending on the computational accuracy, for example, 6% less power consumption for 3-bit inputs. The presented stochastic circuit is not limited to SSL and is readily applicable to other practical applications such as radar ranging, wireless location, sonar direction finding, beamforming, and sensor calibration.
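The bit-stream encoding that SC relies on can be illustrated with the classic unipolar scheme, where multiplication reduces to a bitwise AND of two streams. This is standard SC background, not the paper's specific SSL circuit:

```python
import random

# Classic unipolar stochastic computing: a value in [0, 1] becomes a long
# bit-stream whose fraction of 1s equals the value, and multiplication
# reduces to a bitwise AND of two independent streams. Standard SC
# background, not the paper's SSL design.

def to_stream(p, n, rng):
    """Encode probability p as an n-bit stochastic stream."""
    return [1 if rng.random() < p else 0 for _ in range(n)]

def from_stream(bits):
    """Decode a stream back to a value: the fraction of 1s."""
    return sum(bits) / len(bits)

def sc_multiply(a, b, n=10_000, seed=42):
    rng = random.Random(seed)
    sa = to_stream(a, n, rng)
    sb = to_stream(b, n, rng)
    # A single AND gate per bit performs the multiplication.
    return from_stream([x & y for x, y in zip(sa, sb)])

print(sc_multiply(0.5, 0.5))  # close to 0.25; error shrinks as n grows
```

The fault tolerance mentioned above follows from this encoding: flipping a few bits in a 10,000-bit stream perturbs the decoded value only slightly.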
Citations: 3
Speculative Load Forwarding Attack on Modern Processors
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549417
Hasini Witharana, P. Mishra
Modern processors deliver high performance by utilizing advanced features such as out-of-order execution, branch prediction, speculative execution, and sophisticated buffer management. Unfortunately, these techniques have introduced diverse vulnerabilities including Spectre, Meltdown, and microarchitectural data sampling (MDS). Although Spectre and Meltdown can leak data via memory side channels, MDS has been shown to leak data from the CPU-internal buffers in Intel architectures. AMD has reported that its processors are not vulnerable to MDS/Meltdown-type attacks. In this paper, we present a Meltdown/MDS type of attack that leaks data from the load queue in AMD Zen family architectures. To the best of our knowledge, our approach is the first attempt at developing an attack on AMD architectures that uses speculative load forwarding to leak data through the load queue. Experimental evaluation demonstrates that our proposed attack is successful on multiple machines with AMD processors. We also explore a lightweight mitigation to defend against the speculative load forwarding attack on modern processors.
Citations: 3
How Good Is Your Verilog RTL Code? A Quick Answer from Machine Learning
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549375
Prianka Sengupta, Aakash Tyagi, Yiran Chen, Jiangkun Hu
Hardware Description Language (HDL) is a common entry point for designing digital circuits. Differences in HDL coding styles and design choices may lead to considerably different design quality and performance-power tradeoffs. In general, the impact of HDL coding is not clear until logic synthesis or even layout is completed. However, running synthesis merely as feedback for HDL code is computationally not economical, especially in early design phases when the code needs to be frequently modified. Furthermore, in late stages of design convergence burdened with high-impact engineering change orders (ECOs), design iterations become prohibitively expensive. To this end, we propose a machine learning approach to Verilog-based Register-Transfer Level (RTL) design assessment without going through the synthesis process. It allows designers to quickly evaluate the performance-power tradeoff among different options of RTL designs. Experimental results show that our proposed technique achieves an average of 95% prediction accuracy in terms of post-placement analysis, and is 6 orders of magnitude faster than evaluation by running logic synthesis and placement.
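The premise of predicting post-synthesis quality directly from RTL text can be illustrated with a toy sketch. The feature set and the hard-coded linear "model" below are illustrative stand-ins for a trained estimator, not the paper's actual features or model:

```python
import re

# Toy illustration of predicting design quality from RTL text without
# running synthesis: count cheap lexical features of the Verilog source
# and feed them to a regressor. The features and the hard-coded linear
# "model" are illustrative assumptions, not the paper's method.

FEATURES = ["always", r"\*", r"\+", r"\[", "if"]

def rtl_features(src):
    """Count occurrences of each feature pattern in the source text."""
    return [len(re.findall(pat, src)) for pat in FEATURES]

# Stand-in for a trained regressor; real coefficients would be learned
# against ground-truth synthesis/placement results.
COEF = [1.5, 4.0, 1.0, 0.5, 0.8]

def predict_area(src):
    return sum(c * f for c, f in zip(COEF, rtl_features(src)))

verilog = """
module mac(input clk, input [7:0] a, b, output reg [15:0] y);
  always @(posedge clk) y <= y + a * b;
endmodule
"""
print(predict_area(verilog))
```

A real flow would use far richer features (netlist-like graphs, operator bit-widths) and a trained model, but the feedback loop looks the same: parse, featurize, predict, without invoking synthesis.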
Citations: 6
Hidden-ROM: A Compute-in-ROM Architecture to Deploy Large-Scale Neural Networks on Chip with Flexible and Scalable Post-Fabrication Task Transfer Capability
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549335
Yiming Chen, Guodong Yin, Ming-En Lee, Wenjun Tang, Zekun Yang, Yongpan Liu, Huazhong Yang, Xueqing Li
Motivated by reducing the data transfer activities in data-intensive neural network computing, SRAM-based compute-in-memory (CiM) has made significant progress. Unfortunately, SRAM has low density and limited on-chip capacity. This makes the deployment of large models inefficient due to the frequent DRAM accesses to update the weights in SRAM. Recently, a ROM-based CiM design, YOLoC, revealed the unique opportunity of deploying a large-scale neural network in CMOS by exploiting the intriguing high density of ROM. However, even though an assisting SRAM has been adopted in YOLoC for task transfer within the same domain, it is still a big challenge to overcome the read-only limitation of ROM and enable more flexibility. Therefore, it is of paramount significance to develop new ROM-based CiM architectures and provide broader task space and model expansion capability for more complex tasks. This paper presents Hidden-ROM for high flexibility of ROM-based CiM. Hidden-ROM provides several novel ideas beyond YOLoC. First, it adopts a one-SRAM-many-ROM method that "hides" ROM cells to support various datasets of different domains, including CIFAR10/100, FER2013, and ImageNet. Second, Hidden-ROM provides the model expansion capability after chip fabrication to update the model for more complex tasks when needed. Experiments show that Hidden-ROM designed for ResNet-18 pretrained on CIFAR100 (item classification) can achieve <0.5% accuracy loss on FER2013 (facial expression recognition), while YOLoC degrades by >40%.
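The division of labor between a large fixed ROM and a small programmable SRAM can be caricatured numerically: frozen dense weights plus a much smaller learnable correction. The additive low-rank patch used here is an illustrative assumption, not the actual Hidden-ROM mechanism:

```python
import numpy as np

# Caricature of compute-in-ROM with post-fabrication task transfer: a
# large weight matrix is frozen in ROM at fabrication time, and a much
# smaller programmable (SRAM-like) component adapts the effective weights
# to a new task afterwards. The additive low-rank "patch" is an
# illustrative assumption, not the paper's Hidden-ROM mechanism.

rng = np.random.default_rng(1)

rom_weights = rng.standard_normal((64, 64))  # fixed at fabrication
u = np.zeros((64, 4))                        # small programmable factors
v = np.zeros((4, 64))

def effective(x):
    """ROM matmul plus a low-rank programmable correction."""
    return x @ rom_weights + (x @ u) @ v

x = rng.standard_normal((1, 64))
base = x @ rom_weights
assert np.allclose(effective(x), base)       # zero patch -> pure ROM

# "Task transfer": reprogram only the small factors; ROM stays untouched.
u = rng.standard_normal((64, 4)) * 0.1
v = rng.standard_normal((4, 64)) * 0.1
```

Here the programmable storage is 512 values against 4096 frozen ones, mirroring the density argument: most capacity lives in dense read-only cells, and only a small writable fraction changes per task.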
Citations: 0
A Robust Global Routing Engine with High-accuracy Cell Movement under Advanced Constraints
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549421
Ziran Zhu, Fuheng Shen, Yangjie Mei, Zhipeng Huang, Jianli Chen, Jun-Zhi Yang
Placement and routing are typically defined as two separate problems to reduce design complexity. However, such a divide-and-conquer approach inevitably degrades solution quality because the correlations/objectives of placement and routing are not entirely consistent. Besides, with various constraints (e.g., timing, R/C characteristics, voltage area, etc.) imposed by advanced circuit designs, bridging the gap between placement and routing while satisfying the advanced constraints has become more challenging. In this paper, we develop a robust global routing engine with high-accuracy cell movement under advanced constraints to narrow the gap and improve the routing solution. We first present a routing refinement technique to obtain a convergent routing result based on fixed placement, which provides more accurate information for subsequent cell movement. To achieve fast and high-accuracy position prediction for cell movement, we construct a lookup table (LUT) considering complex constraints/objectives (e.g., routing direction and layer-based power consumption), and generate a timing-driven gain map for each cell based on the LUT. Finally, based on the prediction, we propose an alternating cell movement and cluster movement scheme followed by partial rip-up and reroute to optimize the routing solution. Experimental results on the ICCAD 2020 contest benchmarks show that our algorithm achieves the best total score among all published works. Compared with the champion of the ICCAD 2021 contest, experimental results on the ICCAD 2021 contest benchmarks show that our algorithm achieves better solution quality in shorter runtime.
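The LUT-plus-gain-map step can be illustrated with a toy sketch: a table keyed by layer and direction supplies per-edge wire costs, and a gain map scores every candidate position for one cell as the cost saved relative to its current position. The Manhattan wirelength proxy and the cost values are illustrative assumptions, not the paper's cost model:

```python
import numpy as np

# Toy sketch of the LUT + gain-map idea: a lookup table keyed by routing
# layer and direction gives a per-grid-edge cost, and a gain map scores
# every candidate grid position for one cell by the routing cost saved
# versus its current position. Costs and the Manhattan wirelength proxy
# are illustrative assumptions.

LUT = {("M1", "H"): 1.0, ("M2", "V"): 0.8}  # cost per horizontal/vertical edge

def wire_cost(src, dst, layer_h="M1", layer_v="M2"):
    dx, dy = abs(src[0] - dst[0]), abs(src[1] - dst[1])
    return dx * LUT[(layer_h, "H")] + dy * LUT[(layer_v, "V")]

def gain_map(cell_pos, pin_pos, grid=(4, 4)):
    """Gain of moving the cell to each grid position (positive = better)."""
    current = wire_cost(cell_pos, pin_pos)
    g = np.zeros(grid)
    for gx in range(grid[0]):
        for gy in range(grid[1]):
            g[gx, gy] = current - wire_cost((gx, gy), pin_pos)
    return g

# A cell at (0, 0) connected to a pin at (3, 3): gain peaks at the pin.
g = gain_map(cell_pos=(0, 0), pin_pos=(3, 3))
```

Precomputing such a map per cell is what makes position prediction cheap: the mover only consults the map instead of re-routing each candidate.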
Citations: 0
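The timing-driven gain map described in the abstract above can be sketched in a few lines. This is purely a toy illustration under stated assumptions — the LUT keys (layer, routing direction), the Manhattan wirelength cost model, and every name here are hypothetical, not the authors' implementation:

```python
# Toy sketch of a LUT-driven gain map for cell movement (illustrative only).
# Assumptions: the LUT maps (layer, direction) to a per-unit-length cost, and
# each cell carries net segments with a known sink pin and current length.

def build_gain_map(cells, candidate_positions, lut):
    """For each cell, score every candidate position with a cheap LUT lookup
    instead of an exact re-route, and keep the best-gain move."""
    gain_map = {}
    for cell in cells:
        best_gain, best_pos = 0.0, cell["pos"]
        for pos in candidate_positions[cell["name"]]:
            gain = 0.0
            for seg in cell["segments"]:
                unit_cost = lut[(seg["layer"], seg["dir"])]
                # Manhattan length of the segment if the cell moved to `pos`.
                new_len = abs(pos[0] - seg["sink"][0]) + abs(pos[1] - seg["sink"][1])
                gain += unit_cost * seg["length"] - unit_cost * new_len
            if gain > best_gain:
                best_gain, best_pos = gain, pos
        gain_map[cell["name"]] = (best_pos, best_gain)
    return gain_map
```

A movement scheme like the one in the abstract would then repeatedly commit the highest-gain moves and re-evaluate, with rip-up and reroute of the affected nets.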
Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction using Graph Neural Network and U-Net 基于图神经网络和U-Net的引脚可达性和路由拥塞感知DRC热点预测
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549346
Kyeonghyeon Baek, Hyunbum Park, Suwan Kim, Kyumyung Choi, Taewhan Kim
An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce a substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it was not easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML-based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called the pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relations. Then, we propose a new ML model, called PGNN, which tightly combines a GNN (graph neural network) and U-Net in a way that the GNN is used to embed pin accessibility information abstracted from our pin proximity graph while the U-Net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using the Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvements on F1-score while taking 5.5× faster inference time in comparison with that of the state-of-the-art techniques.
An accurate DRC (design rule check) hotspot prediction at the placement stage is essential to reduce the substantial design time spent iterating between placement and routing. For chips implemented with advanced technology nodes, (1) pin accessibility and (2) routing congestion are known to be the two major causes of DRVs (design rule violations). Although many ML (machine learning) techniques have been proposed for this prediction problem, it has been difficult to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes that limitation with a novel ML-based DRC hotspot prediction technique that accurately captures the combined impact of items 1 and 2 on DRC hotspots. Specifically, we devise a graph, called the pin proximity graph, that effectively models the spatial information of cell I/O pins and the pin-to-pin disturbance relations. We then propose a new ML model, called PGNN, that tightly combines a GNN (graph neural network) and U-Net: the GNN embeds the pin accessibility information abstracted from our pin proximity graph, while the U-Net extracts routing congestion information from grid-based features. In experiments on a set of benchmark designs using the Nangate 15nm library, our PGNN outperforms existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvement in F1-score with 5.5× faster inference time compared with the state-of-the-art techniques.
{"title":"Pin Accessibility and Routing Congestion Aware DRC Hotspot Prediction using Graph Neural Network and U-Net","authors":"Kyeonghyeon Baek, Hyunbum Park, Suwan Kim, Kyumyung Choi, Taewhan Kim","doi":"10.1145/3508352.3549346","DOIUrl":"https://doi.org/10.1145/3508352.3549346","abstract":"An accurate DRC (design rule check) hotspot prediction at the placement stage is essential in order to reduce a substantial amount of design time required for the iterations of placement and routing. It is known that for implementing chips with advanced technology nodes, (1) pin accessibility and (2) routing congestion are two major causes of DRVs (design rule violations). Though many ML (machine learning) techniques have been proposed to address this prediction problem, it was not easy to assemble the aggregate data on items 1 and 2 in a unified fashion for training ML models, resulting in a considerable accuracy loss in DRC hotspot prediction. This work overcomes this limitation by proposing a novel ML based DRC hotspot prediction technique, which is able to accurately capture the combined impact of items 1 and 2 on DRC hotspots. Precisely, we devise a graph, called pin proximity graph, that effectively models the spatial information on cell I/O pins and the information on pin-to-pin disturbance relation. Then, we propose a new ML model, called PGNN, which tightly combines GNN (graph neural network) and U-net in a way that GNN is used to embed pin accessibility information abstracted from our pin proximity graph while U-net is used to extract routing congestion information from grid-based features. Through experiments with a set of benchmark designs using Nangate 15nm library, our PGNN outperforms the existing ML models on all benchmark designs, achieving on average 7.8~12.5% improvements on F1-score while taking 5.5× fast inference time in comparison with that of the state-of-the-art techniques.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130532575","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 7
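As a rough illustration of the pin proximity graph idea — connecting cell I/O pins that are spatially close enough to disturb each other — here is a minimal sketch. The distance-threshold rule and all names are assumptions for illustration, not the paper's exact construction (which a GNN would then embed):

```python
# Minimal sketch of a pin proximity graph (illustrative assumption):
# nodes are pins, and two pins are linked when their Euclidean distance
# is within a "disturbance" radius.
from itertools import combinations

def pin_proximity_graph(pins, radius):
    """pins: {name: (x, y)}. Returns an adjacency map linking every pair
    of pins whose distance is at most `radius`."""
    adj = {name: set() for name in pins}
    for (a, pa), (b, pb) in combinations(pins.items(), 2):
        # Compare squared distances to avoid the square root.
        if (pa[0] - pb[0]) ** 2 + (pa[1] - pb[1]) ** 2 <= radius ** 2:
            adj[a].add(b)
            adj[b].add(a)
    return adj
```

In a setup like the paper's, each node would also carry features (pin geometry, access directions) for the GNN to aggregate over these edges.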
On Minimizing the Read Latency of Flash Memory to Preserve Inter-tree Locality in Random Forest 最小化闪存读延迟以保持随机森林树间局部性的研究
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549365
Yu-Cheng Lin, Yu-Pei Liang, Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, W. Shih
Many prior research works have widely discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model is constructed on a personal computer or a server platform, and then integrated into the embedded system for just-in-time inference. Considering the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move the inference model between the main memory and a secondary storage (e.g., flash memory). For tackling this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we have proposed a solution, namely locality-aware random forest (LaRF), to preserve the inter-locality of all decision trees within a random forest model during the model construction process. Owing to the locality preservation, LaRF can improve the read latency by at least 81.5% compared to the original random forest library.
Much prior work has discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor: an inference model is constructed on a personal computer or server platform and then integrated into the embedded system for just-in-time inference. Given the limited main-memory space of embedded systems, an important problem for embedded machine learning systems is how to efficiently move the inference model between main memory and secondary storage (e.g., flash memory). To tackle this problem, we must consider how to preserve locality inside the inference model during model construction. We therefore propose a solution, locality-aware random forest (LaRF), that preserves the inter-locality of all decision trees within a random forest model during the model construction process. Owing to this locality preservation, LaRF improves read latency by at least 81.5% compared to the original random forest library.
{"title":"On Minimizing the Read Latency of Flash Memory to Preserve Inter-tree Locality in Random Forest","authors":"Yu-Cheng Lin, Yu-Pei Liang, Tseng-Yi Chen, Yuan-Hao Chang, Shuo-Han Chen, W. Shih","doi":"10.1145/3508352.3549365","DOIUrl":"https://doi.org/10.1145/3508352.3549365","abstract":"Many prior research works have been widely discussed how to bring machine learning algorithms to embedded systems. Because of resource constraints, embedded platforms for machine learning applications play the role of a predictor. That is, an inference model will be constructed on a personal computer or a server platform, and then integrated into embedded systems for just-in-time inference. With the consideration of the limited main memory space in embedded systems, an important problem for embedded machine learning systems is how to efficiently move inference model between the main memory and a secondary storage (e.g., flash memory). For tackling this problem, we need to consider how to preserve the locality inside the inference model during model construction. Therefore, we have proposed a solution, namely locality-aware random forest (LaRF), to preserve the inter-locality of all decision trees within a random forest model during the model construction process. Owing to the locality preservation, LaRF can improve the read latency by 81.5% at least, compared to the original random forest library.","PeriodicalId":270592,"journal":{"name":"2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125364534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
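One way inter-tree locality could be preserved in storage — purely a sketch of the general idea, not the LaRF implementation — is to interleave node records level by level across all trees, so that a single flash-page read serves the same traversal depth of every tree instead of one tree at a time:

```python
# Sketch (assumed layout, not LaRF's): store a forest level-by-level across
# trees. During inference every tree visits its root first, then depth 1,
# etc., so co-locating same-depth nodes of different trees keeps the reads
# of one inference pass on adjacent pages.

def level_order_layout(trees):
    """trees: list of trees, each a list of levels, each level a list of
    node records. Returns a flat storage order: all roots first, then every
    tree's level 1, and so on."""
    layout = []
    max_depth = max(len(tree) for tree in trees)
    for depth in range(max_depth):
        for tid, tree in enumerate(trees):
            if depth < len(tree):
                for node in tree[depth]:
                    layout.append((tid, depth, node))
    return layout
```

Whether this particular interleaving matches LaRF's construction is an open assumption; the point is only that the storage order, not the model, is what determines flash read latency.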