
2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD): Latest Publications

Neurally-Inspired Hyperdimensional Classification for Efficient and Robust Biosignal Processing
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549477
Yang Ni, N. Lesica, Fan-Gang Zeng, M. Imani
Biosignals are collected by multiple sensors that record time-series information. Because time series contain temporal dependencies, they are difficult for existing machine learning algorithms to process. Hyper-Dimensional Computing (HDC) has been introduced as a brain-inspired paradigm for lightweight time-series classification. However, existing HDC algorithms have the following drawbacks: (1) low classification accuracy stemming from linear hyperdimensional representation, (2) lack of real-time learning support due to costly, hardware-unfriendly operations, and (3) inability to build a strong model from partially labeled data. In this paper, we propose TempHD, a novel hyperdimensional computing method for efficient and accurate biosignal classification. We first develop a novel non-linear hyperdimensional encoding that maps data points into high-dimensional space. Unlike existing HDC solutions that use costly mathematics for encoding, TempHD preserves the spatial-temporal information of the data in the original space before mapping it into high-dimensional space. To obtain the most informative representation, our encoding method considers the non-linear interactions between both spatial sensors and temporally sampled data. Our evaluation shows that TempHD provides higher classification accuracy, significantly higher computational efficiency, and, more importantly, the capability to learn from partially labeled data. We evaluate TempHD's effectiveness on noisy EEG data used for a brain-machine interface. Our results show that TempHD achieves, on average, 2.3% higher classification accuracy as well as 7.7× and 21.8× speedups in training and testing time, respectively, compared to state-of-the-art HDC algorithms.
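TempHD's non-linear encoder is not specified in the abstract; as background, the generic HDC primitives it builds on (binding, bundling, similarity search) can be sketched in a few lines. The dimensionality `D`, the bipolar representation, and the sensor/level setup below are illustrative assumptions, not the paper's design.

```python
import math
import random

random.seed(0)  # deterministic illustration
D = 2048        # hypervector dimensionality (illustrative choice)

def random_hv():
    """Random bipolar hypervector; random HVs are quasi-orthogonal in high D."""
    return [random.choice((-1, 1)) for _ in range(D)]

def bind(a, b):
    """Element-wise multiply: associates two hypervectors (e.g. a sensor id with a value)."""
    return [x * y for x, y in zip(a, b)]

def bundle(hvs):
    """Element-wise majority: superimposes several hypervectors into one."""
    return [1 if s >= 0 else -1 for s in (sum(v) for v in zip(*hvs))]

def cosine(a, b):
    """Similarity between hypervectors; classification compares against class prototypes."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))

# Encode one time step of a 3-sensor reading: bind each sensor's id HV with a
# level HV for its quantized value, then bundle across sensors.
sensor_ids = [random_hv() for _ in range(3)]
levels = [random_hv() for _ in range(4)]  # 4 quantization levels
reading = [2, 0, 3]                       # quantized sample per sensor
sample_hv = bundle([bind(sensor_ids[s], levels[v]) for s, v in enumerate(reading)])
```

Classification then reduces to comparing an encoded sample against per-class prototype hypervectors with `cosine`; temporal structure is typically added by permuting and bundling consecutive sample hypervectors.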
Citations: 6
Reliable Machine Learning for Wearable Activity Monitoring: Novel Algorithms and Theoretical Guarantees
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549430
Dina Hussein, Taha Belkhouja, Ganapati Bhat, J. Doppa
Wearable devices are becoming popular for health and activity monitoring. The machine learning (ML) models for these applications are trained by collecting data in a laboratory with precise control of experimental settings. However, during real-world deployment and usage, the experimental settings (e.g., sensor position or sampling rate) may deviate from those used during training. This discrepancy can degrade the accuracy and effectiveness of health monitoring applications. Therefore, there is a great need for reliable ML approaches that provide high accuracy in real-world deployment. In this paper, we propose a novel statistical optimization approach, referred to as StatOpt, that automatically accounts for real-world disturbances in sensing data to improve the reliability of ML models for wearable devices. We theoretically derive upper bounds on sensor data disturbance under which StatOpt produces an ML model with reliability certificates. We validate StatOpt on two publicly available datasets for human activity recognition. Our results show that, compared to standard ML algorithms, the reliable ML classifiers enabled by StatOpt improve accuracy by up to 50% in real-world settings with zero overhead, while baseline approaches incur significant overhead and fail to achieve comparable accuracy.
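The deployment-time disturbances the abstract names (sensor position, sampling rate) are easy to mimic in software when stress-testing a classifier. The snippet below is only an illustration of a sampling-rate mismatch via linear interpolation, not part of StatOpt; the function name and the `factor` convention are hypothetical.

```python
def resample(signal, factor):
    """Re-read a signal as if the deployed device sampled `factor` times
    more coarsely than the training device, using linear interpolation."""
    n = int(len(signal) / factor)
    out = []
    for i in range(n):
        pos = i * factor          # fractional position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(signal) - 1)
        frac = pos - lo
        out.append(signal[lo] * (1 - frac) + signal[hi] * frac)
    return out

print(resample([0, 1, 2, 3, 4, 5], 1.5))  # -> [0.0, 1.5, 3.0, 4.5]
```

Feeding such perturbed windows to a trained classifier exposes the accuracy degradation that reliability certificates are meant to bound.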
Citations: 6
Workload-Balanced Graph Attention Network Accelerator with Top-K Aggregation Candidates
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549343
Naebeom Park, Daehyun Ahn, Jae-Joon Kim
Graph attention networks (GATs) are gaining attention for various transductive and inductive graph-processing tasks due to their higher accuracy than conventional graph convolutional networks (GCNs). The power-law distribution of real-world graph-structured data, however, causes a severe workload-imbalance problem for GAT accelerators. To reduce the degradation of PE utilization due to this imbalance, we present algorithm/hardware co-design results for a GAT accelerator that balances the workload assigned to processing elements by allowing only K neighbor nodes to participate in the aggregation phase. The proposed model selects the K neighbor nodes with the highest attention scores, which represent the relevance between two nodes, to minimize the accuracy drop. Experimental results show that our algorithm/hardware co-design achieves higher processing speed and energy efficiency than GAT accelerators using conventional workload-balancing techniques. Furthermore, we demonstrate that the proposed GAT accelerators can be made faster than GCN accelerators, which typically perform a smaller number of computations.
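The candidate-capping step at the heart of the workload balancing can be sketched directly; the attention computation itself is abstracted away into given scores, and the node ids, scores, and `k` below are illustrative.

```python
import heapq

def topk_neighbors(attn_scores, k):
    """Cap a node's aggregation candidates at its k highest-attention
    neighbors, so workload per node is bounded regardless of degree."""
    return heapq.nlargest(k, attn_scores, key=attn_scores.get)

# A hub node with 5 neighbors is capped at k = 3 aggregation candidates.
scores = {10: 0.31, 11: 0.05, 12: 0.22, 13: 0.40, 14: 0.02}
print(topk_neighbors(scores, 3))  # -> [13, 10, 12]
```

Because every node now contributes at most `k` terms to aggregation, processing elements see near-uniform work even on power-law graphs.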
Citations: 0
A Reconfigurable Hardware Library for Robot Scene Perception
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3561110
Yanqi Liu, A. Opipari, O. Jenkins, R. I. Bahar
Perceiving the position and orientation of objects (i.e., pose estimation) is a crucial prerequisite for robots acting within their natural environment. We present a hardware-acceleration approach that enables real-time, energy-efficient articulated pose estimation for robots operating in unstructured environments. Our hardware accelerator implements Nonparametric Belief Propagation (NBP) to infer the belief distribution of articulated object poses. Our approach is, on average, 26× more energy-efficient than a high-end GPU and 11× faster than an embedded low-power GPU implementation. Moreover, we present a Monte-Carlo Perception Library, generated from high-level synthesis, that enables reconfigurable hardware designs on FPGA fabrics better tuned to user-specified scene, resource, and performance constraints.
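NBP represents belief messages as weighted particle sets, and one kernel such Monte-Carlo inference runs repeatedly is weight-proportional resampling. A minimal systematic-resampling sketch follows; it is generic background, not the paper's hardware implementation, and the seeding is only for reproducibility.

```python
import random

random.seed(7)  # deterministic illustration

def resample_particles(particles, weights):
    """Systematic resampling: redraw the particle set in proportion to the
    weights, using one random offset and evenly spaced selection points."""
    total = sum(weights)
    step = total / len(particles)
    u = random.uniform(0, step)   # single random offset
    out, cum, i = [], weights[0], 0
    for _ in particles:
        while cum < u:            # advance to the particle covering point u
            i += 1
            cum += weights[i]
        out.append(particles[i])
        u += step
    return out
```

With a degenerate weight vector all mass collapses onto one particle: `resample_particles([0, 1, 2, 3], [0.0, 0.0, 1.0, 0.0])` yields `[2, 2, 2, 2]`.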
Citations: 1
Fast and Compact Interleaved Modular Multiplication based on Carry Save Addition
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549414
O. Mazonka, E. Chielle, Deepraj Soni, M. Maniatakos
Improving fully homomorphic encryption computation by designing specialized hardware is an active topic of research. The most prominent encryption schemes operate on long polynomials, requiring many concurrent modular multiplications of very large numbers. Thus, it is crucial to use many small and efficient multipliers. Interleaved and Montgomery iterative multipliers are the best candidates for the task. Interleaved designs, however, suffer from longer latency because they require a number comparison within each iteration; Montgomery designs, on the other hand, need extra conversion of the operands or the result. In this work, we propose a novel hardware design that combines the best of both worlds: it exhibits the carry-save addition of Montgomery designs without the need for any domain conversions. Experimental results demonstrate latency-area product efficiency improved by up to 47% compared to the standard interleaved multiplier for large arithmetic word sizes.
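For reference, the classical interleaved modular multiplication being improved on looks like this in software; the two conditional subtractions per iteration are the comparisons that make the interleaved design slow in hardware and that the proposed carry-save scheme avoids. This is a textbook sketch (assuming 0 ≤ a, b < m), not the paper's design.

```python
def interleaved_modmul(a, b, m):
    """Bit-serial (a * b) mod m, scanning a MSB-first: shift the running
    result, conditionally add b, then reduce back below m by comparison."""
    r = 0
    for bit in bin(a)[2:]:
        r = (r << 1) + (b if bit == '1' else 0)  # r < 3m after this line
        if r >= m:   # first comparison/subtraction
            r -= m
        if r >= m:   # second comparison/subtraction restores r < m
            r -= m
    return r
```

For example, `interleaved_modmul(5, 6, 7)` returns `2` (30 mod 7). In hardware, each `r >= m` test is a full-width carry-propagate comparison, which is exactly the per-iteration latency the carry-save formulation removes.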
Citations: 2
Heterogeneous Graph Neural Network-based Imitation Learning for Gate Sizing Acceleration
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549361
Xinyi Zhou, Junjie Ye, Chak-Wa Pui, Kun Shao, Guangliang Zhang, Bin Wang, Jianye Hao, Guangyong Chen, P. Heng
Gate sizing is an important step in logic synthesis, where cells are resized to optimize metrics such as area, timing, power, and leakage. In this work, we consider the gate sizing problem of leakage-power optimization under timing constraints. Lagrangian Relaxation is a widely employed optimization method for gate sizing problems. We accelerate Lagrangian Relaxation-based algorithms by narrowing down the range of cells to resize. In particular, we formulate a heterogeneous directed graph to represent the timing graph, propose a heterogeneous graph neural network as the encoder, and train it via imitation learning to mimic the selection behavior of each iteration of Lagrangian Relaxation. This network predicts the set of cells that need to be changed during the Lagrangian Relaxation optimization process. Experiments show that our accelerated gate sizer achieves performance comparable to the baseline with an average runtime reduction of 22.5%.
Citations: 3
Transitive Closure Graph-Based Warpage-aware Floorplanning for Package Designs
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549354
Yang Hsu, Min-Hsuan Chung, Yao-Wen Chang, Ci-Hong Lin
In modern heterogeneous integration technologies, chips with different processes and functionality are integrated into a package with high interconnection density and large I/O counts. Integrating multiple chips into a package may cause severe warpage problems arising from the mismatch in coefficients of thermal expansion between different manufacturing materials, leading to deformation and malfunction in the manufactured package. The industry is eager to find a solution for warpage optimization. This paper proposes the first warpage-aware floorplanning algorithm for heterogeneous integration. We first present an efficient qualitative warpage model for a multi-chip package structure based on Suhir's solution, which is more suitable for optimization than time-consuming finite element analysis. Based on the transitive closure graph floorplan representation, we then propose three perturbations for simulated annealing that optimize warpage more directly and can thus speed up the process. Finally, we develop a force-directed detailed floorplanning algorithm that further refines solutions by utilizing dead spaces. Experimental results demonstrate the effectiveness of our warpage model and algorithm.
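The optimization loop the perturbations plug into is standard simulated annealing; a generic skeleton is sketched below with a toy one-dimensional cost. The cooling schedule, iteration count, and toy objective are illustrative assumptions; the paper's contribution is the warpage-aware cost and the three TCG-based moves, neither of which is reproduced here.

```python
import math
import random

def simulated_annealing(initial, cost, perturb, t0=1.0, cooling=0.95, iters=2000):
    """Generic SA skeleton: always accept an improving move, accept a
    worsening move with probability exp(-delta / t), and cool t each step."""
    random.seed(42)  # deterministic illustration
    state, best = initial, initial
    t = t0
    for _ in range(iters):
        cand = perturb(state)
        delta = cost(cand) - cost(state)
        if delta < 0 or random.random() < math.exp(-delta / t):
            state = cand
            if cost(state) < cost(best):
                best = state
        t *= cooling
    return best

# Toy usage: minimize (x - 3)^2 with a small random-step perturbation.
sol = simulated_annealing(0.0, lambda x: (x - 3) ** 2,
                          lambda x: x + random.uniform(-0.5, 0.5))
```

A floorplanner replaces the toy state with a transitive closure graph, the perturbation with the three proposed moves, and the cost with the qualitative warpage model.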
Citations: 0
EI-MOR: A Hybrid Exponential Integrator and Model Order Reduction Approach for Transient Power/Ground Network Analysis
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549407
Cong Wang, Dongen Yang, Quan Chen
The exponential integrator (EI) method has proved to be an effective technique for accelerating large-scale transient power/ground network analysis. However, EI requires the inputs to be piece-wise linear (PWL) within one step, which greatly limits the step size when the inputs are poorly aligned. To address this issue, we first prove that EI, when used together with the rational Krylov subspace, is equivalent to performing moment-matching model order reduction (MOR) with a single input in each time step and then advancing the reduced system using EI in the same step. Based on this equivalence, we devise a hybrid method, EI-MOR, that combines EI and MOR in the same transient simulation. The majority of well-aligned inputs are still treated by EI as usual, while a few misaligned inputs are selected to be handled by a MOR process that produces a reduced model valid for arbitrary inputs. The step-size limitation imposed by the misaligned inputs can therefore be largely alleviated. Numerical experiments demonstrate the efficacy of the proposed method.
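The rational-Krylov and MOR machinery is beyond a snippet, but the basic EI step it builds on can be shown for a homogeneous system x' = Ax, where advancing by the matrix exponential is exact for any step size h. This is a sketch with input terms omitted and a plain Taylor-series matrix exponential, not the paper's method.

```python
import math

def mat_vec(M, v):
    """Dense matrix-vector product for small illustration matrices."""
    return [sum(M[i][j] * v[j] for j in range(len(v))) for i in range(len(M))]

def expm_times(M, v, terms=30):
    """e^M @ v via the truncated Taylor series (adequate for small,
    well-scaled matrices; large systems use Krylov projections instead)."""
    out = list(v)
    term = list(v)
    for k in range(1, terms):
        term = [x / k for x in mat_vec(M, term)]   # term = M^k v / k!
        out = [o + t for o, t in zip(out, term)]
    return out

def ei_step(A, x, h):
    """One exponential-integrator step for x' = A x:
    x(t + h) = e^{A h} x(t), exact regardless of the step size h."""
    Ah = [[a * h for a in row] for row in A]
    return expm_times(Ah, x)

# A stable diagonal 2x2 system: the EI step reproduces the analytic solution
# x(h) = (e^{-h}, e^{-2h}).
y = ei_step([[-1.0, 0.0], [0.0, -2.0]], [1.0, 1.0], 0.5)
```

It is the PWL-input correction term, not this homogeneous part, that forces small steps under misaligned inputs, which is what the MOR fallback addresses.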
Citations: 1
ReD-LUT: Reconfigurable In-DRAM LUTs Enabling Massive Parallel Computation
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549469
Ranyang Zhou, A. Roohi, Durga Misra, Shaahin Angizi
In this paper, we propose a reconfigurable processing-in-DRAM architecture named ReD-LUT that leverages the high density of commodity main memory to enable flexible, general-purpose, and massively parallel computation. ReD-LUT supports lookup table (LUT) queries to efficiently execute complex arithmetic operations (e.g., multiplication and division) via memory read operations alone. In addition, ReD-LUT enables bulk bit-wise in-memory logic by elevating the analog operation of the DRAM sub-array to implement Boolean functions between operands stored in the same bit-line, beyond the scope of prior DRAM-based proposals. We explore the efficacy of ReD-LUT in two computationally intensive applications: low-precision deep learning acceleration and Advanced Encryption Standard (AES) computation. Our circuit-to-architecture simulation results show that for a quantized deep learning workload, ReD-LUT reduces the energy consumption per image by a factor of 21.4× compared with a GPU and achieves ~37.8× speedup and 2.1× energy efficiency over the best in-DRAM bit-wise accelerators. For AES data encryption, it reduces energy consumption by a factor of ~2.2× compared to an ASIC implementation.
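The LUT-query idea itself fits in a few lines: precompute all products of two low-precision operands once, after which each multiplication is a single table read indexed by the concatenated operands. The 4-bit width below is an illustrative assumption; the actual architecture stores and queries such tables inside DRAM sub-arrays.

```python
BITS = 4  # operand width (illustrative; real designs size tables to sub-arrays)

# One-time table fill: every product of two 4-bit operands.
LUT = [a * b for a in range(1 << BITS) for b in range(1 << BITS)]

def lut_mul(a, b):
    """Multiplication as a pure table read: the index is the two operands
    concatenated, so no arithmetic circuitry is exercised per query."""
    return LUT[(a << BITS) | b]

print(lut_mul(7, 9))  # -> 63
```

The trade-off is the classic one: table size grows as 2^(2·BITS) entries, which is why LUT-based arithmetic targets low-precision operands such as quantized deep learning workloads.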
在本文中,我们提出了一种名为ReD-LUT的可重构dram处理架构,利用商品主存储器的高密度来实现灵活、通用和大规模并行计算。ReD-LUT支持查找表(LUT)查询,仅通过内存读取操作有效地执行复杂的算术运算(例如,乘法,除法等)。此外,ReD-LUT通过提升DRAM子阵列的模拟操作来实现存储在同一位线上的操作数之间的布尔函数,从而实现内存中按位的批量逻辑,超出了先前基于DRAM的建议的范围。我们的电路到架构仿真结果表明,对于量化的深度学习工作负载,与GPU相比,ReD-LUT将每个图像的能耗降低了21.4倍,并且比最佳的dram位加速器实现了~37.8倍的加速和2.1倍的能效。对于AES数据加密,与ASIC实现相比,其能耗降低了约2.2倍。
Cited: 4
Low-Cost 7T-SRAM Compute-In-Memory Design based on Bit-Line Charge-Sharing based Analog-To-Digital Conversion
Pub Date : 2022-10-29 DOI: 10.1145/3508352.3549423
Kyeongho Lee, Joonhyung Kim, J. Park
Although compute-in-memory (CIM) is considered one of the most promising solutions to the memory wall problem, variations in analog voltage computation and analog-to-digital converter (ADC) cost remain design challenges. In this paper, we present a 7T SRAM CIM that seamlessly supports multiply-accumulate (MAC) operations between 4-bit inputs and 8-bit weights. In the proposed CIM, highly parallel and robust MAC operations are enabled by exploiting a bit-line charge-sharing scheme to process multiple inputs simultaneously. For the readout of analog MAC values, instead of adopting a conventional ADC structure, bit-line charge-sharing is used to reduce the implementation cost of reference voltage generation. Based on in-SRAM reference voltage generation and parallel analog readout in all columns, the proposed CIM efficiently reduces ADC power and area cost. In addition, variation models from Monte Carlo simulations are used during training to reduce the accuracy drop due to process variations. The implementation of a 256×64 7T SRAM CIM in a 28nm CMOS process shows that it operates over a wide voltage range from 0.6V to 1.2V with an energy efficiency of 45.8 TOPS/W at 0.6V.
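The digital equivalent of what the CIM column computes can be sketched in a few lines. This is only an illustrative model (not the 7T SRAM circuit): a per-column MAC between 4-bit unsigned inputs and 8-bit signed weights, with the bit-width constraints from the abstract enforced as assertions. The function name is hypothetical.

```python
# Illustrative sketch of the digital-equivalent column MAC the proposed
# CIM realizes in analog: sum(x_i * w_i) over one column, with 4-bit
# unsigned inputs and 8-bit signed weights.

def cim_mac(inputs, weights):
    """Multiply-accumulate over one column: sum(x_i * w_i)."""
    assert all(0 <= x < 16 for x in inputs)       # 4-bit unsigned inputs
    assert all(-128 <= w < 128 for w in weights)  # 8-bit signed weights
    return sum(x * w for x, w in zip(inputs, weights))

print(cim_mac([3, 15, 0, 7], [10, -2, 5, 1]))  # 3*10 - 15*2 + 0*5 + 7*1 = 7
```

In the hardware, this sum is accumulated as charge on the bit-lines and then digitized; the software model above is the reference against which such an analog implementation (and its process variations) would be validated.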
Cited: 2
Journal
2022 IEEE/ACM International Conference On Computer Aided Design (ICCAD)