
Latest publications: 2018 55th ACM/ESDA/IEEE Design Automation Conference (DAC)

A Novel 3D DRAM Memory Cube Architecture for Space Applications
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3195978
Anthony Agnesina, A. Sidana, James Yamaguchi, Christian Krutzik, J. Carson, J. Yang-Scharlotta, S. Lim
The first mainstream products in 3D IC design are memory devices where multiple memory tiers are horizontally integrated to offer manifold improvements compared with their 2D counterparts. Unfortunately, none of these existing 3D memory cubes are ready for harsh space environments. This paper presents a new memory cube architecture for space, based on vertical integration of Commercial-Off-The-Shelf (COTS), 3D stacked, DRAM memory devices with a custom Radiation-Hardened-By-Design (RHBD) controller offering high memory capacity, robust reliability and low latency. Validation and evaluation of the ASIC controller will be conducted prior to tape-out on a custom FPGA-based emulator platform integrating the 3D-stack.
Citations: 9
DAC 2018 Copyright Page
Pub Date : 2018-06-01 DOI: 10.1109/dac.2018.8465927
Citations: 0
Parallelizing SRAM Arrays with Customized Bit-Cell for Binary Neural Networks
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196089
Rui Liu, Xiaochen Peng, Xiaoyu Sun, W. Khwa, Xin Si, Jia-Jing Chen, Jia-Fang Li, Meng-Fan Chang, Shimeng Yu
Recent advances in deep neural networks (DNNs) have shown Binary Neural Networks (BNNs) are able to provide a reasonable accuracy on various image datasets with a significant reduction in computation and memory cost. In this paper, we explore two BNNs: hybrid BNN (HBNN) and XNOR-BNN, where the weights are binarized to +1/−1 while the neuron activations are binarized to 1/0 and +1/−1, respectively. Two SRAM bit cell designs are proposed, namely, 6T SRAM for HBNN and customized 8T SRAM for XNOR-BNN. In our design, the high-precision multiply-and-accumulate (MAC) is replaced by bitwise multiplication for HBNN or XNOR for XNOR-BNN plus bit-counting operations. To parallelize the weighted sum operation, we activate multiple word lines in the SRAM array simultaneously and digitize the analog voltage developed along the bit line by a multi-level sense amplifier (MLSA). In order to partition the large matrices in DNNs, we investigate the impact of sensing bit-levels of MLSA on the accuracy degradation for different sub-array sizes and propose using the nonlinear quantization technique to mitigate the accuracy degradation. With 64 × 64 sub-array size and 3-bit MLSA, HBNN and XNOR-BNN architectures can minimize the accuracy degradation to 2.37% and 0.88%, respectively, for an inspired VGG-16 network on the CIFAR-10 dataset. Design space exploration of SRAM based synaptic architectures with the conventional row-by-row access scheme and our proposed parallel access scheme are also performed, showing significant benefits in the area, latency and energy-efficiency. Finally, we have successfully taped-out and validated the proposed HBNN and XNOR-BNN designs in TSMC 65 nm process with measured silicon data, achieving energy-efficiency >100 TOPS/W for HBNN and >50 TOPS/W for XNOR-BNN.
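In the XNOR-BNN mode described above, a +1/−1 dot product reduces to a bitwise XNOR followed by a bit count. A minimal software sketch of that reduction (the bit encoding of +1/−1 is an assumption for illustration; the paper performs this inside the SRAM array, not in software):

```python
def xnor_popcount_mac(weights, activations):
    # Weights and activations are +1/-1 values encoded as single bits
    # (bit 1 -> +1, bit 0 -> -1); this encoding is assumed for illustration.
    assert len(weights) == len(activations)
    n = len(weights)
    # XNOR: a bitwise match means the two +1/-1 operands multiply to +1.
    matches = sum(1 for w, a in zip(weights, activations) if w == a)
    # Map the bit count back to the signed sum: matches - mismatches.
    return 2 * matches - n
```

The same identity, `dot = 2·popcount(XNOR(w, a)) − n`, is what lets the bit-counting hardware replace a high-precision MAC.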
Citations: 60
Measurement-Based Cache Representativeness on Multipath Programs
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196075
Suzana Milutinovic, J. Abella, E. Mezzetti, F. Cazorla
Autonomous vehicles push embedded real-time systems toward larger and more complex critical software, whose performance needs are met with high-performance hardware features such as caches. Caches, however, hamper the derivation of worst-case execution time (WCET) estimates that hold for all program execution paths, since it must be shown that every cache layout has been properly factored into the WCET process. For measurement-based timing analysis, the most common analysis method, we provide a solution that achieves cache representativeness and full path coverage: we create a modified program for analysis purposes in which the cache impact is upper-bounded across every path, and we derive the minimum number of runs the test campaign needs to capture the cache layouts that result in high execution times.
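The "minimum number of runs" is the kind of quantity a simple probabilistic bound yields. As an illustrative sketch only (not the paper's actual derivation), assume each run samples a random cache layout and the layout of interest occurs with probability `p_worst`; the smallest number of runs that observes it with a target confidence is then:

```python
import math

def min_runs(p_worst: float, confidence: float) -> int:
    # Smallest n such that 1 - (1 - p_worst)**n >= confidence, i.e.
    # the probability of hitting the high-latency cache layout at
    # least once across n independent runs reaches the target.
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - p_worst))

# If the layout of interest appears in 10% of runs, observing it
# with 99% confidence takes 44 runs.
```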
Citations: 0
Closed yet Open DRAM: Achieving Low Latency and High Performance in DRAM Memory Systems
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196008
Lavanya Subramanian, Kaushik Vaidyanathan, Anant V. Nori, S. Subramoney, T. Karnik, Hong Wang
DRAM memory access is a critical performance bottleneck. To access one cache block, an entire row needs to be sensed and amplified, data restored into the bitcells and the bitlines precharged, incurring high latency. Isolating the bitlines and sense amplifiers after activation enables reads and precharges to happen in parallel. However, there are challenges in achieving this isolation. We tackle these challenges and propose an effective scheme, simultaneous read and precharge (SRP), to isolate the sense amplifiers and bitlines and serve reads and precharges in parallel. Our detailed architecture and circuit simulations demonstrate that our simultaneous read and precharge (SRP) mechanism is able to achieve an 8.6% performance benefit over baseline, while reducing sense amplifier idle power by 30%, as compared to prior work, over a wide range of workloads.
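The latency saving from overlapping reads and precharges can be illustrated with back-of-the-envelope timing arithmetic. The timing values below are hypothetical placeholders, not the paper's parameters; the point is only that isolating the sense amplifiers lets tRP hide behind the read:

```python
# Hypothetical, illustrative DRAM timing parameters in nanoseconds.
T_RCD = 14.0  # activate -> read command
CL    = 14.0  # read command -> data
T_RP  = 14.0  # precharge duration

def closed_page_latency():
    # Baseline: the read must fully complete before the precharge
    # starts, so the row cycle pays all three components in sequence.
    return T_RCD + CL + T_RP

def srp_latency():
    # Simultaneous read and precharge: once the sense amplifiers are
    # isolated from the bitlines, the precharge proceeds in parallel
    # with the read, so only the longer of the two remains visible.
    return T_RCD + max(CL, T_RP)
```

With these placeholder numbers the sequential row cycle costs 42 ns while the overlapped one costs 28 ns, a one-third reduction.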
Citations: 3
INVITED: Reducing Time and Effort in IC Implementation: A Roadmap of Challenges and Solutions
Pub Date : 2018-06-01 DOI: 10.1109/DAC.2018.8465871
A. Kahng
To reduce time and effort in IC implementation, fundamental challenges must be solved. First, the need for (expensive) humans must be removed wherever possible. Humans are skilled at predicting downstream flow failures, evaluating key early decisions such as RTL floorplanning, and deciding tool/flow options to apply to a given design. Achieving human-quality prediction, evaluation and decision-making will require new machine learning-centric models of both tools and designs. Second, to reduce design schedule, focus must return to the long-held dream of single-pass design. Future design tools and flows that never require iteration (i.e., that never fail, but without undue conservatism) demand new paradigms and core algorithms for parallel, cloud-based design automation. Third, learning-based models of tools and flows must continually improve with additional design experiences. Therefore, the EDA and design ecosystem must develop new infrastructure for ML model development and sharing.
Citations: 1
An Architecture-Agnostic Integer Linear Programming Approach to CGRA Mapping
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3195986
S. Alexander Chin, Jason H. Anderson
Coarse-grained reconfigurable architectures (CGRAs) have gained traction as a potential solution to implement accelerators for compute-intensive kernels, particularly in domains requiring hardware programmability. Architecture and CAD for CGRAs are tightly intertwined, with many prior works having combined architectures and tools. In this work, we present an architecture-agnostic integer linear programming (ILP) approach for CGRA mapping, integrated within an open-source CGRA architecture evaluation framework. The mapper accepts an application and an architecture description as input and can generate an optimal mapping, if indeed mapping is feasible. An experimental study demonstrates its effectiveness over a range of CGRA architectures.
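To make the mapping constraints concrete, here is a brute-force stand-in for the ILP (not the paper's formulation): assign each dataflow operation to a processing element so that every dataflow edge falls on a physical PE-to-PE link, and report infeasibility when no such assignment exists. The mesh topology and dataflow graph below are illustrative assumptions:

```python
from itertools import permutations

def map_dfg(dfg_edges, n_ops, pe_links, n_pes):
    # Try every injective assignment of operations to PEs and accept
    # the first one where each dataflow edge (u, v) is routed over a
    # physical link between the PEs hosting u and v.
    links = set(pe_links) | {(b, a) for a, b in pe_links}  # bidirectional
    for assign in permutations(range(n_pes), n_ops):
        if all((assign[u], assign[v]) in links for u, v in dfg_edges):
            return assign
    return None  # mapping infeasible on this architecture

# A 3-operation chain mapped onto a 2x2 mesh (PE ids 0..3).
mesh = [(0, 1), (1, 3), (2, 3), (0, 2)]
chain = [(0, 1), (1, 2)]
```

An ILP replaces this exponential search with binary placement variables and per-edge adjacency constraints, which is what makes the approach scale and, when a solution exists, provably optimal.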
Citations: 53
Extracting Data Parallelism in Non-Stencil Kernel Computing by Optimally Coloring Folded Memory Conflict Graph
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196088
Juan Escobedo, Mingjie Lin
Irregular memory access patterns in non-stencil kernel computing render the well-known hyperplane- [1], lattice- [2], or tessellation-based [3] HLS techniques ineffective. We develop an elegant yet effective technique that synthesizes a memory-optimal architecture from high-level software code in order to maximize application-specific data parallelism. Our basic idea is to exploit the graph structure embedded in the data access pattern and computation structure to perform memory banking that maximizes parallel memory accesses while conserving both hardware and energy consumption. Specifically, we priority-color a weighted conflict graph, generated by folding the fundamental conflict graph, to maximize the reduction in memory conflicts. Most interestingly, our graph-based methodology enables a straightforward tradeoff between the number of memory banks and the number of remaining memory conflicts. We empirically test our methodology with Vivado HLx 2015.4 on a standard Kintex-7 device across six benchmark computing kernels by measuring conflict reduction. In particular, our approach requires only 9.56% of the LUTs, 3.2% of the FFs, 2.5% of the BRAM, and 11.33% of the DSPs of the total available hardware resources to obtain a mapping function that achieves a 90% conflict reduction on a modified forward Gaussian elimination kernel with 4 simultaneous memory accesses.
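Priority coloring of a conflict graph can be sketched in a few lines: colors are bank indices, and an edge means two accesses conflict if placed in the same bank. This greedy version (visit vertices in decreasing weight, take the lowest bank index unused by already-colored neighbors) illustrates the general idea only, not the paper's optimal algorithm:

```python
def greedy_priority_coloring(conflict_edges, weights):
    # Build the adjacency structure of the conflict graph.
    neighbors = {}
    for u, v in conflict_edges:
        neighbors.setdefault(u, set()).add(v)
        neighbors.setdefault(v, set()).add(u)
    color = {}
    # Visit vertices in decreasing priority (weight), so the heaviest
    # conflicts are separated into distinct banks first.
    for v in sorted(weights, key=weights.get, reverse=True):
        used = {color[n] for n in neighbors.get(v, ()) if n in color}
        c = 0
        while c in used:  # smallest bank index not used by a neighbor
            c += 1
        color[v] = c
    return color
```

The number of distinct colors in the result is the number of memory banks the banking scheme needs, which is exactly the bank-count/conflict tradeoff the abstract mentions.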
Citations: 4
A Machine Learning Framework to Identify Detailed Routing Short Violations from a Placed Netlist
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3195975
Aysa Fakheri Tabrizi, L. Rakai, Nima Karimpour Darav, Ismail Bustany, L. Behjat, Shuchang Xu, A. Kennings
Detecting and preventing routing violations has become a critical issue in physical design, especially in the early stages. The weak correlation between global and detailed routing congestion estimates, and the long runtime required to frequently consult a global router, add to the problem. In this paper, we propose a machine learning framework to predict detailed routing short violations from a placed netlist. We identify the factors contributing to routing violations and implement a supervised neural network model to detect them. Experimental results show that the proposed method predicts on average 90% of the shorts with only 7% false alarms and considerably reduced computational time.
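As a toy stand-in for the paper's supervised neural network, the sketch below trains a logistic classifier on placement-derived feature vectors. The feature names and the training data are hypothetical, chosen only to show the shape of the approach: features in, short/no-short label out.

```python
import math

def train_logistic(samples, labels, lr=0.1, epochs=500):
    # Plain stochastic gradient descent on logistic loss; weight
    # vector plus a trailing bias term, all initialized to zero.
    w = [0.0] * (len(samples[0]) + 1)
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
            p = 1.0 / (1.0 + math.exp(-z))
            g = p - y  # derivative of the logistic loss w.r.t. z
            for i, xi in enumerate(x):
                w[i] -= lr * g * xi
            w[-1] -= lr * g
    return w

def predict(w, x):
    # 1 = "short violation expected", 0 = "clean region"
    z = w[-1] + sum(wi * xi for wi, xi in zip(w, x))
    return 1 if z > 0 else 0

# Hypothetical per-region features: [pin_density, local_congestion].
train_x = [[0.9, 0.8], [0.8, 0.9], [0.1, 0.2], [0.2, 0.1]]
train_y = [1, 1, 0, 0]
```

The paper's contribution is in the feature engineering and the neural network model; the training loop itself follows this standard supervised pattern.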
Citations: 53
Subutai: Distributed Synchronization Primitives in NoC Interfaces for Legacy Parallel-Applications
Pub Date : 2018-06-01 DOI: 10.1145/3195970.3196124
R. Cataldo, Ramon Fernandes, Kevin J. M. Martin, Martha Johanna Sepúlveda, A. Susin, C. Marcon, J. Diguet
Parallel applications are essential for efficiently using the computational power of a Multiprocessor System-on-Chip (MPSoC). Unfortunately, these applications do not scale effortlessly with the number of cores, because synchronization operations take away valuable computational time and restrict parallelization gains. Moreover, synchronization is a bottleneck in itself due to sequential access to shared memory. We address this issue and introduce "Subutai", a hardware/software (HW/SW) architecture designed to distribute essential synchronization mechanisms over the Network-on-Chip (NoC). It includes Network Interfaces (NIs), drivers, and a custom library for a NoC-based MPSoC architecture that speeds up the essential synchronization primitives of any legacy parallel application. In addition, we provide a fast simulation tool for parallel applications and a HW architecture for the NI. Experimental results with the PARSEC benchmark suite show an average application speedup of 2.05 compared to the same architecture running legacy SW solutions, at a 36% overhead in the HW architecture.
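The shared-memory contention that Subutai offloads to the network interface is visible in any plain software barrier, where every arriving thread serializes on one lock-protected counter. A minimal sketch of such a barrier (illustrative only, not the paper's primitive):

```python
import threading

class SharedCounterBarrier:
    # All parties funnel through a single lock-protected counter in
    # shared memory -- the sequential access the abstract identifies
    # as a bottleneck and Subutai moves into the NoC interface.
    def __init__(self, parties):
        self.parties = parties
        self.count = 0
        self.generation = 0
        self.cond = threading.Condition()

    def wait(self):
        with self.cond:
            gen = self.generation
            self.count += 1
            if self.count == self.parties:
                # Last arrival resets the counter and releases everyone
                # blocked in the current generation.
                self.count = 0
                self.generation += 1
                self.cond.notify_all()
            else:
                while gen == self.generation:
                    self.cond.wait()
```

Every `wait` call acquires the same lock and touches the same counter, so arrival handling is serialized regardless of core count; implementing the primitive in the NI removes that serialization from the cores' critical path.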
Citations: 5