
ACM Great Lakes Symposium on VLSI: latest publications

High level energy modeling of controller logic in data caches
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591590
P. Panda, Sourav Roy, Srikanth Chandrasekaran, Namita Sharma, Jasleen Kaur, Sarath Kumar Kandalam, N. Nagaraj
In modern embedded processor caches, a significant amount of energy dissipation occurs in the controller logic part of the cache. Previous power/energy modeling tools have focused on the core memory part of the cache. We propose energy models for two of these modules -- Write Buffer and Replacement logic. Since this hardware is generally synthesized by designers, our power models are also based on empirical data. We found a linear dependence of the per-access write buffer energy on the write buffer depth and write width. We validated our models on several different benchmark examples, using different technology nodes. Our models generate energy estimates that are within 4.2% of those measured by detailed power simulations, making the models valuable mechanisms for rapid energy estimates during architecture exploration.
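The linear dependence noted above can be written as a simple closed-form model. A minimal sketch, with hypothetical coefficients (the paper fits its own empirically from synthesized write-buffer designs):

```python
# Hypothetical coefficients: c0 (fixed per-access cost), c_depth and c_width
# (incremental pJ per entry and per bit). The paper's values are empirical.
def write_buffer_energy_pj(depth, width, c0=0.8, c_depth=0.05, c_width=0.02):
    """Per-access write buffer energy (pJ), linear in depth and write width."""
    return c0 + c_depth * depth + c_width * width

# Example: an 8-entry write buffer with a 64-bit write port.
energy = write_buffer_energy_pj(depth=8, width=64)
```

Such a closed-form model is what makes rapid energy estimates possible during architecture exploration, without re-running detailed power simulation for every configuration.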
Citations: 3
Horizontal benchmark extension for improved assessment of physical CAD research
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591540
A. Kahng, Hyein Lee, Jiajia Li
The rapid growth in complexity and diversity of IC designs, design flows and methodologies has resulted in a benchmark-centric culture for evaluation of performance and scalability in physical-design algorithm research. Landmark papers in the literature present vertical benchmarks that can be used across multiple design flow stages; artificial benchmarks with characteristics that mimic those of real designs; artificial benchmarks with known optimal solutions; as well as benchmark suites created by major companies from internal designs and/or open-source RTL. However, to our knowledge, there has been no work on horizontal benchmark creation, i.e., the creation of benchmarks that enable maximal, comprehensive assessments across commercial and academic tools at one or more specific design stages. Typically, the creation of horizontal benchmarks is limited by mismatches in data models, netlist formats, technology files, library granularity, etc. across different tools, technologies, and benchmark suites. In this paper, we describe methodology and robust infrastructure for "horizontal benchmark extension" that permits maximal leverage of benchmark suites and technologies in "apples-to-apples" assessment of both industry and academic optimizers. We demonstrate horizontal benchmark extensions, and the assessments that are thus enabled, in two well-studied domains: place-and-route (four combinations of academic placers/routers, and two commercial P&R tools) and gate sizing (two academic sizers, and three commercial tools). We also point out several issues and precepts for horizontal benchmark enablement.
Citations: 24
A performance enhancing hybrid locally mesh globally star NoC topology
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591544
T. S. Das, P. Ghosal, S. Mohanty, E. Kougianos
With the rapid increase in chip density, Network-on-Chip (NoC) is becoming the prevalent architecture for today's complex chip multiprocessor (CMP) based systems. One of the major challenges of the NoC is to design an enhanced, parallel-communication-centric, scalable architecture for on-chip communication. In this paper, a hybrid Mesh-based Star topology has been proposed to provide low latency, high throughput and more evenly distributed traffic throughout the network. Simulation results show that a maximum latency benefit of 62% (for size 8x8) and throughput benefits of 55% (for size 8x8) and 42% (for size 12x12) can be achieved for the proposed topology over mesh, with a small area overhead.
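The intuition behind a global star overlay can be sanity-checked with a quick hop-count calculation: under uniform random traffic, XY routing in a k x k mesh averages well over two hops, while any star-routed pair needs only two hops through the hub. This is an illustrative brute-force calculation only, not the paper's simulation setup:

```python
from itertools import product

def mesh_avg_hops(k):
    """Average XY-routing hop count between distinct nodes of a k x k mesh."""
    nodes = list(product(range(k), range(k)))
    dists = [abs(ax - bx) + abs(ay - by)
             for (ax, ay) in nodes for (bx, by) in nodes
             if (ax, ay) != (bx, by)]
    return sum(dists) / len(dists)

# A pure global star routes any non-local pair in 2 hops via the hub,
# whereas mesh average distance grows with k.
for k in (4, 8, 12):
    print(k, round(mesh_avg_hops(k), 2))
```

The growing gap with network size is what the hybrid topology exploits: local traffic stays on the mesh, while long-distance traffic takes the star shortcut.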
Citations: 3
A parallel and reconfigurable architecture for efficient OMP compressive sensing reconstruction
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591598
A. Kulkarni, H. Homayoun, T. Mohsenin
Compressive Sensing (CS) is a novel scheme, in which a signal that is sparse in a known transform domain can be reconstructed using fewer samples. However, the signal reconstruction techniques are computationally intensive and power consuming, which make them impractical for embedded applications. This work presents a parallel and reconfigurable architecture for Orthogonal Matching Pursuit (OMP) algorithm, one of the most popular CS reconstruction algorithms. In this paper, we are proposing the first reconfigurable OMP CS reconstruction architecture which can take different image sizes with sparsity up to 32. The aim is to minimize the hardware complexity, area and power consumption, and improve the reconstruction latency while meeting the reconstruction accuracy. First, the accuracy of reconstructed images is analyzed for different sparsity values and fixed point word length reduction. Next, efficient parallelization techniques are applied to reconstruct signals with variant signal lengths of N. The OMP algorithm is mainly divided into three kernels, where each kernel is parallelized to reduce execution time, and efficient reuse of the matrix operators allows us to reduce area. The proposed architecture can reconstruct images of different sizes and measurements and is implemented on a Xilinx Virtex 7 FPGA. The results indicate that, for a 128x128 image reconstruction, the proposed reconfigurable architecture is 2.67x to 1.8x faster than the previous non-reconfigurable work which is less complex and uses much smaller sparsity.
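For reference, the three-kernel structure of OMP mentioned above (atom selection, least-squares update, residual update) can be sketched in plain Python. This is a textbook software version on a toy dictionary, not the paper's hardware architecture:

```python
def solve(A, b):
    """Gaussian elimination with partial pivoting (for the normal equations)."""
    n = len(A)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k] for k in range(r + 1, n))) / M[r][r]
    return x

def omp(D, y, sparsity):
    """OMP: D is a list of dictionary atoms (columns), y the measurement."""
    residual = y[:]
    support = []
    coef = []
    for _ in range(sparsity):
        # Kernel 1: pick the atom most correlated with the residual.
        j = max(range(len(D)),
                key=lambda k: abs(sum(d * r for d, r in zip(D[k], residual))))
        if j not in support:
            support.append(j)
        A = [D[k] for k in support]
        # Kernel 2: least squares on the current support via normal equations.
        G = [[sum(u * v for u, v in zip(A[i], A[j2])) for j2 in range(len(A))]
             for i in range(len(A))]
        rhs = [sum(u * v for u, v in zip(A[i], y)) for i in range(len(A))]
        coef = solve(G, rhs)
        # Kernel 3: update the residual.
        approx = [sum(coef[i] * A[i][t] for i in range(len(A)))
                  for t in range(len(y))]
        residual = [yt - at for yt, at in zip(y, approx)]
    x = [0.0] * len(D)
    for i, k in enumerate(support):
        x[k] = coef[i]
    return x

# Recover x = 2*e0 + 3*e2 from a toy dictionary of 5 atoms in R^4.
D = [[1, 0, 0, 0], [0, 1, 0, 0], [0, 0, 1, 0], [0, 0, 0, 1],
     [0.5, 0.5, 0.5, 0.5]]
y = [2, 0, 3, 0]
x_hat = omp(D, y, 2)
```

The paper parallelizes each of these kernels in hardware; the least-squares kernel is where matrix-operator reuse saves area.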
Citations: 27
A comparison of FinFET based FPGA LUT designs
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591596
M. Abusultan, S. Khatri
The FinFET device has gained much traction in recent VLSI designs. In the FinFET device, the conduction channel is vertical, unlike a traditional bulk MOSFET, in which the conduction channel is planar. This yields several benefits, and as a consequence, it is expected that most VLSI designs will utilize FinFETs from the 20nm node and beyond. Despite the fact that several research papers have reported FinFET based circuit and layout realizations for popular circuit blocks, there has been no reported work on the use of FinFETs for Field Programmable Gate Array (FPGA) designs. The key circuit in the FPGA that enables programmability is the n-input Look-up Table (LUT). An n-input LUT can implement any logic function of up to n inputs. In this paper, we present an evaluation of several FPGA LUT designs. We compare these designs from a performance (delay, power, energy) as well as an area perspective. Comparisons are conducted with respect to a bulk based LUT as well. Our results demonstrate that all the FinFET based LUTs exhibit better delays and energy than the bulk based LUT. Based on our comparisons, we have two winning candidate LUTs, one for high performance designs (3X faster than a bulk based LUT) and another for low energy, area constrained designs (83% energy and 58% area compared to a bulk based LUT).
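The LUT behavior described here is easy to state in code: an n-input LUT is 2^n configuration bits indexed by the inputs. A minimal software model (names and the XOR example are ours, illustrative only):

```python
def make_lut(truth_table_bits):
    """Model an n-input LUT: 2**n configuration bits read by the input address."""
    config = list(truth_table_bits)

    def lut(*inputs):
        index = 0
        for bit in inputs:      # the inputs form the table's read address
            index = (index << 1) | bit
        return config[index]

    return lut

# Configure a 3-input LUT as XOR: table bit i is the parity of i.
xor3 = make_lut([bin(i).count("1") & 1 for i in range(8)])
```

Any logic function of up to n inputs is just a different choice of the 2^n table bits, which is exactly the programmability the FPGA LUT circuit implementations in the paper must provide.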
Citations: 6
Neural network-based accelerators for transcendental function approximation
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591534
Schuyler Eldridge, F. Raudies, D. Zou, A. Joshi
The general-purpose approximate nature of neural network (NN) based accelerators has the potential to sustain the historic energy and performance improvements of computing systems. We propose the use of NN-based accelerators to approximate mathematical functions in the GNU C Library (glibc) that commonly occur in application benchmarks. Using our NN-based approach to approximate cos, exp, log, pow, and sin we achieve an average energy-delay product (EDP) that is 68x lower than that of traditional glibc execution. In applications, our NN-based approach has an EDP 78% of that of traditional execution at the cost of an average mean squared error (MSE) of 1.56.
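As a software-only illustration of the idea (not the paper's accelerator, network topology, or training setup), a tiny one-hidden-layer network can be trained to approximate sin; the topology, learning rate, and epoch count below are arbitrary choices:

```python
import math
import random

random.seed(1)

# Tiny 1-8-1 tanh network trained by plain SGD to approximate sin(x) on
# [-pi, pi]. All hyperparameters are illustrative assumptions.
H = 8
w1 = [random.uniform(-1, 1) for _ in range(H)]
b1 = [random.uniform(-1, 1) for _ in range(H)]
w2 = [random.uniform(-1, 1) for _ in range(H)]
b2 = 0.0
xs = [-math.pi + i * 2 * math.pi / 63 for i in range(64)]

def forward(x):
    h = [math.tanh(w1[j] * x + b1[j]) for j in range(H)]
    return h, sum(w2[j] * h[j] for j in range(H)) + b2

def mse():
    return sum((forward(x)[1] - math.sin(x)) ** 2 for x in xs) / len(xs)

initial_mse = mse()
lr = 0.01
for _ in range(2000):
    for x in xs:
        h, y = forward(x)
        err = y - math.sin(x)
        for j in range(H):
            grad_h = err * w2[j] * (1 - h[j] ** 2)   # backprop through tanh
            w2[j] -= lr * err * h[j]
            w1[j] -= lr * grad_h * x
            b1[j] -= lr * grad_h
        b2 -= lr * err
```

Once trained, such a fixed-topology network is just multiplies, adds and activation lookups, which is what makes it attractive as a hardware substitute for an exact glibc call.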
Citations: 25
Reconfigurable STT-NV LUT-based functional units to improve performance in general-purpose processors
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591535
Adarsh Reddy Ashammagari, H. Mahmoodi, T. Mohsenin, H. Homayoun
Unavailability of functional units is a major performance bottleneck in general-purpose processors (GPP). In a GPP with a limited number of functional units, one functional unit may be heavily utilized at times, creating a performance bottleneck, while the other functional units remain under-utilized. We propose a novel idea for adapting functional units in the GPP architecture in order to overcome this challenge. For this purpose, a selected set of complex functional units that might be under-utilized, such as the multiplier and divider, are realized using a programmable lookup-table-based fabric. This allows for run-time adaptation of functional units to improve performance. The programmable lookup tables are realized using magnetic tunnel junction (MTJ) based memories that dissipate near-zero leakage and are CMOS compatible. We have applied this idea to a dual-issue architecture. The results show that, compared to a design with all-CMOS functional units, a performance improvement of 18% on average is achieved for standard benchmarks.
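The core idea of realizing a complex functional unit as a programmed table can be sketched as follows; the 4-bit width is illustrative, and the sketch models only the functional behavior, not the MTJ fabric or its reconfiguration:

```python
# "Program" a 4-bit multiplier into a lookup table: one entry per operand
# pair. A reconfigurable fabric could later be reprogrammed with a different
# function (e.g. divide) in the same table storage.
WIDTH = 4
MASK = (1 << WIDTH) - 1
MUL_LUT = [[a * b for b in range(1 << WIDTH)] for a in range(1 << WIDTH)]

def lut_multiply(a, b):
    """'Execute' a multiply by reading the programmed table."""
    return MUL_LUT[a & MASK][b & MASK]
```

The table grows as 2^(2*WIDTH) entries, which is why only selected, frequently bottlenecked units are worth mapping onto the LUT fabric.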
Citations: 5
Trade-off between energy and quality of service through dynamic operand truncation and fusion
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591561
Wenchao Qian, Robert Karam, S. Bhunia
Energy efficiency has emerged as a major design concern for embedded and portable electronics. Conventional approaches typically impact performance and often require significant design-time modifications. In this paper, we propose a novel approach for improving energy efficiency through judicious fusion of operations. The proposed approach has two major distinctions: (1) the fusion is enabled by operand truncation, which allows representing multiple operations into a reasonably sized lookup table (LUT); and (2) it works for large varieties of functions. Most applications in the domain of digital signal processing (DSP) and graphics can tolerate some computation error without large degradation in output quality. Our approach improves energy efficiency with graceful degradation in quality. The proposed fusion approach can be applied to trade-off energy efficiency with quality at run time and requires virtually no circuit or architecture level modifications in a processor. Using our software tool for automatic fusion and truncation, the effectiveness of the approach is studied for four common applications. Simulation results show promising improvements (19-90%) in energy delay product with minimal impact on quality.
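Operand truncation itself is straightforward. A minimal sketch (the drop_bits value and operands are arbitrary examples) shows how zeroing low-order bits trades accuracy for a smaller fused lookup table:

```python
def truncate(x, drop_bits):
    """Zero the low-order bits of an operand (the truncation step)."""
    return (x >> drop_bits) << drop_bits

def approx_multiply(a, b, drop_bits=4):
    """Multiply on truncated operands; with truncation, the fused operation
    needs a table indexed by far fewer distinct operand values."""
    return truncate(a, drop_bits) * truncate(b, drop_bits)

# Quality/energy trade-off: more truncation -> smaller table, larger error.
exact = 1000 * 3000
approx = approx_multiply(1000, 3000)
rel_err = abs(exact - approx) / exact
```

Dropping k bits from each operand shrinks a two-operand table by a factor of 2^(2k), which is what makes a reasonably sized LUT for fused operations feasible at a bounded quality cost.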
Citations: 2
Forward-scaling, serially equivalent parallelism for FPGA placement
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591543
C. Fobel, G. Grewal, D. Stacey
Placement run-times continue to dominate the FPGA design flow. Previous attempts at parallel placement methods either only scale to a few threads or result in a significant loss in solution quality as thread-count is increased. We propose a novel method for generating large amounts of parallel work for placement, which scales with the size of the target architecture. Our experimental results show that we nearly reach the limit of the number of possible parallel swaps, while improving critical-path-delay 4.7% compared to VPR. While our proposed implementation currently utilizes a single thread, we still achieve speedups of 13.3x over VPR.
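One simple way to obtain serially equivalent parallelism, sketched here under our own assumptions rather than as the paper's exact method, is to batch proposed cell swaps so that no two swaps in a batch touch the same cell; each batch can then be applied in parallel with the same result as applying it serially:

```python
def conflict_free_batches(swaps):
    """Greedily pack proposed swaps (pairs of cell ids) into batches in which
    no cell appears twice; within a batch, parallel application is
    order-independent and therefore serially equivalent."""
    batches = []  # list of (swap_list, used_cell_set)
    for a, b in swaps:
        for batch, used in batches:
            if a not in used and b not in used:
                batch.append((a, b))
                used.update((a, b))
                break
        else:
            batches.append(([(a, b)], {a, b}))
    return [batch for batch, _ in batches]

proposed = [(1, 2), (3, 4), (2, 5), (6, 7), (4, 6)]
batches = conflict_free_batches(proposed)
```

The amount of parallel work then scales with the number of non-conflicting swaps available, which grows with the size of the target architecture.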
Citations: 3
A current-mode CMOS/memristor hybrid implementation of an extreme learning machine
Pub Date : 2014-05-20 DOI: 10.1145/2591513.2591572
Cory E. Merkel, D. Kudithipudi
In this work, we propose a current-mode CMOS/memristor hybrid implementation of an extreme learning machine (ELM) architecture. We present novel circuit designs for linear, sigmoid,and threshold neuronal activation functions, as well as memristor-based bipolar synaptic weighting. In addition, this work proposes a stochastic version of the least-mean-squares (LMS) training algorithm for adapting the weights between the ELM's hidden and output layers. We simulated our top-level ELM architecture using Cadence AMS Designer with 45 nm CMOS models and an empirical piecewise linear memristor model based on experimental data from an HfOx device. With 10 hidden node neurons, the ELM was able to learn a 2-input XOR function after 150 training epochs.
在这项工作中,我们提出了一种电流模式CMOS/忆阻器混合实现的极限学习机(ELM)架构。我们提出了线性、s型和阈值神经元激活函数的新电路设计,以及基于记忆电阻器的双极突触加权。此外,这项工作提出了一种随机版本的最小均方(LMS)训练算法,用于适应ELM的隐藏层和输出层之间的权重。我们使用Cadence AMS Designer模拟了我们的顶层ELM架构,采用45纳米CMOS模型和基于HfOx器件实验数据的经验分段线性忆阻器模型。通过10个隐藏节点神经元,ELM能够在150次训练后学习一个2输入异或函数。
{"title":"A current-mode CMOS/memristor hybrid implementation of an extreme learning machine","authors":"Cory E. Merkel, D. Kudithipudi","doi":"10.1145/2591513.2591572","DOIUrl":"https://doi.org/10.1145/2591513.2591572","url":null,"abstract":"In this work, we propose a current-mode CMOS/memristor hybrid implementation of an extreme learning machine (ELM) architecture. We present novel circuit designs for linear, sigmoid,and threshold neuronal activation functions, as well as memristor-based bipolar synaptic weighting. In addition, this work proposes a stochastic version of the least-mean-squares (LMS) training algorithm for adapting the weights between the ELM's hidden and output layers. We simulated our top-level ELM architecture using Cadence AMS Designer with 45 nm CMOS models and an empirical piecewise linear memristor model based on experimental data from an HfOx device. With 10 hidden node neurons, the ELM was able to learn a 2-input XOR function after 150 training epochs.","PeriodicalId":272619,"journal":{"name":"ACM Great Lakes Symposium on VLSI","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131008868","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
引用次数: 19
期刊
ACM Great Lakes Symposium on VLSI
全部 Acc. Chem. Res. ACS Applied Bio Materials ACS Appl. Electron. Mater. ACS Appl. Energy Mater. ACS Appl. Mater. Interfaces ACS Appl. Nano Mater. ACS Appl. Polym. Mater. ACS BIOMATER-SCI ENG ACS Catal. ACS Cent. Sci. ACS Chem. Biol. ACS Chemical Health & Safety ACS Chem. Neurosci. ACS Comb. Sci. ACS Earth Space Chem. ACS Energy Lett. ACS Infect. Dis. ACS Macro Lett. ACS Mater. Lett. ACS Med. Chem. Lett. ACS Nano ACS Omega ACS Photonics ACS Sens. ACS Sustainable Chem. Eng. ACS Synth. Biol. Anal. Chem. BIOCHEMISTRY-US Bioconjugate Chem. BIOMACROMOLECULES Chem. Res. Toxicol. Chem. Rev. Chem. Mater. CRYST GROWTH DES ENERG FUEL Environ. Sci. Technol. Environ. Sci. Technol. Lett. Eur. J. Inorg. Chem. IND ENG CHEM RES Inorg. Chem. J. Agric. Food. Chem. J. Chem. Eng. Data J. Chem. Educ. J. Chem. Inf. Model. J. Chem. Theory Comput. J. Med. Chem. J. Nat. Prod. J PROTEOME RES J. Am. Chem. Soc. LANGMUIR MACROMOLECULES Mol. Pharmaceutics Nano Lett. Org. Lett. ORG PROCESS RES DEV ORGANOMETALLICS J. Org. Chem. J. Phys. Chem. J. Phys. Chem. A J. Phys. Chem. B J. Phys. Chem. C J. Phys. Chem. Lett. Analyst Anal. Methods Biomater. Sci. Catal. Sci. Technol. Chem. Commun. Chem. Soc. Rev. CHEM EDUC RES PRACT CRYSTENGCOMM Dalton Trans. Energy Environ. Sci. ENVIRON SCI-NANO ENVIRON SCI-PROC IMP ENVIRON SCI-WAT RES Faraday Discuss. Food Funct. Green Chem. Inorg. Chem. Front. Integr. Biol. J. Anal. At. Spectrom. J. Mater. Chem. A J. Mater. Chem. B J. Mater. Chem. C Lab Chip Mater. Chem. Front. Mater. Horiz. MEDCHEMCOMM Metallomics Mol. Biosyst. Mol. Syst. Des. Eng. Nanoscale Nanoscale Horiz. Nat. Prod. Rep. New J. Chem. Org. Biomol. Chem. Org. Chem. Front. PHOTOCH PHOTOBIO SCI PCCP Polym. Chem.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1