
2014 Design, Automation & Test in Europe Conference & Exhibition (DATE): latest papers

An efficient manipulation package for Biconditional Binary Decision Diagrams
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.309
L. Amarù, P. Gaillardon, G. Micheli
Biconditional Binary Decision Diagrams (BBDDs) are a novel class of binary decision diagrams where the branching condition, and its associated logic expansion, is biconditional on two variables. Reduced and ordered BBDDs are remarkably compact and unique for a given Boolean function. In order to exploit BBDDs in Electronic Design Automation (EDA) applications, efficient manipulation algorithms must be developed and integrated in a software package. In this paper, we present the theory for efficient BBDD manipulation and its practical software implementation. The key features of the proposed approach are strong canonical form pre-conditioning of stored BBDD nodes, recursive formulation of Boolean operations in terms of biconditional expansions, performance-oriented memory management, and dedicated BBDD re-ordering techniques. Experimental results show that the developed BBDD package achieves an average node count reduction of 19.48% and a speed-up factor of 1.63x with respect to a state-of-the-art decision diagram manipulation package. Employed in the synthesis of datapath circuits, the BBDD manipulation package is able to advantageously restructure arithmetic operations, producing 11.02% smaller and 32.29% faster circuits compared to a commercial synthesis flow.
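The biconditional expansion mentioned in the abstract can be illustrated with a minimal Python sketch. It assumes the expansion takes the form f = (v XOR w)·f(v≠w) + (v XNOR w)·f(v=w), where the two cofactors are obtained by substituting v with NOT w and with w respectively. The function names (`biconditional_expand`, `majority`, `equivalent`) are hypothetical, and nothing here reflects the authors' actual package; it only checks the identity on a small truth table.

```python
from itertools import product


def biconditional_expand(f):
    """Rebuild f from its two biconditional cofactors on v = x[0], w = x[1]."""
    def rebuilt(*x):
        v, w, rest = x[0], x[1], x[2:]
        f_neq = f(1 - w, w, *rest)  # cofactor with v replaced by NOT w
        f_eq = f(w, w, *rest)       # cofactor with v replaced by w
        differ = v ^ w              # branching condition: v XOR w
        return (differ & f_neq) | ((1 - differ) & f_eq)
    return rebuilt


def majority(a, b, c):
    """Toy 3-input Boolean function used as a test subject."""
    return int(a + b + c >= 2)


def equivalent(f, g, n):
    """Exhaustively compare two n-input Boolean functions."""
    return all(f(*x) == g(*x) for x in product((0, 1), repeat=n))


print(equivalent(biconditional_expand(majority), majority, 3))
```

Each node in such a diagram branches on whether two variables agree rather than on the value of one variable, which is what the `differ` term models.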
Citations: 10
Increasing the efficiency of syndrome coding for PUFs with helper data compression
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.084
Matthias Hiller, G. Sigl
Physical Unclonable Functions (PUFs) provide secure cryptographic keys for resource-constrained embedded systems without secure storage. A PUF measures internal manufacturing variations to create a unique but noisy secret inside a device. Syndrome coding schemes create and store helper data about the structure of a specific PUF to correct errors within subsequent PUF measurements and generate a reliable key. This helper data can contain redundancy. We analyze existing schemes and show that data compression can be applied to decrease the size of the helper data of existing implementations. We introduce compressed Differential Sequence Coding (DSC), which is the most efficient syndrome coding scheme known to date for a popular reference scenario. Adding helper data compression to the DSC algorithm leads to an overall decrease of 68% in helper data size compared to other algorithms in a reference scenario. This is achieved without increasing the number of PUF bits and with only a minimal increase in logic size.
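To make the role of helper data concrete, here is a minimal Python sketch of a generic code-offset construction with a repetition code. It is not the DSC scheme from the paper and does not include any compression; the repetition factor `REP` and the function names `enroll` and `reconstruct` are assumptions chosen for illustration only.

```python
REP = 5  # assumed repetition factor: each key bit is spread over 5 PUF bits


def enroll(puf_bits, key_bits):
    """Create helper data: PUF response XOR repetition codeword of the key."""
    codeword = [b for b in key_bits for _ in range(REP)]
    return [p ^ c for p, c in zip(puf_bits, codeword)]


def reconstruct(noisy_puf_bits, helper):
    """Recover the key from a noisy PUF measurement and the stored helper data."""
    noisy_codeword = [p ^ h for p, h in zip(noisy_puf_bits, helper)]
    key = []
    for i in range(0, len(noisy_codeword), REP):
        block = noisy_codeword[i:i + REP]
        key.append(int(sum(block) > REP // 2))  # majority vote per key bit
    return key


puf = [1, 0, 1, 1, 0, 0, 0, 1, 0, 1]
key = [1, 0]
helper = enroll(puf, key)

noisy = list(puf)
noisy[0] ^= 1  # one bit error in the first block
noisy[7] ^= 1  # one bit error in the second block
print(reconstruct(noisy, helper) == key)
```

The helper data in this sketch is as long as the PUF response itself, which illustrates why reducing helper data size is a relevant goal.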
Citations: 20
Temporal memoization for energy-efficient timing error recovery in GPGPUs
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.113
Abbas Rahimi, L. Benini, Rajesh K. Gupta
Manufacturing and environmental variability lead to timing errors in computing systems that are typically corrected by error detection and correction mechanisms at the circuit level. The cost and speed of recovery can be improved by memoization-based optimization methods that exploit spatial or temporal parallelism in suitable computing fabrics such as general-purpose graphics processing units (GPGPUs). We propose here a temporal memoization technique for use in floating-point units (FPUs) in GPGPUs that uses value locality inside data-parallel programs. The technique recalls (memorizes) the context of error-free execution of an instruction on an FPU. To enable scalable and independent recovery, a single-cycle lookup table (LUT) is tightly coupled to every FPU to maintain contexts of recent error-free executions. The LUT reuses these memorized contexts to exactly, or approximately, correct errant FP instructions based on application needs. In real-world applications, the temporal memoization technique achieves an average energy saving of 8%-28% for a wide range of timing error rates (0%-4%) and outperforms recent advances in resilient architectures. This technique also enhances robustness in the voltage overscaling regime and achieves a relative average energy saving of 66% with 11% voltage overscaling.
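The LUT-based recovery described above can be sketched in software as follows. This is a toy Python model under stated assumptions, not the hardware design from the paper: the LUT depth, the oldest-first eviction policy, the `fpu` stand-in, and the names `MemoLUT` and `execute` are all hypothetical, and only exact operand matching is shown.

```python
from collections import OrderedDict


class MemoLUT:
    """Small table of recent error-free executions, keyed by opcode and operands."""

    def __init__(self, depth=2):
        self.depth = depth            # assumed number of entries
        self.entries = OrderedDict()  # (opcode, operands) -> result

    def record(self, opcode, operands, result):
        key = (opcode, operands)
        self.entries[key] = result
        self.entries.move_to_end(key)
        while len(self.entries) > self.depth:
            self.entries.popitem(last=False)  # evict the oldest entry

    def lookup(self, opcode, operands):
        return self.entries.get((opcode, operands))


def fpu(opcode, operands):
    """Toy stand-in for a floating-point unit."""
    a, b = operands
    return a + b if opcode == "add" else a * b


def execute(lut, opcode, operands, timing_error):
    """Return (result, how): reuse a memorized context when one matches."""
    hit = lut.lookup(opcode, operands)
    if hit is not None:
        return hit, "reused"          # value locality masks the error, no replay
    result = fpu(opcode, operands)
    if timing_error:
        result = fpu(opcode, operands)  # stand-in for a costly baseline recovery
        return result, "replayed"
    lut.record(opcode, operands, result)  # memorize the error-free context
    return result, "executed"


lut = MemoLUT()
print(execute(lut, "add", (1.0, 2.0), timing_error=False))
print(execute(lut, "add", (1.0, 2.0), timing_error=True))
print(execute(lut, "mul", (2.0, 4.0), timing_error=True))
```

An approximate variant would relax the key comparison, for example by ignoring low-order mantissa bits, at the cost of result accuracy.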
Citations: 17
Efficiency of a glitch detector against electromagnetic fault injection
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.216
Loïc Zussa, Amine Dehbaoui, Karim Tobich, J. Dutertre, P. Maurine, L. Guillaume-Sage, J. Clédière, A. Tria
The use of electromagnetic glitches has recently emerged as an effective fault injection technique for the purpose of conducting physical attacks against integrated circuits. Early research has shown that electromagnetic faults are induced by timing constraint violations and that they are also located in the vicinity of the injection probe. This paper reports a study of the efficiency of a glitch detector against EM injection. This detector was originally designed to detect any attempt to induce timing violations by means of clock or power glitches. Because electromagnetic disturbances are more local than global, the use of a single detector proved to be inefficient. We report our subsequent investigation of the use of several detectors to obtain full fault detection coverage; it also provides further insights into the properties of electromagnetic injection and into the key role played by the injection probe.
Citations: 73
EVX: Vector execution on low power EDGE cores
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.035
M. Duric, Oscar Palomar, Aaron Smith, O. Unsal, A. Cristal, M. Valero, D. Burger
In this paper, we present a vector execution model that provides the advantages of vector processors on low-power, general-purpose cores, with limited additional hardware. While accelerating data-level parallel (DLP) workloads, the vector model increases efficiency and hardware resource utilization. We use a modest dual-issue core based on an Explicit Data Graph Execution (EDGE) architecture to implement our approach, called EVX. Unlike most DLP accelerators, which utilize additional hardware and increase the complexity of low-power processors, EVX leverages the available resources of EDGE cores and, at minimal cost, allows for specialization of the resources. EVX adds control logic that increases the core area by 2.1%. We show that EVX yields an average speedup of 3x compared to a scalar baseline and outperforms multimedia SIMD extensions.
Citations: 7
Optimization of design complexity in time-multiplexed constant multiplications
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.313
L. Aksoy, P. Flores, J. Monteiro
The multiplication of constants by a data input is an essential operation in digital signal processing (DSP) systems. For applications requiring a large number of constant multiplications under stringent hardware constraints, it is generally realized under a folded architecture, where a single constant selected from a set of multiple constants is multiplied by the data input at any given time, called time-multiplexed constant multiplication (TMCM). This paper addresses the problem of optimizing the complexity of a TMCM design and introduces an algorithm that finds the least complex TMCM design by sharing the logic operators, i.e., adders, subtractors, adders/subtractors, and multiplexors (MUXes). It includes efficient search methods, yielding better results than existing TMCM algorithms.
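As a small illustration of operator sharing in a TMCM block, the following Python sketch multiplies an input by either 5 or 7 using one shifter select and one shared adder/subtractor. The constant set {5, 7}, the function name `tmcm_5_or_7`, and the decomposition are assumptions chosen for this example; they are not taken from the paper and say nothing about its search algorithm.

```python
def tmcm_5_or_7(x, sel):
    """Multiply x by 5 (sel = 0) or by 7 (sel = 1) with shared operators.

    5x = (x << 2) + x and 7x = (x << 3) - x, so a MUX can choose the
    shift amount and one adder/subtractor serves both constants.
    """
    shift = 3 if sel else 2  # MUX selecting the shift amount
    shifted = x << shift     # shifts are free wiring in hardware
    return shifted - x if sel else shifted + x  # shared adder/subtractor


print(tmcm_5_or_7(6, 0), tmcm_5_or_7(6, 1))
```

A naive design would use a separate adder for 5x and a separate subtractor for 7x plus an output MUX; sharing one adder/subtractor is the kind of saving a TMCM optimizer looks for.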
Citations: 4
A multiple fault injection methodology based on cone partitioning towards RTL modeling of laser attacks
Pub Date : 2014-03-24 DOI: 10.7873/DATE2014.219
Athanasios Papadimitriou, D. Hély, V. Beroulle, P. Maistri, R. Leveugle
Laser attacks, especially on circuits manufactured with recent deep submicron semiconductor technologies, pose a threat to secure integrated circuits due to the multiplicity of errors induced by a single attack. An efficient way to neutralize such effects is the design of appropriate countermeasures, according to the circuit implementation and characteristics. Therefore, tools that allow the early evaluation of security implementations are necessary. Our efforts involve the development of an RTL fault injection approach more representative of laser attacks than random multi-bit fault injections, and the utilization and evolution of state-of-the-art emulation techniques to reduce the duration of the fault injection campaigns. This will ultimately lead to the design and validation of new countermeasures against laser attacks on ASICs implementing cryptographic algorithms.
Citations: 26
A multi banked - Multi ported - Non blocking shared L2 cache for MPSoC platforms
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.093
Igor Loi, L. Benini
On-chip L2 cache architectures, well established in high-performance parallel computing systems, are now becoming a performance-critical component also for multi/many-core architectures targeted at lower-power, embedded applications. The very stringent requirements on power and cost of these systems result in one of the key challenges in many-core designs, mandating the deployment of highly efficient L2 caches. In this perspective, sharing the L2 cache layer among all system cores has important advantages, such as increased utilization, fast inter-core communication, and reduced aggregate footprint because no undesired replication of lines occurs. This paper presents a novel architecture for a shared L2 cache system with multi-port and multi-bank features. We target this L2 cache to a many-core platform based on a hierarchical cluster structure that does not employ private data caches, and therefore does not require complex coherency mechanisms. In fact, our shared L2 cache can be seen logically as a Last Level Cache (LLC), adopting the terminology of higher-performance many-core products, although in the latter the LLC is more often an L3 layer. Our experimental results show a maximum aggregate bandwidth of 28 GB/s (89% of the maximum channel capacity) for 100% hit traffic with random banking conflicts, as a realistic case. Physical implementation results in 28 nm Fully-Depleted-Silicon-on-Insulator (FDSoI) show that our L2 cache can operate at up to 1 GHz with a memory density loss of only 20% with respect to an L2 scratchpad for a 2 MB configuration.
Citations: 10
Word-line power supply selector for stability improvement of embedded SRAMs in high reliability applications
Pub Date : 2014-03-24 DOI: 10.5555/2616606.2616804
B. Alorda, C. Carmona, S. Bota
Embedded SRAM yield dominates the overall ASIC yield; therefore, methodologies centered on improving SRAM cell stability will become a mandatory part of the design. Word-line voltage modulation has been shown to improve cell stability during access operations. The high variability of physical and performance parameters introduces the need to adopt adaptable solutions to adequately improve SRAM cell stability. In this work, we present a word-line voltage selector circuit designed to modulate the word-line power-supply voltage at each individual embedded SRAM block. The final area overhead is minimal, and several strategies can be implemented with the embedded SRAM that allow the word-line voltage value to be adjusted during the life of the ASIC, taking into account different operation, aging, and degradation effects.
Citations: 7
Effective post-silicon failure localization using dynamic program slicing
Pub Date : 2014-03-24 DOI: 10.7873/DATE.2014.332
Ophir Friedler, W. Kadry, A. Morgenshtein, Amir Nahir, V. Sokhin
In post-silicon functional validation, one of the most complex and time-consuming processes is the localization of an instruction that exposes a bug detected at system level. The task is particularly difficult due to the silicon's limited observability and the long time between a failure's occurrence and its detection. We propose a novel method that automates the architectural localization of post-silicon test-case failures. Our proposed tool analyzes a failing test-case, while leveraging the information derived from executing the test on an Instruction Set software Simulator (ISS), to identify a set of instructions that could lead to the faulty final state. The proposed failure localization process comprises creating a resource dependency graph based on the execution of the test-case on the ISS, determining a program slice of instructions that influence the faulty resources, and reducing the set of suspicious instructions by leveraging the knowledge of the correct resources. We evaluate our proposed solution through extensive experiments. Experimental results show that, in over 97% of all cases, our method was able to narrow down the list of suspicious instructions to under 2 instructions, on average, out of over 200. In over 59% of all cases, our method correctly reduced a test-case to a single faulty instruction.
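The three steps named in the abstract (dependency information from an ISS run, a backward slice from the faulty resources, and pruning with known-correct resources) can be sketched in Python as below. This is a simplified illustration under stated assumptions, not the authors' tool: the trace format, the function name `backward_slice`, and the pruning rule (do not follow source resources whose final value is known to be correct) are all hypothetical.

```python
def backward_slice(trace, faulty, known_correct=frozenset()):
    """Return the ids of instructions that may explain the faulty resources.

    trace: list of (instr_id, dests, srcs) in execution order, as an
    ISS run might report them (assumed format).
    """
    suspects = []
    tainted = set(faulty)
    for instr_id, dests, srcs in reversed(trace):
        written = tainted & set(dests)
        if not written:
            continue                # instruction does not affect a tainted resource
        suspects.append(instr_id)
        tainted -= written          # earlier writes to these resources are dead
        tainted |= set(srcs) - set(known_correct)  # assumed pruning rule
    return list(reversed(suspects))


trace = [
    (1, ["r1"], []),            # r1 = load
    (2, ["r2"], []),            # r2 = load
    (3, ["r3"], ["r1", "r2"]),  # r3 = r1 + r2
    (4, ["r4"], ["r2"]),        # r4 = r2 * 2
    (5, ["r5"], ["r3"]),        # r5 = r3 + 1
]

print(backward_slice(trace, {"r5"}))
print(backward_slice(trace, {"r5"}, known_correct={"r1", "r2"}))
```

In this toy trace, instruction 4 never enters the slice because nothing faulty depends on `r4`, and marking `r1` and `r2` as correct removes their producers from the suspect list.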
Citations: 10