2015 IEEE/ACM International Symposium on Low Power Electronics and Design (ISLPED): latest papers

Hierarchical power budgeting for Dark Silicon chips
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273516
M. U. Khan, M. Shafique, J. Henkel
The emerging Dark Silicon limitation has led the application designers to carefully consider the available Thermal Design Power (TDP) budgets, hardware resources, and software characteristics. In this paper, we propose a hierarchical scheme for distributing the resources and TDP budget among concurrently executing applications with multi-threaded workloads under throughput constraints. Afterwards, the application-level TDP budget is partitioned among its threads depending upon their workloads, which can then be fine-tuned at run time considering workload variations. We evaluate our scheme for the next-generation, multi-threaded, High Efficiency Video Codec and demonstrate that up to 30.86% higher throughput is achieved compared to the state-of-the-art.
Citations: 11
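The two-level split described in the abstract above (chip TDP to applications, then each application's share to its threads) can be illustrated with a minimal sketch. The proportional-share policy, the function names, and the input layout are assumptions made for illustration; the abstract does not specify how the paper actually computes the shares.

```python
def split_budget(total, weights):
    """Split `total` in proportion to `weights` (hypothetical policy)."""
    weight_sum = sum(weights)
    if weight_sum <= 0:
        raise ValueError("weights must sum to a positive value")
    return [total * w / weight_sum for w in weights]


def hierarchical_budget(tdp_watts, apps):
    """Two-level budgeting sketch.

    apps maps an application name to (app_weight, [per-thread workloads]).
    Returns a dict mapping each name to its list of per-thread budgets in watts.
    """
    names = list(apps)
    # Level 1: share the chip-level TDP among applications.
    app_budgets = split_budget(tdp_watts, [apps[n][0] for n in names])
    # Level 2: share each application's budget among its threads.
    return {n: split_budget(b, apps[n][1]) for n, b in zip(names, app_budgets)}
```

Calling hierarchical_budget again with updated thread workloads would correspond, loosely, to the run-time fine-tuning step the abstract mentions.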
Enabling energy efficient Hybrid Memory Cube systems with erasure codes
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273492
Shibo Wang, Yanwei Song, M. N. Bojnordi, Engin Ipek
The Hybrid Memory Cube (HMC) is a promising alternative to DDRx memory due to its potential to achieve significantly higher bandwidth. However, the high static power of an HMC device compromises power efficiency when the device is lightly utilized. Activating a sleeping HMC takes over 2μs, which makes it challenging to manage HMC power without a substantial degradation in system performance. We introduce a new technique that alleviates the long wakeup penalty of an HMC by employing erasure codes. Inaccessible data stored in a sleeping HMC module can be reconstructed by decoding related data retrieved from other active HMCs, rather than waiting for the sleeping HMC module to become active. This approach makes it possible to tolerate the latency penalty incurred when switching an HMC between active and sleep modes, thereby enabling a power-capped HMC system. Simulations show that the proposed architecture outperforms a current HMC-based multicore system by 6.2×, and reduces the system energy by 5.3× under the same power budget as the multicore baseline.
Citations: 5
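The core idea in the abstract above, rebuilding data held by a sleeping module from data in active modules instead of waiting for a wakeup, can be shown with the simplest erasure code, a single XOR parity block. This is only a sketch under that assumption: the abstract does not say which code the paper uses, and the function names and the in-memory "modules" list are hypothetical.

```python
from functools import reduce


def xor_blocks(blocks):
    """Byte-wise XOR of equal-length byte strings."""
    if not blocks:
        raise ValueError("need at least one block")
    if any(len(b) != len(blocks[0]) for b in blocks):
        raise ValueError("blocks must have equal length")
    return bytes(reduce(lambda x, y: x ^ y, column) for column in zip(*blocks))


def read_block(index, modules, parity, asleep):
    """Return the block stored in module `index`.

    If that module is asleep, rebuild its block from the other modules
    plus the parity block rather than waking it up.
    """
    if index not in asleep:
        return modules[index]
    other_ids = [i for i in range(len(modules)) if i != index]
    if any(i in asleep for i in other_ids):
        # A single parity block tolerates only one unavailable module.
        raise RuntimeError("too many sleeping modules to reconstruct")
    return xor_blocks([modules[i] for i in other_ids] + [parity])
```

With single parity, only one module in a group may sleep at a time; a stronger code would be needed to keep several modules asleep at once.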
An energy efficient and low cross-talk CMOS sub-THz I/O with surface-wave modulator and interconnect
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273499
Yuan Liang, Hao Yu, Junfeng Zhao, Wei Yang, Yuangang Wang
Free-space EM-wave based GHz interconnect has significant loss and crosstalk that cannot be deployed as low-power and dense I/Os for future network-on-chip (NoC) integration of many-core and memory. This paper proposes an energy-efficient and low-crosstalk sub-THz (0.1T-1T) I/O with use of surface-wave based modulator and interconnects in CMOS. By introducing sub-wavelength periodical corrugation structure onto transmission line, the surface-wave is established to propagate signal that is strongly localized on surface of top-layer metal wire, which results in low coupling into lossy substrate and neighboring metal wires. As such, significant power saving and cross-talk reduction can be observed with high communication bandwidth. In addition, a high on/off-ratio surface-wave modulator is also proposed to support on-chip THz communication. As designed in 65nm CMOS, the results have shown that the proposed surface-wave I/O interface achieves 25Gbps data rate and 0.016pJ/bit/mm energy efficiency at 140GHz carrier frequency over 20mm surface-wave channels. They can be placed with 2.4μm channel spacing and a -20dB crosstalk ratio. The surface-wave modulator also achieves significant reduction of radiation loss with 23dB extinction ratio.
Citations: 17
Power-efficient embedded processing with resilience and real-time constraints
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273519
Liang Wang, Augusto J. Vega, A. Buyuktosunoglu, P. Bose, K. Skadron
Low-power embedded processing typically relies on dynamic voltage-frequency scaling (DVFS) in order to optimize energy usage (and therefore, battery life). However, low voltage operation exacerbates the incidence of soft errors. Similarly, higher voltage operation (to meet real-time deadlines) is constrained by hard-failure rate limits. In this paper, we examine a class of embedded system applications relevant to mobile vehicles. We investigate the problem of assigning optimal voltage-frequency settings to individual segments within target workflows. The goal of this study is to understand the limits of achievable energy efficiency (performance per watt) under varying levels of system resilience constraints. To optimize for energy efficiency, we consider static optimization of voltage-frequency settings on a per-application-segment basis. We consider both linear and graph-structured workflows. In order to understand the loss in energy efficiency in the face of environmental uncertainties encountered by the mobile vehicle, we also study the effect of injecting random variations in the actual runtime of individual application segments. A dynamic re-optimization of the voltage-frequency settings is required to cope with such in-field uncertainties.
Citations: 7
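The static per-segment voltage-frequency assignment described in the abstract above can be sketched for a linear workflow as an exhaustive search. Everything specific here is an assumption for illustration: energy is modeled as cycles times V squared, time as cycles divided by frequency, the resilience limits are reduced to a simple allowed voltage window (v_min for soft errors, v_max for hard failures), and the function name is hypothetical. The paper's actual models are not given in the abstract.

```python
from itertools import product


def best_assignment(cycles, levels, deadline, v_min, v_max):
    """Pick one (voltage, frequency) level per segment of a linear workflow.

    cycles   : work per segment, in cycles
    levels   : available (voltage, frequency) pairs
    deadline : limit on total runtime (same time unit as 1/frequency)
    Returns (energy, choice) for the cheapest feasible choice, or None.
    """
    # Assumed resilience constraint: only voltages inside the window are usable.
    allowed = [(v, f) for v, f in levels if v_min <= v <= v_max]
    best = None
    for choice in product(allowed, repeat=len(cycles)):
        runtime = sum(c / f for c, (v, f) in zip(cycles, choice))
        if runtime > deadline:
            continue  # misses the real-time deadline
        energy = sum(c * v * v for c, (v, f) in zip(cycles, choice))
        if best is None or energy < best[0]:
            best = (energy, choice)
    return best
```

The search is exponential in the number of segments, so it only suits small workflows. Re-running it with the remaining segments and the remaining time budget is one simple way to picture the dynamic re-optimization the abstract calls for.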
ReDEEM: A heterogeneous distributed microarchitecture for energy-efficient reliability
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273530
Biruk Mammo, Ritesh Parikh, V. Bertacco
Diminishing energy-efficiency returns and decreasing transistor reliability are casting shadows on semiconductor scaling. Prior research has been addressing processors' energy-efficiency and transistor reliability as orthogonal problems. However, as embedded processors become more powerful and find their way into more diverse applications, both high reliability and energy-efficiency become critical. In this work, we propose ReDEEM, a novel approach to design energy-efficient and reliable microarchitectures. Our proposed solution composes processor pipelines at runtime from redundant but heterogeneous pipeline components. Our pipeline components are loosely coupled and the control logic is decentralized so as to enable fault isolation and thereby eliminate single points of failure. We equip the microarchitecture with the ability to adapt dynamically to varying application phases by constructing energy-efficient pipelines best suited for each phase. In addition, pipeline components have power management capabilities that allow for greater energy efficiency and flexibility. Our experimental evaluation shows that our solution offers up to 60% in energy savings and can operate about 1.8x longer, when subjected to the same fault rate as a state-of-the-art reliable microarchitecture.
Citations: 1
Modeling and power optimization of cyber-physical systems with energy-workload tradeoff
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273533
Hoeseok Yang, S. Ha
In this paper, we propose to take the relationship between delay and workload into account in the optimization of cyber-physical systems (CPSs). Since the components on the physical side continuously change their values or properties, a longer delay in the cyber part may result in a bigger workload for the next computation. We formulate this tradeoff and apply it to the power optimization of CPSs. In doing so, we examine the schedulability of the given CPS with respect to the given parameters and initial workload. Then, we propose to keep the system operating in the stable state with the minimum scaling factor and prove that this is better than any alternating sequence. We verify the validity of the proposed delay-workload model by measuring the execution delay of real-life examples. The effectiveness of the proposed power optimization policy is demonstrated with simulation results.
Citations: 3
Interconnect synthesis of heterogeneous accelerators in a shared memory architecture
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273540
Yu-Ting Chen, J. Cong
An accelerator-rich architecture (ARA) is composed of heterogeneous accelerators with an on-chip memory system. Compared to the general-purpose processors, an accelerator demands short and predictable latency to its local on-chip memory to satisfy its performance target. Moreover, an accelerator requires a much higher off-chip memory bandwidth than a CPU since it consumes much more data in a given time period. Therefore, a customized on-chip memory system design is one of the keys to an efficient ARA. In this work we provide a two-layer interconnect synthesis method. We first provide an optimal layer of partial crossbar that connects the heterogeneous accelerators and shared memory banks with a minimum number of switches. The second layer of interconnect tries to interleave possible conflicting long-burst memory requests for prefetching data from off-chip memory. The experimental results show that we can reduce more than 45% of the switches of the partial crossbar compared to the best known method. This further leads to 53% reduction of LUTs and 34% reduction of slice utilization on a 30-accelerator FPGA prototype. Furthermore, the performance of an ARA can be improved by 36% - 52% with a well-designed interleaved network in a real ARA prototype for medical imaging applications. This prototype also shows a 7.44x energy efficiency gain over the state-of-the-art Xeon processors.
Citations: 5
FreqLeak: A frequency step based method for efficient leakage power characterization in a system
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273513
Arun Joseph, A. Haridass, C. Lefurgy, Sreekanth Pai, Spandana Rachamalla, Francesco A. Campisano
Accurate estimation of leakage power at runtime requires post-silicon power measurements across a wide range of temperature and voltage conditions. Testing individual chips, especially at high-temperature corner conditions, is expensive in cost and time. We examine this problem in an industrial context and introduce FreqLeak, a frequency step based method for inexpensive and efficient leakage power characterization in a system. It enables a more thorough characterization than can be accomplished on a wafer prober alone due to time and equipment costs. Experimental evaluation on IBM POWER8 based systems demonstrates the efficiency of the proposed method, within an error of 5%. Further, we discuss the application of FreqLeak in system level power management.
Citations: 0
Design and analysis of 6-T 2-MTJ ternary Content Addressable Memory
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273532
Rekha Govindaraj, Swaroop Ghosh
Content Addressable Memory (CAM) is widely used in pattern matching, internet data processing and many other fields where searching a specific pattern of data is a major operation. Conventional CAMs suffer from area, power, and speed limitations. We propose a magnetic tunnel junction (MTJ) based Ternary CAM (TCAM). The proposed TCAM cell is 127 percent (33 percent) area efficient compared to conventional CMOS TCAM (spintronic TCAMs). We analyzed sense margin of the proposed TCAM with respect to 16, 32, 64, 128 and 256-bit words sizes in 22nm predictive technology. Simulations indicated reliable sense margin of 50mV even at 0.7V supply voltage. The worst case sense delay and sense margin of 256-bit TCAM is found to be 263ps and 220mV respectively at 1V supply voltage. The average search power consumed is 13mW and the search energy is 4.7fJ per bit search. The write time is 4ns and the write energy is 0.69pJ per bit.
Citations: 6
The digital bidirectional function as a hardware security primitive: Architecture and applications
Pub Date : 2015-07-22 DOI: 10.1109/ISLPED.2015.7273536
T. Xu, M. Potkonjak
Security and low power have emerged as two essential requirements of modern design. In this paper, we propose a new hardware security primitive, the digital bidirectional function (DBF), designed on FPGA to meet both criteria. The DBF has two forms of functions and implements two mappings in opposite directions. The DBF can be easily implemented using hierarchical lookup-table (LUT) structures with low delay and power overhead. In terms of applications, we demonstrate how the DBF is applied in a secure message transfer protocol and compare its power/bandwidth consumption with other cryptographic approaches. Our results indicate that the DBF outperforms traditional ciphers in energy consumption by two to three orders of magnitude on average.
Citations: 6