
Latest publications from the 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)

Spare register aware prefetching for graph algorithms on GPUs
Nagesh B. Lakshminarayana, Hyesoon Kim
More and more graph algorithms are being GPU-enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive, with many irregular, data-dependent memory accesses. Due to these factors, graph algorithms on GPUs have low execution efficiency. In this work we propose a mechanism to improve the execution efficiency of graph algorithms by improving their memory access latency tolerance. We propose a mechanism for prefetching data for load pairs in which one load depends on the other; such pairs are common in graph algorithms. Our mechanism detects the target loads in hardware and injects instructions into the pipeline to prefetch data into spare registers that are not being used by any active threads. By prefetching data into registers, early eviction of prefetched data can be eliminated. We also propose a mechanism that uses the compiler to identify the target loads. Our mechanism improves performance over no prefetching by 10% on average and up to 51% for nine memory-intensive graph algorithm kernels.
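The mechanism hinges on detecting dependent load pairs (a load whose address operand was produced by an earlier load) and prefetching for the second load into a register no active thread is using. Below is a minimal Python sketch of that detect-and-inject step, assuming a simplified decoded-instruction trace; the names (DependentLoadTable, spare_regs) and the trace format are illustrative assumptions, not the paper's hardware design.

```python
# Minimal model of dependent-load-pair detection and spare-register prefetch.
# Assumed trace format: (pc, opcode, dest_reg, src_regs) tuples; all names
# (DependentLoadTable, spare_regs, ...) are illustrative, not from the paper.

class DependentLoadTable:
    def __init__(self):
        self.producer_of = {}   # dest register -> pc of the load that wrote it

    def observe_load(self, pc, dest_reg, src_regs, spare_regs, prefetches):
        # If any address operand was produced by an earlier load, we have a
        # dependent load pair (common in graph code: idx = A[i]; val = B[idx]).
        for r in src_regs:
            if r in self.producer_of and spare_regs:
                spare = spare_regs.pop()           # claim an unused register
                prefetches.append((pc, r, spare))  # prefetch B[idx] into `spare`
                break
        self.producer_of[dest_reg] = pc            # this load now produces dest_reg

def run(trace, spare_regs):
    table, prefetches = DependentLoadTable(), []
    for pc, opcode, dest, srcs in trace:
        if opcode == "load":
            table.observe_load(pc, dest, srcs, spare_regs, prefetches)
    return prefetches

if __name__ == "__main__":
    # r1 = load A[r0]; r2 = load B[r1]  -> the second load depends on the first.
    trace = [(0x10, "load", "r1", ["r0"]), (0x18, "load", "r2", ["r1"])]
    print(run(trace, spare_regs=["p7", "p8"]))     # -> [(24, 'r1', 'p8')]
```

In real hardware the dependence would be tracked per PC and the prefetch would be injected as an instruction into the pipeline, as the abstract describes; the dependence test itself is the same.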
{"title":"Spare register aware prefetching for graph algorithms on GPUs","authors":"Nagesh B. Lakshminarayana, Hyesoon Kim","doi":"10.1109/HPCA.2014.6835970","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835970","url":null,"abstract":"More and more graph algorithms are being GPU enabled. Graph algorithm implementations on GPUs have irregular control flow and are memory-intensive with many irregular/data-dependent memory accesses. Due to these factors graph algorithms on GPUs have low execution efficiency. In this work we propose a mechanism to improve the execution efficiency of graph algorithms by improving their memory access latency tolerance. We propose a mechanism for prefetching data for load pairs that have one load dependent on the other - such pairs are common in graph algorithms. Our mechanism detects the target loads in hardware and injects instructions into the pipeline to prefetch data into spare registers that are not being used by any active threads. By prefetching data into registers, early eviction of prefetched data can be eliminated. We also propose a mechanism that uses the compiler to identify the target loads. Our mechanism improves performance over no prefetching by 10% on average and upto 51% for nine memory intensive graph algorithm kernels.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"19 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133436668","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 44
Undersubscribed threading on clustered cache architectures
W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout
Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low-latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST) which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.
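At its core, CRUST must decide how many threads to keep active so that the per-cluster working set fits the clustered cache and the aggregate off-chip bandwidth demand stays within what the chip provides. The sketch below illustrates one such feasibility check in Python; the even spreading of threads across clusters, the "largest feasible count" policy, and all numeric parameters are assumptions for illustration, not the published CRUST heuristic.

```python
# Illustrative thread-count selection in the spirit of CRUST: activate only as
# many threads as the clustered cache capacity and off-chip bandwidth allow.
# All parameter values below are assumed examples, not taken from the paper.

def pick_active_threads(max_threads, threads_per_cluster, per_thread_ws_kb,
                        cluster_cache_kb, per_thread_bw_gbs, offchip_bw_gbs):
    num_clusters = max_threads // threads_per_cluster
    best = 1
    for n in range(1, max_threads + 1):
        per_cluster = -(-n // num_clusters)            # threads spread evenly (ceil)
        fits_cache = per_thread_ws_kb * per_cluster <= cluster_cache_kb
        fits_bw = per_thread_bw_gbs * n <= offchip_bw_gbs
        if fits_cache and fits_bw:
            best = n                                    # keep the largest feasible count
    return best

if __name__ == "__main__":
    # 64 hardware threads, 8 threads share each 512 KB clustered cache.
    n = pick_active_threads(max_threads=64, threads_per_cluster=8,
                            per_thread_ws_kb=96, cluster_cache_kb=512,
                            per_thread_bw_gbs=1.5, offchip_bw_gbs=60.0)
    print(f"{n} of 64 threads kept active")             # undersubscription: 40 of 64
```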
{"title":"Undersubscribed threading on clustered cache architectures","authors":"W. Heirman, Trevor E. Carlson, K. V. Craeynest, I. Hur, A. Jaleel, L. Eeckhout","doi":"10.1109/HPCA.2014.6835975","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835975","url":null,"abstract":"Recent many-core processors such as Intel's Xeon Phi and GPGPUs specialize in running highly scalable parallel applications at high performance while simultaneously embracing energy efficiency as a first-order design constraint. The traditional belief is that full utilization of all available cores also translates into the highest possible performance. In this paper, we study the effects of cache capacity conflicts and competition for shared off-chip bandwidth; and show that undersubscription, or not utilizing all cores, often yields significant increases in both performance and energy efficiency. Based on a detailed shared working set analysis we make the case for clustered cache architectures as an efficient design point for exploiting both data sharing and undersubscription, while providing low-latency and ease of implementation in many-core processors. We then propose ClusteR-aware Undersubscribed Scheduling of Threads (CRUST) which dynamically matches an application's working set size and off-chip bandwidth demands with the available on-chip cache capacity and off-chip bandwidth. CRUST improves application performance and energy efficiency by 15% on average, and up to 50%, for the NPB and SPEC OMP benchmarks. In addition, we make recommendations for the design of future many-core architectures, and show that taking the undersubscription usage model into account moves the optimum performance under the cores-versus-cache area tradeoff towards design points with more cores and less cache.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127399484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 25
Scalably verifiable dynamic power management
Opeoluwa Matthews, Meng Zhang, Daniel J. Sorin
Dynamic power management (DPM) is critical to maximizing the performance of systems ranging from multicore processors to datacenters. However, one formidable challenge with DPM schemes is verifying that they remain correct as the number of computational resources scales up. In this paper, we develop a DPM scheme that is scalably verifiable with fully automated formal tools. The key to the design is that the DPM scheme has fractal behavior; that is, it behaves the same at every scale. We show that the fractal design enables scalable formal verification, and simulation shows that our scheme does not sacrifice much performance compared to an oracle DPM scheme that optimally allocates power to computational resources. We implement our scheme in a 2-socket, 16-core x86 system and experimentally evaluate it.
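The property that makes the scheme scalably verifiable is self-similarity: every controller node applies the same local policy to its children, so a formal tool only needs to verify one level. The recursive sketch below illustrates that fractal structure; the even-split, clamp-to-demand policy is a placeholder assumption, not the paper's actual allocation rules.

```python
# Fractal power allocation: every node applies the same local policy to its
# children, so the controller looks identical at every scale of the hierarchy.
# The even-split, clamp-to-demand policy is an illustrative stand-in only.

class PowerNode:
    def __init__(self, name, children=None, demand_w=0.0):
        self.name = name
        self.children = children or []
        self.demand_w = demand_w              # leaves: power the core actually wants

    def allocate(self, budget_w, grants):
        if not self.children:                 # leaf: take the grant, capped by demand
            grants[self.name] = min(budget_w, self.demand_w)
            return
        share = budget_w / len(self.children)
        for child in self.children:           # identical policy, one level down
            child.allocate(share, grants)

if __name__ == "__main__":
    # Two sockets of two cores each, under a 40 W chip-level budget.
    chip = PowerNode("chip", [
        PowerNode("s0", [PowerNode("s0c0", demand_w=12), PowerNode("s0c1", demand_w=5)]),
        PowerNode("s1", [PowerNode("s1c0", demand_w=15), PowerNode("s1c1", demand_w=9)]),
    ])
    grants = {}
    chip.allocate(40.0, grants)
    print(grants)                             # each core gets min(10 W, its own demand)
```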
{"title":"Scalably verifiable dynamic power management","authors":"Opeoluwa Matthews, Meng Zhang, Daniel J. Sorin","doi":"10.1109/HPCA.2014.6835967","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835967","url":null,"abstract":"Dynamic power management (DPM) is critical to maximizing the performance of systems ranging from multicore processors to datacenters. However, one formidable challenge with DPM schemes is verifying that the DPM schemes are correct as the number of computational resources scales up. In this paper, we develop a DPM scheme such that it is scalably verifiable with fully automated formal tools. The key to the design is that the DPM scheme has fractal behavior; that is, it behaves the same at every scale. We show that the fractal design enables scalable formal verification and simulation shows that our scheme does not sacrifice much performance compared to an oracle DPM scheme that optimally allocates power to computational resources. We implement our scheme in a 2-socket 16-core x86 system and experimentally evaluate it.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115278189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs
Christopher W. Fletcher, Ling Ren, Xiangyao Yu, Marten van Dijk, O. Khan, S. Devadas
Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern from an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus. A serious problem in current secure processor ORAM proposals is that they don't obfuscate when ORAM accesses are made, or do so in a very conservative manner. Since secure processors make ORAM accesses on last-level cache misses, ORAM access timing strongly correlates to program access pattern (e.g., locality). This brings ORAM's purpose in secure processors into question. This paper makes two contributions. First, we show how a secure processor can bound ORAM timing channel leakage to a user-controllable leakage limit. The secure processor is allowed to dynamically optimize ORAM access rate for power/performance, subject to the constraint that the leakage limit is not violated. Second, we show how changing the leakage limit impacts program efficiency. We present a dynamic scheme that leaks at most 32 bits through the ORAM timing channel and introduces only 20% performance overhead and 12% power overhead relative to a baseline ORAM that has no timing channel protection. By reducing leakage to 16 bits, our scheme degrades in performance by 5% but gains in power efficiency by 3%. We show that a static (zero leakage) scheme imposes a 34% power overhead for equivalent performance (or a 30% performance overhead for equivalent power) relative to our dynamic scheme.
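The idea of a user-controllable leakage limit can be pictured as a budget the memory controller spends whenever it makes a publicly observable choice of ORAM access rate; once the budget is gone, the rate freezes. The Python sketch below models only that accounting. The rate set, epoch granularity, queue-depth heuristic, and the log2-per-decision charge are illustrative assumptions; the paper's actual mechanism and leakage bound are defined differently.

```python
# Toy model of budgeted rate adaptation: the ORAM controller may switch among a
# small set of access intervals, each public switch decision is charged
# log2(#choices) bits, and adaptation stops once the leakage budget is spent.
# Rates, epoch length, and the per-decision charge are illustrative assumptions.
import math

RATES_CYCLES = [50, 100, 200, 400]        # allowed intervals between ORAM accesses

class RateController:
    def __init__(self, leakage_budget_bits):
        self.budget = leakage_budget_bits
        self.interval = RATES_CYCLES[-1]  # start at the most conservative rate

    def end_of_epoch(self, llc_misses_queued):
        cost = math.log2(len(RATES_CYCLES))            # bits charged per public choice
        if self.budget < cost:
            return self.interval                       # budget spent: rate is now static
        self.budget -= cost
        # Pick a faster rate under pressure, a slower one when the queue is empty.
        idx = RATES_CYCLES.index(self.interval)
        if llc_misses_queued > 8 and idx > 0:
            self.interval = RATES_CYCLES[idx - 1]
        elif llc_misses_queued == 0 and idx < len(RATES_CYCLES) - 1:
            self.interval = RATES_CYCLES[idx + 1]
        return self.interval

if __name__ == "__main__":
    ctrl = RateController(leakage_budget_bits=32)
    for queue_depth in [20, 20, 0, 15]:                # one value per epoch
        print(ctrl.end_of_epoch(queue_depth), ctrl.budget)
```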
{"title":"Suppressing the Oblivious RAM timing channel while making information leakage and program efficiency trade-offs","authors":"Christopher W. Fletcher, Ling Ren, Xiangyao Yu, Marten van Dijk, O. Khan, S. Devadas","doi":"10.1109/HPCA.2014.6835932","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835932","url":null,"abstract":"Oblivious RAM (ORAM) is an established cryptographic technique to hide a program's address pattern to an untrusted storage system. More recently, ORAM schemes have been proposed to replace conventional memory controllers in secure processor settings to protect against information leakage in external memory and the processor I/O bus. A serious problem in current secure processor ORAM proposals is that they don't obfuscate when ORAM accesses are made, or do so in a very conservative manner. Since secure processors make ORAM accesses on last-level cache misses, ORAM access timing strongly correlates to program access pattern (e.g., locality). This brings ORAM's purpose in secure processors into question. This paper makes two contributions. First, we show how a secure processor can bound ORAM timing channel leakage to a user-controllable leakage limit. The secure processor is allowed to dynamically optimize ORAM access rate for power/performance, subject to the constraint that the leakage limit is not violated. Second, we show how changing the leakage limit impacts program efficiency. We present a dynamic scheme that leaks at most 32 bits through the ORAM timing channel and introduces only 20% performance overhead and 12% power overhead relative to a baseline ORAM that has no timing channel protection. By reducing leakage to 16 bits, our scheme degrades in performance by 5% but gains in power efficiency by 3%. We show that a static (zero leakage) scheme imposes a 34% power overhead for equivalent performance (or a 30% performance overhead for equivalent power) relative to our dynamic scheme.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126541571","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 97
BigDataBench: A big data benchmark suite from internet services
Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, Bizhu Qiu
As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite-BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks from dimensions of application scenarios, operations/ algorithms, data types, data sources, software stacks, and application types, and they are comprehensive for fairly measuring and evaluating big data systems and architecture. BigDataBench is publicly available from the project home page http://prof.ict.ac.cn/BigDataBench. Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity, which measures the ratio of the total number of instructions divided by the total byte number of memory accesses; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache (L1I) misses per 1000 instructions (in short, MPKI) of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.
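The operation-intensity metric in the abstract is simply total instructions divided by total bytes of memory accesses, which is why memory-bound big data workloads score so low relative to PARSEC, HPCC, or SPEC CPU. A small Python sketch of the ratio follows; the counter values are made-up examples, not measurements from BigDataBench.

```python
# Operation intensity as defined in the abstract: total instructions divided by
# total bytes of memory accesses. The counter values below are made-up examples,
# not measurements reported by BigDataBench.

def operation_intensity(total_instructions, total_memory_bytes):
    return total_instructions / total_memory_bytes

if __name__ == "__main__":
    # e.g. counters read from performance-monitoring hardware over one run
    big_data_like = operation_intensity(total_instructions=8.0e11,
                                        total_memory_bytes=4.0e11)
    compute_like = operation_intensity(total_instructions=8.0e11,
                                       total_memory_bytes=4.0e10)
    print(f"big-data-like: {big_data_like:.1f} instructions/byte")   # 2.0
    print(f"compute-like:  {compute_like:.1f} instructions/byte")    # 20.0
```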
{"title":"BigDataBench: A big data benchmark suite from internet services","authors":"Lei Wang, Jianfeng Zhan, Chunjie Luo, Yuqing Zhu, Qiang Yang, Yongqiang He, Wanling Gao, Zhen Jia, Yingjie Shi, Shujie Zhang, Chen Zheng, Gang Lu, Kent Zhan, Xiaona Li, Bizhu Qiu","doi":"10.1109/HPCA.2014.6835958","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835958","url":null,"abstract":"As architecture, systems, and data management communities pay greater attention to innovative big data systems and architecture, the pressure of benchmarking and evaluating these systems rises. However, the complexity, diversity, frequently changed workloads, and rapid evolution of big data systems raise great challenges in big data benchmarking. Considering the broad use of big data systems, for the sake of fairness, big data benchmarks must include diversity of data and workloads, which is the prerequisite for evaluating big data systems and architecture. Most of the state-of-the-art big data benchmarking efforts target evaluating specific types of applications or system software stacks, and hence they are not qualified for serving the purposes mentioned above. This paper presents our joint research efforts on this issue with several industrial partners. Our big data benchmark suite-BigDataBench not only covers broad application scenarios, but also includes diverse and representative data sets. Currently, we choose 19 big data benchmarks from dimensions of application scenarios, operations/ algorithms, data types, data sources, software stacks, and application types, and they are comprehensive for fairly measuring and evaluating big data systems and architecture. BigDataBench is publicly available from the project home page http://prof.ict.ac.cn/BigDataBench. Also, we comprehensively characterize 19 big data workloads included in BigDataBench with varying data inputs. On a typical state-of-practice processor, Intel Xeon E5645, we have the following observations: First, in comparison with the traditional benchmarks: including PARSEC, HPCC, and SPECCPU, big data applications have very low operation intensity, which measures the ratio of the total number of instructions divided by the total byte number of memory accesses; Second, the volume of data input has non-negligible impact on micro-architecture characteristics, which may impose challenges for simulation-based big data architecture research; Last but not least, corroborating the observations in CloudSuite and DCBench (which use smaller data inputs), we find that the numbers of L1 instruction cache (L1I) misses per 1000 instructions (in short, MPKI) of the big data applications are higher than in the traditional benchmarks; also, we find that L3 caches are effective for the big data applications, corroborating the observation in DCBench.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"22 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-01-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115565101","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 566
Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks
Amin Ansari, Asit K. Mishra, Jianping Xu, J. Torrellas
On-chip networks are especially vulnerable to within-die parameter variations. Since they connect distant parts of the chip, they need to be designed to work under the most unfavorable parameter values in the chip. This results in energy-inefficient designs. To improve the energy efficiency of on-chip networks, this paper presents a novel approach that relies on monitoring the errors of messages as they traverse the network. Based on the observed errors of messages, the system dynamically decreases or increases the voltage (Vdd) of groups of network routers. With this approach, called Tangle, the different Vdd values applied to different groups of network routers progressively converge to their lowest, variation-aware, error-free values - always keeping the network frequency unchanged. This saves substantial network energy. In a simulated 64-router network with 4 Vdd domains, Tangle reduces the network energy consumption by an average of 22% with negligible performance impact. In a future network design with one Vdd domain per router, Tangle lowers the network Vdd by an average of 21%, reducing the network energy consumption by an average of 28% with negligible performance impact.
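Tangle's per-group control loop can be summarized as: keep frequency fixed, lower a router group's Vdd while its messages traverse error-free, and raise it as soon as the error-monitoring logic reports failures, so each group settles at its own variation-aware minimum. The sketch below is a toy version of that loop; the voltage range, step size, and interval-based policy are assumed values, not the paper's calibrated parameters.

```python
# Minimal per-group voltage controller in the spirit of Tangle: lower Vdd while
# messages traverse error-free, raise it as soon as errors are observed, and
# never touch frequency. Voltage range, step size, and thresholds are assumed.

VDD_MIN, VDD_MAX, STEP = 0.70, 1.00, 0.01   # volts (illustrative values)

class RouterGroupVddController:
    def __init__(self):
        self.vdd = VDD_MAX                  # start at the worst-case design voltage

    def end_of_interval(self, message_errors):
        if message_errors > 0:
            # Errors were detected on traversing messages: back off one step.
            self.vdd = min(VDD_MAX, self.vdd + STEP)
        else:
            # Error-free interval: probe a lower, variation-aware operating point.
            self.vdd = max(VDD_MIN, self.vdd - STEP)
        return self.vdd

if __name__ == "__main__":
    ctrl = RouterGroupVddController()
    for errs in [0, 0, 0, 0, 2, 0]:         # per-interval error counts for one group
        print(round(ctrl.end_of_interval(errs), 2))
    # prints 0.99, 0.98, 0.97, 0.96, 0.97, 0.96: the group converges near its
    # lowest error-free Vdd while the network frequency stays unchanged.
```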
{"title":"Tangle: Route-oriented dynamic voltage minimization for variation-afflicted, energy-efficient on-chip networks","authors":"Amin Ansari, Asit K. Mishra, Jianping Xu, J. Torrellas","doi":"10.1109/HPCA.2014.6835953","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835953","url":null,"abstract":"On-chip networks are especially vulnerable to within-die parameter variations. Since they connect distant parts of the chip, they need to be designed to work under the most unfavorable parameter values in the chip. This results in energy-inefficient designs. To improve the energy efficiency of on-chip networks, this paper presents a novel approach that relies on monitoring the errors of messages as they traverse the network. Based on the observed errors of messages, the system dynamically decreases or increases the voltage (Vdd) of groups of network routers. With this approach, called Tangle, the different Vdd values applied to different groups of network routers progressively converge to their lowest, variation-aware, error-free values - always keeping the network frequency unchanged. This saves substantial network energy. In a simulated 64-router network with 4 Vdd domains, Tangle reduces the network energy consumption by an average of 22% with negligible performance impact. In a future network design with one Vdd domain per router, Tangle lowers the network Vdd by an average of 21%, reducing the network energy consumption by an average of 28% with negligible performance impact.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"1900-01-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133898463","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 29