
Latest publications: 2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)

Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning
Mingli Xie, Dong Tong, Kan Huang, Xu Cheng
Applications running concurrently in CMP systems interfere with each other at the DRAM memory, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness, but it cannot resolve the interference problem effectively. To reduce interference, memory partitioning divides memory resources among threads. Memory channel partitioning maps the data of threads that are likely to interfere severely with each other to different channels. However, it allocates memory resources unfairly and physically exacerbates the memory contention of intensive threads, ultimately increasing the slowdown of those threads and causing high system unfairness. Bank partitioning divides memory banks among cores and eliminates interference, but previous equal bank partitioning restricts the number of banks available to an individual thread and reduces bank-level parallelism. In this paper, we first propose Dynamic Bank Partitioning (DBP), which partitions memory banks according to threads' demands for banks. DBP compensates for the reduced bank-level parallelism caused by equal bank partitioning. The key principle is to profile threads' memory characteristics at run time, estimate their demands for banks, and then use those estimates to direct the bank partitioning. Second, we observe that bank partitioning and memory scheduling are orthogonal, in the sense that the two techniques can reinforce each other when applied together. We therefore present a comprehensive approach that integrates Dynamic Bank Partitioning with Thread Cluster Memory scheduling (DBP-TCM; TCM is one of the best-performing memory scheduling policies) to further improve system performance. Experimental results show that the proposed DBP improves system performance by 4.3% and system fairness by 16% over equal bank partitioning. Compared to TCM, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. Compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our methods are effective in improving both system throughput and fairness.
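The mechanism the abstract describes, profile each thread's memory behavior at run time, estimate how many banks it needs, then partition accordingly, can be pictured with a small sketch. The sketch below is not the paper's algorithm: the profile fields, the proportional-share rule, and all constants are assumptions made purely for illustration.

```c
#include <stdio.h>

#define NUM_THREADS 4
#define NUM_BANKS   32

/* Per-thread profile collected over a sampling interval (assumed fields). */
typedef struct {
    unsigned long row_misses;   /* row-buffer misses observed */
    unsigned long accesses;     /* total memory accesses      */
} thread_profile_t;

/*
 * Estimate each thread's bank demand in proportion to its row-buffer miss
 * traffic, then hand out the NUM_BANKS banks accordingly.  Every thread
 * keeps at least one bank so it can still make progress.  This only
 * illustrates "estimate demand, then partition"; the real DBP policy in
 * the paper may differ.
 */
static void partition_banks(const thread_profile_t *prof, int *banks_out)
{
    unsigned long total_misses = 0;
    for (int t = 0; t < NUM_THREADS; t++)
        total_misses += prof[t].row_misses;

    int assigned = 0;
    for (int t = 0; t < NUM_THREADS; t++) {
        int share = 1;                               /* minimum: one bank */
        if (total_misses > 0)
            share += (int)((NUM_BANKS - NUM_THREADS) *
                           prof[t].row_misses / total_misses);
        banks_out[t] = share;
        assigned += share;
    }
    /* Give any leftover banks (from integer division) to thread 0. */
    banks_out[0] += NUM_BANKS - assigned;
}

int main(void)
{
    thread_profile_t prof[NUM_THREADS] = {
        { 9000, 20000 }, { 3000, 15000 }, { 500, 12000 }, { 100, 8000 }
    };
    int banks[NUM_THREADS];
    partition_banks(prof, banks);
    for (int t = 0; t < NUM_THREADS; t++)
        printf("thread %d: %d banks\n", t, banks[t]);
    return 0;
}
```

With these made-up profiles the miss-heavy thread receives most of the banks while low-intensity threads keep at least one each, which is the intuition behind compensating for the bank-level parallelism lost under equal partitioning.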
{"title":"Improving system throughput and fairness simultaneously in shared memory CMP systems via Dynamic Bank Partitioning","authors":"Mingli Xie, Dong Tong, Kan Huang, Xu Cheng","doi":"10.1109/HPCA.2014.6835945","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835945","url":null,"abstract":"Applications running concurrently in CMP systems interfere with each other at DRAM memory, leading to poor system performance and fairness. Memory access scheduling reorders memory requests to improve system throughput and fairness. However, it cannot resolve the interference issue effectively. To reduce interference, memory partitioning divides memory resource among threads. Memory channel partitioning maps the data of threads that are likely to severely interfere with each other to different channels. However, it allocates memory resource unfairly and physically exacerbates memory contention of intensive threads, thus ultimately resulting in the increased slowdown of these threads and high system unfairness. Bank partitioning divides memory banks among cores and eliminates interference. However, previous equal bank partitioning restricts the number of banks available to individual thread and reduces bank level parallelism. In this paper, we first propose a Dynamic Bank Partitioning (DBP), which partitions memory banks according to threads' requirements for bank amounts. DBP compensates for the reduced bank level parallelism caused by equal bank partitioning. The key principle is to profile threads' memory characteristics at run-time and estimate their demands for bank amount, then use the estimation to direct our bank partitioning. Second, we observe that bank partitioning and memory scheduling are orthogonal in the sense; both methods can be illuminated when they are applied together. Therefore, we present a comprehensive approach which integrates Dynamic Bank Partitioning and Thread Cluster Memory scheduling (DBP-TCM, TCM is one of the best memory scheduling) to further improve system performance. Experimental results show that the proposed DBP improves system performance by 4.3% and improves system fairness by 16% over equal bank partitioning. Compared to TCM, DBP-TCM improves system throughput by 6.2% and fairness by 16.7%. When compared with MCP, DBP-TCM provides 5.3% better system throughput and 37% better system fairness. We conclude that our methods are effective in improving both system throughput and fairness.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"97 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115701172","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 62
MRPB: Memory request prioritization for massively parallel processors
Wenhao Jia, K. Shaw, M. Martonosi
Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to provide smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have a mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves the caching efficiency of massively parallel workloads by applying two prioritization methods, request reordering and cache bypassing, to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.
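As a rough illustration of the two techniques the abstract names, request reordering and cache bypassing, the sketch below buffers a handful of requests, drains them grouped by cache line, and marks late arrivals for bypass once buffer occupancy crosses a threshold. The buffer size, line size, bypass threshold, and sort-by-line drain order are assumptions for the sketch, not MRPB's actual policies.

```c
#include <stdio.h>
#include <stdlib.h>

#define BUFFER_CAP     16   /* requests held before release (assumed)    */
#define LINE_BYTES     128  /* cache line size (assumed)                 */
#define BYPASS_THRESH  6    /* occupancy above which new requests bypass */

typedef struct {
    unsigned long addr;
    int bypass;             /* 1 = skip the L1 and go straight to L2/DRAM */
} mem_req_t;

/* Sort key: cache-line address, so requests that touch the same line are
 * released back-to-back and see each other's data in the cache. */
static int by_line(const void *a, const void *b)
{
    unsigned long la = ((const mem_req_t *)a)->addr / LINE_BYTES;
    unsigned long lb = ((const mem_req_t *)b)->addr / LINE_BYTES;
    return (la > lb) - (la < lb);
}

int main(void)
{
    mem_req_t buf[BUFFER_CAP];
    int n = 0;
    unsigned long trace[] = { 0x1000, 0x5000, 0x1040, 0x9000, 0x1080,
                              0x5040, 0x2000, 0x9040 };

    for (size_t i = 0; i < sizeof trace / sizeof trace[0]; i++) {
        mem_req_t r = { trace[i], 0 };
        /* Cache bypassing: once the buffer is heavily occupied, the cache
         * is assumed to be thrashing, so further requests skip it. */
        if (n >= BYPASS_THRESH)
            r.bypass = 1;
        buf[n++] = r;
    }

    /* Request reordering: drain in a cache-friendlier (line-grouped) order. */
    qsort(buf, n, sizeof buf[0], by_line);
    for (int i = 0; i < n; i++)
        printf("issue 0x%lx%s\n", buf[i].addr, buf[i].bypass ? " (bypass)" : "");
    return 0;
}
```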
{"title":"MRPB: Memory request prioritization for massively parallel processors","authors":"Wenhao Jia, K. Shaw, M. Martonosi","doi":"10.1109/HPCA.2014.6835938","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835938","url":null,"abstract":"Massively parallel, throughput-oriented systems such as graphics processing units (GPUs) offer high performance for a broad range of programs. They are, however, complex to program, especially because of their intricate memory hierarchies with multiple address spaces. In response, modern GPUs have widely adopted caches, hoping to providing smoother reductions in memory access traffic and latency. Unfortunately, GPU caches often have mixed or unpredictable performance impact due to cache contention that results from the high thread counts in GPUs. We propose the memory request prioritization buffer (MRPB) to ease GPU programming and improve GPU performance. This hardware structure improves caching efficiency of massively parallel workloads by applying two prioritization methods-request reordering and cache bypassing-to memory requests before they access a cache. MRPB then releases requests into the cache in a more cache-friendly order. The result is drastically reduced cache contention and improved use of the limited per-thread cache capacity. For a simulated 16KB L1 cache, MRPB improves the average performance of the entire PolyBench and Rodinia suites by 2.65× and 1.27× respectively, outperforming a state-of-the-art GPU cache management technique.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124982692","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 155
Increasing TLB reach by exploiting clustering in page translations
B. Pham, A. Bhattacharjee, Yasuko Eckert, G. Loh
The steadily increasing size of main memory requires corresponding increases in the processor's translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a smaller TLB, but most OSs and applications do not fully utilize the large-page support in current hardware. Recent work has shown that, while not guaranteed, some virtual-to-physical page mappings exhibit “contiguous” spatial locality in which consecutive virtual pages map to consecutive physical pages. Such locality provides opportunities to coalesce “adjacent” TLB entries for increased reach. We observe that beyond simple adjacent-entry coalescing, many more translations exhibit “clustered” spatial locality in which a group or cluster of nearby virtual pages maps to a similarly clustered set of physical pages. In this work, we provide a detailed characterization of the spatial locality among virtual-to-physical translations. Based on this characterization, we present a multi-granular TLB organization that significantly increases effective TLB reach and substantially reduces miss rates while requiring no additional OS support. Our evaluation shows that the multi-granular design outperforms conventional TLBs and the recently proposed coalesced TLB technique.
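The abstract's “clustered” spatial locality can be pictured as a simple test over a group of page translations: if every virtual page in an aligned cluster maps into one aligned cluster of physical pages, a single multi-granular entry can cover the group by storing the physical cluster number plus small per-page offsets. The cluster size and the check below are illustrative assumptions, not the paper's hardware design.

```c
#include <stdio.h>

#define CLUSTER_SPAN 8   /* pages covered by one coalesced entry (assumed) */

/* "Clustered" locality test: the translations of an aligned group of
 * CLUSTER_SPAN virtual pages all land in one aligned group of
 * CLUSTER_SPAN physical pages.  Note the pages need not be contiguous
 * inside the cluster, which is what distinguishes clustered locality
 * from the stricter contiguous case. */
static int cluster_coalescable(const unsigned long pfn[CLUSTER_SPAN])
{
    unsigned long cluster = pfn[0] / CLUSTER_SPAN;
    for (int i = 1; i < CLUSTER_SPAN; i++)
        if (pfn[i] / CLUSTER_SPAN != cluster)
            return 0;
    return 1;
}

int main(void)
{
    /* Physical frames for virtual pages 0x400..0x407 (made-up mappings). */
    unsigned long clustered[CLUSTER_SPAN] = { 0x983, 0x980, 0x981, 0x986,
                                              0x984, 0x985, 0x982, 0x987 };
    unsigned long scattered[CLUSTER_SPAN] = { 0x980, 0x233, 0x982, 0x911,
                                              0x984, 0x985, 0x120, 0x987 };
    printf("clustered group coalescable: %d\n", cluster_coalescable(clustered));
    printf("scattered group coalescable: %d\n", cluster_coalescable(scattered));
    return 0;
}
```

In the coalescable case one entry replaces up to eight conventional TLB entries, which is the source of the increased reach the abstract refers to.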
{"title":"Increasing TLB reach by exploiting clustering in page translations","authors":"B. Pham, A. Bhattacharjee, Yasuko Eckert, G. Loh","doi":"10.1109/HPCA.2014.6835964","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835964","url":null,"abstract":"The steadily increasing sizes of main memory capacities require corresponding increases in the processor's translation lookaside buffer (TLB) resources to avoid performance bottlenecks. Large operating system page sizes can mitigate the bottleneck with a smaller TLB, but most OSs and applications do not fully utilize the large-page support in current hardware. Recent work has shown that, while not guaranteed, some virtual-to-physical page mappings exhibit “contiguous” spatial locality in which consecutive virtual pages map to consecutive physical pages. Such locality provides opportunities to coalesce “adjacent” TLB entries for increased reach. We observe that beyond simple adjacent-entry coalescing, many more translations exhibit “clustered” spatial locality in which a group or cluster of nearby virtual pages map to a similarly clustered set of physical pages. In this work, we provide a detailed characterization of the spatial locality among the virtual-to-physical translations. Based on this characterization, we present a multi-granular TLB organization that significantly increases its effective reach and reduces miss rates substantially while requiring no additional OS support. Our evaluation shows that the multi-granular design outperforms conventional TLBs and the recently proposed coalesced TLBs technique.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"473 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127549922","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 132
Precision-aware soft error protection for GPUs
David J. Palframan, N. Kim, Mikko H. Lipasti
With the advent of general-purpose GPU computing, it is becoming increasingly desirable to protect GPUs from soft errors. For high computation throughput, GPUs must store a significant amount of state and have many execution units. The high power and area costs of full protection from soft errors make selective protection techniques attractive. Such approaches provide maximum error coverage within a fixed area or power limit, but typically treat all errors equally. We observe that for many floating-point-intensive GPGPU applications, small-magnitude errors may have little effect on results, while large-magnitude errors can be amplified to have a significant negative impact. We therefore propose a novel precision-aware protection approach for the GPU execution logic and register file to mitigate large-magnitude errors. We also propose an architecture modification to optimize error coverage for integer computations. Our approach combines selective logic hardening, targeted checker circuits, and intelligent register file encoding for best error protection. We demonstrate that our approach can reduce the mean error magnitude by up to 87% compared to a traditional selective protection approach with the same overhead.
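The observation that error magnitude, not just error presence, matters for floating-point data can be reproduced in a few lines: flipping a low mantissa bit barely changes a value, while flipping an exponent bit changes it by orders of magnitude. The snippet below only demonstrates that observation; it says nothing about the paper's hardening or checker circuits.

```c
/* Build: gcc -O2 err_sketch.c -lm */
#include <math.h>
#include <stdint.h>
#include <stdio.h>
#include <string.h>

/* Flip one bit of an IEEE-754 single and return the corrupted value. */
static float flip_bit(float x, int bit)
{
    uint32_t u;
    memcpy(&u, &x, sizeof u);
    u ^= (uint32_t)1 << bit;
    memcpy(&x, &u, sizeof x);
    return x;
}

int main(void)
{
    float x = 3.14159f;
    /* Bits 0..22 are the mantissa, 23..30 the exponent, 31 the sign. */
    int bits[] = { 2, 12, 22, 26, 30 };
    for (size_t i = 0; i < sizeof bits / sizeof bits[0]; i++) {
        float bad = flip_bit(x, bits[i]);
        printf("bit %2d flipped: value %.6g, relative error %.3g\n",
               bits[i], bad, fabsf(bad - x) / fabsf(x));
    }
    return 0;
}
```

This asymmetry is the rationale for spending a fixed protection budget on the sign, exponent, and high-order mantissa bits rather than protecting all bits equally.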
{"title":"Precision-aware soft error protection for GPUs","authors":"David J. Palframan, N. Kim, Mikko H. Lipasti","doi":"10.1109/HPCA.2014.6835966","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835966","url":null,"abstract":"With the advent of general-purpose GPU computing, it is becoming increasingly desirable to protect GPUs from soft errors. For high computation throughout, GPUs must store a significant amount of state and have many execution units. The high power and area costs of full protection from soft errors make selective protection techniques attractive. Such approaches provide maximum error coverage within a fixed area or power limit, but typically treat all errors equally. We observe that for many floating-point-intensive GPGPU applications, small magnitude errors may have little effect on results, while large magnitude errors can be amplified to have a significant negative impact. We therefore propose a novel precision-aware protection approach for the GPU execution logic and register file to mitigate large magnitude errors. We also propose an architecture modification to optimize error coverage for integer computations. Our approach combines selective logic hardening, targeted checker circuits, and intelligent register file encoding for best error protection. We demonstrate that our approach can reduce the mean error magnitude by up to 87% compared to a traditional selective protection approach with the same overhead.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"153 2-4","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-06-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114037896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 27
Improving in-memory database index performance with Intel® Transactional Synchronization Extensions
Tomas Karnagel, R. Dementiev, Ravi Rajwar, K. Lai, T. Legler, B. Schlegel, Wolfgang Lehner
The increasing number of cores in every processor generation poses challenges for high-performance in-memory database systems. While these systems use sophisticated high-level algorithms to partition a query or run multiple queries in parallel, they also utilize low-level synchronization mechanisms to synchronize access to internal database data structures. Developers often spend significant development and verification effort to improve concurrency in the presence of such synchronization. The Intel® Transactional Synchronization Extensions (Intel® TSX) in the 4th Generation Core™ Processors enable hardware to dynamically determine whether threads actually need to synchronize, even in the presence of conservatively used synchronization. This paper evaluates the effectiveness of such hardware support in a commercial database. We focus on two index implementations: the B+Tree Index and the Delta Storage Index used in the SAP HANA® database system. We demonstrate that such support can improve the performance of database data structures such as index trees, and that it presents a compelling opportunity for the development of simpler, scalable, and easy-to-verify algorithms.
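A common way to apply Intel TSX to an index structure is transactional lock elision: attempt the update inside a hardware transaction and fall back to an ordinary lock after repeated aborts. The sketch below shows that generic RTM pattern in C (compile with -mrtm on a TSX-capable part); the btree_insert_unlocked stub, the retry count, and the table are placeholders, and this is not the code used inside SAP HANA.

```c
/* Build: gcc -O2 -mrtm -std=c11 tsx_sketch.c */
#include <immintrin.h>
#include <stdatomic.h>
#include <stdio.h>

static atomic_int fallback_lock;        /* 0 = free, 1 = held */
static long table[64];                  /* stand-in for a real B+Tree */

/* Hypothetical helper: the actual index update, with no locking of its own. */
static void btree_insert_unlocked(long key, long value)
{
    table[key & 63] = value;
}

/* Generic RTM lock-elision pattern: run the insert as a hardware
 * transaction.  Reading the fallback lock inside the transaction places
 * it in the read set, so a thread that later takes the lock aborts any
 * concurrently elided updates.  After a few aborts, take the lock for real. */
static void btree_insert(long key, long value)
{
    for (int attempt = 0; attempt < 3; attempt++) {
        unsigned status = _xbegin();
        if (status == _XBEGIN_STARTED) {
            if (atomic_load(&fallback_lock) != 0)
                _xabort(0xff);          /* lock held: do not race with it */
            btree_insert_unlocked(key, value);
            _xend();
            return;
        }
        /* status encodes the abort cause; a plain retry is enough here. */
    }
    while (atomic_exchange(&fallback_lock, 1) != 0)
        ;                               /* spin on the fallback lock */
    btree_insert_unlocked(key, value);
    atomic_store(&fallback_lock, 0);
}

int main(void)
{
    btree_insert(7, 42);
    printf("table[7] = %ld\n", table[7]);
    return 0;
}
```

When conflicts are rare, concurrent inserts commit without ever writing the lock word, which is how hardware decides that the threads "did not actually need to synchronize".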
{"title":"Improving in-memory database index performance with Intel® Transactional Synchronization Extensions","authors":"Tomas Karnagel, R. Dementiev, Ravi Rajwar, K. Lai, T. Legler, B. Schlegel, Wolfgang Lehner","doi":"10.1109/HPCA.2014.6835957","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835957","url":null,"abstract":"The increasing number of cores every generation poses challenges for high-performance in-memory database systems. While these systems use sophisticated high-level algorithms to partition a query or run multiple queries in parallel, they also utilize low-level synchronization mechanisms to synchronize access to internal database data structures. Developers often spend significant development and verification effort to improve concurrency in the presence of such synchronization. The Intel® Transactional Synchronization Extensions (Intel® TSX) in the 4th Generation Core™ Processors enable hardware to dynamically determine whether threads actually need to synchronize even in the presence of conservatively used synchronization. This paper evaluates the effectiveness of such hardware support in a commercial database. We focus on two index implementations: a B+Tree Index and the Delta Storage Index used in the SAP HANA® database system. We demonstrate that such support can improve performance of database data structures such as index trees and presents a compelling opportunity for the development of simpler, scalable, and easy-to-verify algorithms.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125598846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 63
Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules
Aditya Agrawal, Amin Ansari, J. Torrellas
eDRAM cells require periodic refresh, which ends up consuming substantial energy in large last-level caches. In practice, it is well known that different eDRAM cells can exhibit very different charge-retention properties. Unfortunately, current systems pessimistically assume worst-case retention times and end up refreshing all the cells at a conservatively high rate. In this paper, we propose an alternative approach. We use known facts about the factors that determine the retention properties of cells to build a new model of eDRAM retention times, called Mosaic. The model shows that the retention times of cells in large eDRAM modules exhibit spatial correlation. Therefore, we logically divide the eDRAM module into regions or tiles, profile the retention properties of each tile, and program their refresh requirements in small counters in the cache controller. With this architecture, also called Mosaic, we refresh each tile at a different rate. The result is a 20× reduction in the number of refreshes in large eDRAM modules, practically eliminating refresh as a source of energy consumption.
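The refresh-saving argument can be made concrete with a toy model: give each tile its own refresh period, keep a small countdown counter per tile, and compare the refresh count against refreshing every tile at the worst-case (weakest-cell) rate. The tile count, periods, and simulation length below are invented numbers, not data from the paper.

```c
#include <stdio.h>

#define NUM_TILES          8
#define SIM_TICKS          1024  /* simulated refresh-clock ticks           */
#define WORST_CASE_PERIOD  4     /* refresh period dictated by weakest cell */

int main(void)
{
    /* Per-tile refresh period in ticks, as a profiling step might program
     * into the cache controller's counters (values are made up). */
    int period[NUM_TILES]  = { 4, 64, 128, 32, 256, 64, 512, 16 };
    int counter[NUM_TILES] = { 0 };

    long per_tile = 0, worst_case = 0;
    for (int t = 0; t < SIM_TICKS; t++) {
        if (t % WORST_CASE_PERIOD == 0)
            worst_case += NUM_TILES;         /* refresh everything, always */
        for (int i = 0; i < NUM_TILES; i++) {
            if (++counter[i] >= period[i]) { /* per-tile counter expired   */
                counter[i] = 0;
                per_tile++;                  /* refresh only this tile     */
            }
        }
    }
    printf("worst-case refreshes: %ld\n", worst_case);
    printf("per-tile refreshes:   %ld\n", per_tile);
    return 0;
}
```

With these made-up periods the per-tile scheme issues roughly a fifth of the worst-case refreshes; on real retention data the paper reports about a 20× reduction.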
{"title":"Mosaic: Exploiting the spatial locality of process variation to reduce refresh energy in on-chip eDRAM modules","authors":"Aditya Agrawal, Amin Ansari, J. Torrellas","doi":"10.1109/HPCA.2014.6835978","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835978","url":null,"abstract":"EDRAM cells require periodic refresh, which ends up consuming substantial energy for large last-level caches. In practice, it is well known that different eDRAM cells can exhibit very different charge-retention properties. Unfortunately, current systems pessimistically assume worst-case retention times, and end up refreshing all the cells at a conservatively-high rate. In this paper, we propose an alternative approach. We use known facts about the factors that determine the retention properties of cells to build a new model of eDRAM retention times. The model is called Mosaic. The model shows that the retention times of cells in large eDRAM modules exhibit spatial correlation. Therefore, we logically divide the eDRAM module into regions or tiles, profile the retention properties of each tile, and program their refresh requirements in small counters in the cache controller. With this architecture, also called Mosaic, we refresh each tile at a different rate. The result is a 20x reduction in the number of refreshes in large eDRAM modules - practically eliminating refresh as a source of energy consumption.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131908896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 65
Reducing the cost of persistence for nonvolatile heaps in end user devices
Sudarsun Kannan, Ada Gavrilovska, K. Schwan
This paper explores the performance implications of using future byte-addressable non-volatile memory (NVM), such as PCM, in end client devices. We explore how to obtain dual benefits, increased capacity and faster persistence, with low overhead and cost. Specifically, while increased memory capacity can be gained by treating NVM as virtual memory, using it for persistent data storage incurs high consistency (frequent cache flushes) and durability (logging for failure) overheads, referred to as the 'persistence cost'. These overheads not only affect the applications causing them, but also other applications relying on the same cache and/or memory hierarchy. This paper analyzes and quantifies in detail the performance overheads of persistence, which include (1) the aforementioned cache interference, (2) memory allocator overheads, and (3) durability costs due to logging. Novel solutions to overcome these overheads include (1) a page contiguity algorithm that reduces interference-related cache misses, (2) a cache-efficient, NVM write-aware memory allocator that reduces cache line flushes of allocator state by 8X, and (3) hybrid logging that reduces durability overheads substantially. With these solutions, experimental evaluations with different end user applications and SPEC2006 benchmarks show up to 12% reductions in cache misses, thereby reducing the total number of NVM writes.
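The 'persistence cost' the abstract refers to is essentially the cache-line flushes and fences needed to push updates to NVM in a recoverable order. The sketch below shows that cost in an undo-logging store on x86; the log layout and data structures are invented, plain clflush/sfence are used for portability (real systems would favor clwb or clflushopt), and NVM is simulated with ordinary arrays.

```c
/* Build: gcc -O2 nvm_sketch.c   (x86; uses clflush + sfence) */
#include <emmintrin.h>
#include <stdint.h>
#include <stdio.h>

/* Stand-ins for regions that would live in byte-addressable NVM. */
static long data[16];
static struct { int idx; long old_val; int valid; } undo_log;

/* Force a range of cache lines to (simulated) NVM and order it before
 * later writes.  This flush-plus-fence pair is the consistency overhead
 * the paper measures. */
static void persist(const void *p, size_t len)
{
    for (uintptr_t a = (uintptr_t)p & ~63UL; a < (uintptr_t)p + len; a += 64)
        _mm_clflush((const void *)a);
    _mm_sfence();
}

/* Durable update with undo logging: log first, then the in-place write,
 * then retire the log entry, each step made persistent before the next. */
static void durable_store(int idx, long val)
{
    undo_log.idx = idx;
    undo_log.old_val = data[idx];
    undo_log.valid = 1;
    persist(&undo_log, sizeof undo_log);    /* log must reach NVM first */

    data[idx] = val;
    persist(&data[idx], sizeof data[idx]);  /* then the in-place update */

    undo_log.valid = 0;                     /* retire the log entry     */
    persist(&undo_log.valid, sizeof undo_log.valid);
}

int main(void)
{
    durable_store(3, 42);
    printf("data[3] = %ld\n", data[3]);
    return 0;
}
```

Each durable_store pays several flush-and-fence rounds; shrinking this kind of overhead, along with the allocator-state flushes around it, is what the paper's hybrid logging and write-aware allocator target.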
{"title":"Reducing the cost of persistence for nonvolatile heaps in end user devices","authors":"Sudarsun Kannan, Ada Gavrilovska, K. Schwan","doi":"10.1109/HPCA.2014.6835960","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835960","url":null,"abstract":"This paper explores the performance implications of using future byte addressable non-volatile memory (NVM) like PCM in end client devices. We explore how to obtain dual benefits - increased capacity and faster persistence - with low overhead and cost. Specifically, while increasing memory capacity can be gained by treating NVM as virtual memory, its use of persistent data storage incurs high consistency (frequent cache flushes) and durability (logging for failure) overheads, referred to as `persistence cost'. These not only affect the applications causing them, but also other applications relying on the same cache and/or memory hierarchy. This paper analyzes and quantifies in detail the performance overheads of persistence, which include (1) the aforementioned cache interference as well as (2) memory allocator overheads, and finally, (3) durability costs due to logging. Novel solutions to overcome such overheads include (1) a page contiguity algorithm that reduces interference-related cache misses, (2) a cache efficient NVM write aware memory allocator that reduces cache line flushes of allocator state by 8X, and (3) hybrid logging that reduces durability overheads substantially. With these solutions, experimental evaluations with different end user applications and SPEC2006 benchmarks show up to 12% reductions in cache misses, thereby reducing the total number of NVM writes.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114607933","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 24
3D stacking of high-performance processors
P. Emma, A. Buyuktosunoglu, Michael B. Healy, K. Kailas, Valentin Puente, R. Yu, A. Hartstein, P. Bose, J. Moreno, E. Kursun
In most 3D work to date, people have looked at two situations: 1) a case in which power density is not a problem, and the parts of a processor and/or entire processors can be stacked atop each other, and 2) a case in which power density is limited, and storage is stacked atop processors. In this paper, we consider the case in which power density is a limitation, yet we stack processors atop processors. We also discuss some of the physical limitations today that render many of the good ideas presented in other work impractical, and what would be required of the technology to make them feasible. In the high-performance regime, circuits are not designed to be “power efficient;” they are designed to be fast. In a power-efficient design, the speed and power of a processor should be nearly proportional, whereas in the high-performance regime, frequency is increasingly sublinear in power. Thus, when the power density is constrained, as it is in high-performance machines, there may be opportunities to selectively exploit parallelism in workloads by running processor-on-processor systems at the same power, yet at much greater than half speed.
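As a back-of-the-envelope illustration of that last claim (the scaling exponent is an assumption for illustration, not a figure from the paper): if frequency scales roughly as the cube root of power in the high-performance regime, then splitting a fixed power budget $P$ across two stacked cores gives

$$
2\,f\!\left(\tfrac{P}{2}\right) \;\propto\; 2\left(\tfrac{P}{2}\right)^{1/3} \;\approx\; 1.59\,P^{1/3},
\qquad f\!\left(\tfrac{P}{2}\right) \approx 0.79\,f(P).
$$

In other words, each stacked core runs at roughly 79% of full speed, so a two-way-parallel workload gets about 1.6× the throughput of one core at the same total power, which is the sense in which stacked processors can run at "much greater than half speed."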
{"title":"3D stacking of high-performance processors","authors":"P. Emma, A. Buyuktosunoglu, Michael B. Healy, K. Kailas, Valentin Puente, R. Yu, A. Hartstein, P. Bose, J. Moreno, E. Kursun","doi":"10.1109/HPCA.2014.6835959","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835959","url":null,"abstract":"In most 3D work to date, people have looked at two situations: 1) a case in which power density is not a problem, and the parts of a processor and/or entire processors can be stacked atop each other, and 2) a case in which power density is limited, and storage is stacked atop processors. In this paper, we consider the case in which power density is a limitation, yet we stack processors atop processors. We also will discuss some of the physical limitations today that render many of the good ideas presented in other work impractical, and what would be required in the technology to make them feasible. In the high-performance regime, circuits are not designed to be “power efficient;” they're designed to be fast. In power-efficient design, the speed and power of a processor should be nearly proportional. In the high-performance regime, the frequency is (ever progressingly) sublinear in power. Thus, when the power density is constrained - as it is in high-performance machines, there may be opportunities to selectively exploit parallelism in workloads by running processor-on-processor systems at the same power, yet at much greater than half speed.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129275475","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
PVCoherence: Designing flat coherence protocols for scalable verification
Meng Zhang, Jesse D. Bingham, John Erickson, Daniel J. Sorin
The goal of this work is to design cache coherence protocols for many-core chips that can be verified with state-of-the-art automated verification methodologies. In particular, we focus on flat (non-hierarchical) coherence protocols, and we use a mostly-automated methodology based on parametric verification (PV). We propose several design guidelines that architects should follow if they want to design protocols that can be parametrically verified. We experimentally evaluate the performance, storage overhead, and scalability of a protocol verified with PV compared to a highly optimized protocol that cannot be verified with PV.
{"title":"PVCoherence: Designing flat coherence protocols for scalable verification","authors":"Meng Zhang, Jesse D. Bingham, John Erickson, Daniel J. Sorin","doi":"10.1109/MM.2015.48","DOIUrl":"https://doi.org/10.1109/MM.2015.48","url":null,"abstract":"The goal of this work is to design cache coherence protocols with many cores that can be verified with state-of-the-art automated verification methodologies. In particular, we focus on flat (non-hierarchical) coherence protocols, and we use a mostly-automated methodology based on parametric verification (PV). We propose several design guidelines that architects should follow if they want to design protocols that can be parametrically verified. We experimentally evaluate performance, storage overhead, and scalability of a protocol verified with PV compared to a highly optimized protocol that cannot be verified with PV.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126494621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 28
QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers
D. DiTomaso, Avinash Karanth Kodi, A. Louri
Networks-on-Chip (NoCs) are quickly becoming the standard communication paradigm for the growing number of cores on the chip. While NoCs can deliver sufficient bandwidth and enhance scalability, they suffer from high power consumption due to the router microarchitecture and the communication channels that facilitate inter-core communication. As technology keeps scaling down in the nanometer regime, unpredictable device behavior due to aging, infant mortality, design defects, soft errors, aggressive design, and process-voltage-temperature variations will increase, resulting in a significant increase in faults (both permanent and transient) and hardware failures. In this paper, we propose QORE, a fault-tolerant NoC architecture with Quad-Function Channel (QFC) buffers. The use of QFC buffers and their associated control (link and fault controllers) enhances fault tolerance by allowing the NoC to dynamically adapt to faults at the link level and reverse propagation direction to avoid faulty links. Additionally, QFC buffers reduce router power and improve performance by eliminating in-router buffering. Our simulation results using real benchmarks and synthetic traffic mixes show that QORE improves speedup by 1.3× and throughput by 2.3× when compared to state-of-the-art fault-tolerant NoC designs such as Ariadne and Vicis. Moreover, using Synopsys Design Compiler, we show that network power in QORE is reduced by 21% with minimal control overhead.
{"title":"QORE: A fault tolerant network-on-chip architecture with power-efficient quad-function channel (QFC) buffers","authors":"D. DiTomaso, Avinash Karanth Kodi, A. Louri","doi":"10.1109/HPCA.2014.6835942","DOIUrl":"https://doi.org/10.1109/HPCA.2014.6835942","url":null,"abstract":"Network-on-Chips (NoCs) are quickly becoming the standard communication paradigm for the growing number of cores on the chip. While NoCs can deliver sufficient bandwidth and enhance scalability, NoCs suffer from high power consumption due to the router microarchitecture and communication channels that facilitate inter-core communication. As technology keeps scaling down in the nanometer regime, unpredictable device behavior due to aging, infant mortality, design defects, soft errors, aggressive design, and process-voltage-temperature variations, will increase and will result in a significant increase in faults (both permanent and transient) and hardware failures. In this paper, we propose QORE - a fault tolerant NoC architecture with Quad-Function Channel (QFC) buffers. The use of QFC buffers and their associated control (link and fault controllers) enhance fault-tolerance by allowing the NoC to dynamically adapt to faults at the link level and reverse propagation direction to avoid faulty links. Additionally, QFC buffers reduce router power and improve performance by eliminating in-router buffering. Our simulation results using real benchmarks and synthetic traffic mixes show that QORE improves speedup by 1.3× and throughput by 2.3× when compared to state-of-the art fault tolerant NoCs designs such as Ariadne and Vicis. Moreover, using Synopsys Design Compiler, we also show that network power in QORE is reduced by 21% with minimal control overhead.","PeriodicalId":164587,"journal":{"name":"2014 IEEE 20th International Symposium on High Performance Computer Architecture (HPCA)","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2014-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115037851","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 42