
Latest publications from the 2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)

BEAR: Techniques for mitigating bandwidth bloat in gigascale DRAM caches
Chiachen Chou, A. Jaleel, Moinuddin K. Qureshi
Die stacking memory technology can enable gigascale DRAM caches that can operate at 4x-8x higher bandwidth than commodity DRAM. Such caches can improve system performance by servicing data at a faster rate when the requested data is found in the cache, potentially increasing the memory bandwidth of the system by 4x-8x. Unfortunately, a DRAM cache uses the available memory bandwidth not only for data transfer on cache hits, but also for other secondary operations such as cache miss detection, fill on cache miss, and writeback lookup and content update on dirty evictions from the last-level on-chip cache. Ideally, we want the bandwidth consumed for such secondary operations to be negligible, and have almost all the bandwidth be available for transfer of useful data from the DRAM cache to the processor. We evaluate a 1GB DRAM cache, architected as Alloy Cache, and show that even the most bandwidth-efficient proposal for DRAM cache consumes 3.8x bandwidth compared to an idealized DRAM cache that does not consume any bandwidth for secondary operations. We also show that redesigning the DRAM cache to minimize the bandwidth consumed by secondary operations can potentially improve system performance by 22%. To that end, this paper proposes Bandwidth Efficient ARchitecture (BEAR) for DRAM caches. BEAR integrates three components, one each for reducing the bandwidth consumed by miss detection, miss fill, and writeback probes. BEAR reduces the bandwidth consumption of DRAM cache by 32%, which reduces cache hit latency by 24% and increases overall system performance by 10%. BEAR, with negligible overhead, outperforms an idealized SRAM Tag-Store design that incurs an unacceptable overhead of 64 megabytes, as well as Sector Cache designs that incur an SRAM storage overhead of 6 megabytes.
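To make "bandwidth bloat" concrete, the toy accounting below splits DRAM-cache traffic into useful hit-data transfer versus the secondary operations listed in the abstract (miss detection probes, miss fills, writeback probes and updates). The per-operation byte costs, request mix, and function names are illustrative assumptions, not figures or code from the paper.

```python
# Illustrative model of DRAM-cache bandwidth bloat (assumed costs, not paper data).
LINE = 64          # cache-line size in bytes
TAG  = 8           # assumed bytes moved per tag probe

def bandwidth_breakdown(accesses, hit_rate, dirty_writeback_rate):
    """Split DRAM-cache traffic into useful hit data vs. secondary operations."""
    hits   = accesses * hit_rate
    misses = accesses - hits
    wbs    = accesses * dirty_writeback_rate   # dirty evictions from the on-chip LLC

    useful    = hits * LINE                    # data returned to the processor
    secondary = (
        accesses * TAG        # hit/miss detection: every access probes the tags
        + misses * LINE       # miss fill: line installed into the DRAM cache
        + wbs * (TAG + LINE)  # writeback probe + content update
    )
    return useful, secondary

useful, secondary = bandwidth_breakdown(accesses=1_000_000,
                                        hit_rate=0.5,
                                        dirty_writeback_rate=0.2)
print(f"bloat factor = {(useful + secondary) / useful:.2f}x of an idealized cache")
```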
DOI: 10.1145/2749469.2750387 | Pages: 198-210 | Published: 2015-06-13
Citations: 82
A fully associative, tagless DRAM cache
Yongjun Lee, JongWon Kim, Hakbeom Jang, Hyunggyun Yang, Jang-Hyun Kim, Jinkyu Jeong, Jae W. Lee
This paper introduces a tagless cache architecture for large in-package DRAM caches. The conventional die-stacked DRAM cache has both a TLB and a cache tag array, which are responsible for virtual-to-physical and physical-to-cache address translation, respectively. We propose to align the granularity of caching with OS page size and take a unified approach to address translation and cache tag management. To this end, we introduce cache-map TLB (cTLB), which stores virtual-to-cache, instead of virtual-to-physical, address mappings. At a TLB miss, the TLB miss handler allocates the requested block into the cache if it is not cached yet, and updates both the page table and cTLB with the virtual-to-cache address mapping. Assuming the availability of large in-package DRAM caches, this ensures that an access to the memory region within the TLB reach always hits in the cache with low hit latency since a TLB access immediately returns the exact location of the requested block in the cache, hence saving a tag-checking operation. The remaining cache space is used as a victim cache for memory pages that are recently evicted from cTLB. By completely eliminating data structures for cache tag management, from either on-die SRAM or in-package DRAM, the proposed DRAM cache achieves the best scalability and hit latency, while maintaining the high hit rate of a fully associative cache. Our evaluation with 3D Through-Silicon Via (TSV)-based in-package DRAM demonstrates that the proposed cache improves the IPC and energy efficiency by 30.9% and 39.5%, respectively, compared to the baseline with no DRAM cache. These numbers translate to 4.3% and 23.8% improvements over an impractical SRAM-tag cache requiring megabytes of on-die SRAM storage, due to low hit latency and zero energy waste for cache tags.
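The cTLB lookup path can be summarized with a toy model: a cTLB hit returns the block's cache location directly, so no tag check is needed, while a miss falls to a handler that allocates the page in the DRAM cache and installs the virtual-to-cache mapping in both the cTLB and the page table. The class and structures below are a simplified sketch under those assumptions, not the paper's implementation.

```python
# Toy model of a cache-map TLB (cTLB): virtual page -> DRAM-cache slot, no tag array.
class TaglessDramCache:
    def __init__(self, num_slots):
        self.free_slots = list(range(num_slots))
        self.ctlb = {}        # virtual page number -> cache slot (virtual-to-cache mapping)
        self.page_table = {}  # OS page table mirror holding the same mapping

    def access(self, vpn):
        slot = self.ctlb.get(vpn)
        if slot is not None:
            return slot                      # cTLB hit: exact cache location, no tag check
        return self._ctlb_miss_handler(vpn)  # cTLB miss: software handler fills the cache

    def _ctlb_miss_handler(self, vpn):
        slot = self.free_slots.pop()         # allocate a page-sized block in the DRAM cache
        # (a real system would also copy the page from off-package memory and handle
        #  eviction into the victim space; both are omitted in this sketch)
        self.ctlb[vpn] = slot                # update the cTLB ...
        self.page_table[vpn] = slot          # ... and the page table with the same mapping
        return slot

cache = TaglessDramCache(num_slots=4)
print(cache.access(vpn=42), cache.access(vpn=42))  # miss then hit, same slot
```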
DOI: 10.1145/2749469.2750383 | Pages: 211-222 | Published: 2015-06-13
Citations: 75
Thermal time shifting: Leveraging phase change materials to reduce cooling costs in warehouse-scale computers
Matt Skach, Manish Arora, Chang-Hong Hsu, Qi Li, D. Tullsen, Lingjia Tang, Jason Mars
Datacenters, or warehouse scale computers, are rapidly increasing in size and power consumption. However, this growth comes at the cost of an increasing thermal load that must be removed to prevent overheating and server failure. In this paper, we propose to use phase changing materials (PCM) to shape the thermal load of a datacenter, absorbing and releasing heat when it is advantageous to do so. We present and validate a methodology to study the impact of PCM on a datacenter, and evaluate two important opportunities for cost savings. We find that in a datacenter with full cooling system subscription, PCM can reduce the necessary cooling system size by up to 12% without impacting peak throughput, or increase the number of servers by up to 14.6% without increasing the cooling load. In a thermally constrained setting, PCM can increase peak throughput up to 69% while delaying the onset of thermal limits by over 3 hours.
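Thermal time shifting can be viewed as buffering heat between peaks and troughs: when server heat exceeds cooling capacity the PCM absorbs the excess (melting), and it releases that heat back (refreezing) when load drops. The time-step model below is a sketch of that behavior with made-up numbers; it is not the paper's methodology.

```python
# Toy time-step model of PCM thermal buffering (all numbers illustrative).
def cooling_load(server_heat, cooling_capacity, pcm_capacity):
    """Return the heat the cooling system must remove each step when PCM is present."""
    stored = 0.0
    with_pcm = []
    for heat in server_heat:
        if heat > cooling_capacity and stored < pcm_capacity:
            absorb = min(heat - cooling_capacity, pcm_capacity - stored)
            stored += absorb                 # PCM melts, soaking up the excess heat
            heat -= absorb
        elif heat < cooling_capacity and stored > 0:
            release = min(cooling_capacity - heat, stored)
            stored -= release                # PCM refreezes, shedding heat during the trough
            heat += release
        with_pcm.append(heat)
    return with_pcm

load = [80, 120, 130, 90, 60, 50]            # kW of server heat over six intervals
print(cooling_load(load, cooling_capacity=100, pcm_capacity=60))
# Peak cooling load drops from 130 to 100 kW in this fabricated example.
```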
DOI: 10.1145/2749469.2749474 | Pages: 439-449 | Published: 2015-06-13
Citations: 41
A case for Core-Assisted Bottleneck Acceleration in GPUs: Enabling flexible data compression with assist warps
Nandita Vijaykumar, Gennady Pekhimenko, Adwait Jog, A. Bhowmick, Rachata Ausavarungnirun, C. Das, M. Kandemir, T. Mowry, O. Mutlu
Modern Graphics Processing Units (GPUs) are well provisioned to support the concurrent execution of thousands of threads. Unfortunately, different bottlenecks during execution and heterogeneous application requirements create imbalances in utilization of resources in the cores. For example, when a GPU is bottlenecked by the available off-chip memory bandwidth, its computational resources are often overwhelmingly idle, waiting for data from memory to arrive. This paper introduces the Core-Assisted Bottleneck Acceleration (CABA) framework that employs idle on-chip resources to alleviate different bottlenecks in GPU execution. CABA provides flexible mechanisms to automatically generate "assist warps" that execute on GPU cores to perform specific tasks that can improve GPU performance and efficiency. CABA enables the use of idle computational units and pipelines to alleviate the memory bandwidth bottleneck, e.g., by using assist warps to perform data compression to transfer less data from memory. Conversely, the same framework can be employed to handle cases where the GPU is bottlenecked by the available computational units, in which case the memory pipelines are idle and can be used by CABA to speed up computation, e.g., by performing memoization using assist warps. We provide a comprehensive design and evaluation of CABA to perform effective and flexible data compression in the GPU memory hierarchy to alleviate the memory bandwidth bottleneck. Our extensive evaluations show that CABA, when used to implement data compression, provides an average performance improvement of 41.7% (as high as 2.6X) across a variety of memory-bandwidth-sensitive GPGPU applications.
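The core policy, steering idle resources at whichever resource is the bottleneck, can be sketched as a simple decision rule: when memory bandwidth is saturated, assist warps run compression on idle compute units; when compute is saturated, idle memory pipelines can serve memoization instead. The thresholds and names below are assumptions for illustration, not CABA's actual hardware mechanism.

```python
# Simplified sketch of a CABA-style decision: aim idle resources at the bottleneck.
def pick_assist_task(compute_util, mem_bw_util, busy_threshold=0.9):
    """Choose what an 'assist warp' should do, given per-cycle utilization estimates."""
    if mem_bw_util >= busy_threshold and compute_util < busy_threshold:
        # Memory-bandwidth-bound: burn spare ALU cycles to compress/decompress data
        # so fewer bytes cross the off-chip interface.
        return "compress_data_on_idle_compute_units"
    if compute_util >= busy_threshold and mem_bw_util < busy_threshold:
        # Compute-bound: idle memory pipelines can serve a memoization table instead.
        return "memoize_results_via_idle_memory_pipeline"
    return "no_assist_warp"

print(pick_assist_task(compute_util=0.35, mem_bw_util=0.97))  # -> compression helper
```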
DOI: 10.1145/2749469.2750399 | Pages: 41-53 | Published: 2015-06-13
Citations: 106
Cost-effective speculative scheduling in high performance processors
Arthur Perais, André Seznec, P. Michaud, Andreas Sembrant, Erik Hagersten
To maximize performance, out-of-order execution processors sometimes issue instructions without having the guarantee that operands will be available in time; e.g. loads are typically assumed to hit in the L1 cache and dependent instructions are issued accordingly. This form of speculation - that we refer to as speculative scheduling - has been used for two decades in real processors, but has received little attention from the research community. In particular, as pipeline depth grows, and the distance between the Issue and the Execute stages increases, it becomes critical to issue instructions dependent on variable-latency instructions as soon as possible rather than wait for the actual cycle at which the result becomes available. Unfortunately, due to the uncertain nature of speculative scheduling, the scheduler may wrongly issue an instruction that will not have its source(s) available on the bypass network when it reaches the Execute stage. In that event, the instruction is canceled and replayed, potentially impairing performance and increasing energy consumption. In this work, we do not present a new replay mechanism. Rather, we focus on ways to reduce the number of replays that are agnostic of the replay scheme. First, we propose an easily implementable, low-cost solution to reduce the number of replays caused by L1 bank conflicts. Schedule shifting always assumes that, given a dual-load issue capacity, the second load issued in a given cycle will be delayed because of a bank conflict. Its dependents are thus always issued with the corresponding delay. Second, we also improve on existing L1 hit/miss prediction schemes by taking into account instruction criticality. That is, for some criterion of criticality and for loads whose hit/miss behavior is hard to predict, we show that it is more cost-effective to stall dependents if the load is not predicted critical.
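Schedule shifting reduces to one rule in a toy wakeup model: the second load issued in a cycle is always assumed to suffer a bank-conflict delay, so its dependents are scheduled that many cycles later. The latencies and the one-cycle conflict penalty below are assumed values for illustration, not the paper's parameters.

```python
# Toy wakeup model for "schedule shifting" (assumed latencies, not the paper's design).
L1_HIT_LATENCY = 4      # cycles from issue to data, assuming an L1 hit
BANK_CONFLICT  = 1      # extra cycles charged to the second load issued in a cycle

def dependent_wakeup_cycles(loads_per_cycle):
    """For each load issued in one cycle, return when its dependents may issue."""
    wakeups = []
    for slot, load in enumerate(loads_per_cycle):
        penalty = BANK_CONFLICT if slot == 1 else 0   # second load: always assume a conflict
        wakeups.append((load, L1_HIT_LATENCY + penalty))
    return wakeups

print(dependent_wakeup_cycles(["load_A", "load_B"]))
# [('load_A', 4), ('load_B', 5)] -> load_B's dependents are shifted by the conflict delay
```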
DOI: 10.1145/2872887.2749470 | Pages: 247-259 | Published: 2015-06-13
Citations: 16
Profiling a warehouse-scale computer
Svilen Kanev, Juan Pablo Darago, K. Hazelwood, Parthasarathy Ranganathan, Tipp Moseley, Gu-Yeon Wei, D. Brooks
With the increasing prevalence of warehouse-scale (WSC) and cloud computing, understanding the interactions of server applications with the underlying microarchitecture becomes ever more important in order to extract maximum performance out of server hardware. To aid such understanding, this paper presents a detailed microarchitectural analysis of live datacenter jobs, measured on more than 20,000 Google machines over a three year period, and comprising thousands of different applications. We first find that WSC workloads are extremely diverse, breeding the need for architectures that can tolerate application variability without performance loss. However, some patterns emerge, offering opportunities for co-optimization of hardware and software. For example, we identify common building blocks in the lower levels of the software stack. This “datacenter tax” can comprise nearly 30% of cycles across jobs running in the fleet, which makes its constituents prime candidates for hardware specialization in future server systems-on-chips. We also uncover opportunities for classic microarchitectural optimizations for server processors, especially in the cache hierarchy. Typical workloads place significant stress on instruction caches and prefer memory latency over bandwidth. They also stall cores often, but compute heavily in bursts. These observations motivate several interesting directions for future warehouse-scale computers.
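The reported "datacenter tax" is the share of fleet-wide cycles spent in common low-level components such as RPC, serialization, compression, and memory allocation. Given a per-component cycle profile, computing that share is a simple aggregation; the component names and sample profile below are fabricated purely for illustration, not the paper's data.

```python
# Illustrative computation of the "datacenter tax" share of fleet cycles (made-up profile).
TAX_COMPONENTS = {"protobuf", "rpc", "compression", "memmove", "allocator", "hashing"}

def datacenter_tax(cycle_profile):
    """Fraction of total cycles spent in common low-level 'tax' components."""
    total = sum(cycle_profile.values())
    tax = sum(cycles for name, cycles in cycle_profile.items() if name in TAX_COMPONENTS)
    return tax / total

profile = {"application_logic": 60, "protobuf": 10, "rpc": 7, "compression": 4,
           "memmove": 4, "allocator": 3, "hashing": 2, "kernel": 10}
print(f"datacenter tax ~ {datacenter_tax(profile):.0%}")   # ~30% in this fabricated sample
```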
DOI: 10.1145/2749469.2750392 | Pages: 158-169 | Published: 2015-06-13
Citations: 346
Clean: A race detector with cleaner semantics
Cedomir Segulja, T. Abdelrahman
Data races make parallel programs hard to understand. Precise race detection that stops an execution on first occurrence of a race addresses this problem, but it comes with significant overhead. In this work, we exploit the insight that precisely detecting only write-after-write (WAW) and read-after-write (RAW) races suffices to provide cleaner semantics for racy programs. We demonstrate that stopping an execution only when these races occur ensures that synchronization-free-regions appear to be executed in isolation and that their writes appear atomic. Additionally, the undetected racy executions can be given certain deterministic guarantees with efficient mechanisms. We present CLEAN, a system that precisely detects WAW and RAW races and deterministically orders synchronization. We demonstrate that the combination of these two relatively inexpensive mechanisms provides cleaner semantics for racy programs. We evaluate both software-only and hardware-supported CLEAN. The software-only CLEAN runs all Pthread benchmarks from the SPLASH-2 and PARSEC suites with an average 7.8x slowdown. The overhead of precise WAW and RAW detection (5.8x) constitutes the majority of this slowdown. Simple hardware extensions reduce the slowdown of CLEAN's race detection to on average 10.4% and never more than 46.7%.
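The restriction to write-after-write and read-after-write races can be captured with per-address last-writer metadata: an access is flagged only if a different thread wrote the address since the threads last synchronized, and reads are never recorded, so write-after-read is deliberately ignored. The sketch below illustrates that rule with a single global synchronization epoch (as if every sync were a barrier); it is a simplification, not CLEAN's precise mechanism.

```python
# Heavily simplified WAW/RAW-only race detector: a single global epoch stands in for
# synchronization (as if every sync were a global barrier); CLEAN itself is more precise.
class WawRawDetector:
    def __init__(self):
        self.epoch = 0
        self.last_write = {}   # address -> (writer thread, epoch of that write)

    def sync(self):
        self.epoch += 1        # all threads synchronize; earlier writes can no longer race

    def on_write(self, thread, addr):
        writer, epoch = self.last_write.get(addr, (thread, -1))
        if writer != thread and epoch == self.epoch:
            raise RuntimeError(f"WAW race on {addr:#x} between T{writer} and T{thread}")
        self.last_write[addr] = (thread, self.epoch)

    def on_read(self, thread, addr):
        writer, epoch = self.last_write.get(addr, (thread, -1))
        if writer != thread and epoch == self.epoch:
            raise RuntimeError(f"RAW race on {addr:#x} between T{writer} and T{thread}")
        # Reads are not recorded, so write-after-read (WAR) is deliberately never flagged.

d = WawRawDetector()
d.on_write(thread=0, addr=0x10)
d.sync()
d.on_read(thread=1, addr=0x10)         # fine: the write is separated by a synchronization
d.on_write(thread=0, addr=0x10)
try:
    d.on_read(thread=1, addr=0x10)     # unsynchronized read-after-write
except RuntimeError as race:
    print(race)                        # -> RAW race on 0x10 between T0 and T1
```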
DOI: 10.1145/2749469.2750395 | Pages: 401-413 | Published: 2015-06-13
Citations: 17