Proceedings of the 40th Annual International Symposium on Computer Architecture: Latest Publications

Studying multicore processor scaling via reuse distance analysis
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485965
Meng-Ju Wu, Minshu Zhao, D. Yeung
The trend for multicore processors is towards increasing numbers of cores, with 100s of cores--i.e. large-scale chip multiprocessors (LCMPs)--possible in the future. The key to realizing the potential of LCMPs is the cache hierarchy, so studying how memory performance will scale is crucial. Reuse distance (RD) analysis can help architects do this. In particular, recent work has developed concurrent reuse distance (CRD) and private reuse distance (PRD) profiles to enable analysis of shared and private caches. Also, techniques have been developed to predict profiles across problem size and core count, enabling the analysis of configurations that are too large to simulate. This paper applies RD analysis to study the scalability of multicore cache hierarchies. We present a framework based on CRD and PRD profiles for reasoning about the locality impact of core count and problem scaling. We find interference-based locality degradation is more significant than sharing-based locality degradation. For 256 cores running small problems, the former occurs at small cache sizes, allowing moderate capacity scaling of multicore caches to achieve the same cache performance (MPKI) as a single-core cache. At very large problems, interference-based locality degradation increases significantly in many of our benchmarks. For shared caches, this prevents most of our benchmarks from achieving constant-MPKI scaling within a 256 MB capacity budget; for private caches, none of our benchmarks can achieve constant-MPKI scaling within 256 MB.
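For readers new to the metric, the sketch below shows how a reuse distance profile is computed from an address trace and how it yields MPKI for a fully associative LRU cache, the property that makes RD analysis predictive. This is an illustrative Python sketch, not the authors' tooling; the trace, cache size, and instruction count are invented.

```python
from collections import OrderedDict

def reuse_distance_profile(trace):
    """For each access, count the distinct addresses touched since the
    previous access to the same address (infinity on first touch)."""
    last_seen = OrderedDict()  # keys kept in recency order
    profile = []
    for addr in trace:
        if addr in last_seen:
            stack = list(last_seen)
            # distinct addresses accessed since the last touch of addr
            profile.append(len(stack) - stack.index(addr) - 1)
            last_seen.move_to_end(addr)
        else:
            profile.append(float("inf"))
            last_seen[addr] = None
    return profile

def mpki(profile, cache_blocks, instructions):
    """A fully associative LRU cache of C blocks misses exactly those
    accesses whose reuse distance is >= C; normalize per 1000 instructions."""
    misses = sum(1 for d in profile if d >= cache_blocks)
    return 1000.0 * misses / instructions

trace = ["a", "b", "c", "a", "b", "c"]
prof = reuse_distance_profile(trace)
print(prof)                      # [inf, inf, inf, 2, 2, 2]
print(mpki(prof, 4, len(trace))) # 500.0: only the 3 cold misses remain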
Citations: 36
ZSim: fast and accurate microarchitectural simulation of thousand-core systems
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485963
Daniel Sánchez, C. Kozyrakis
Architectural simulation is time-consuming, and the trend towards hundreds of cores is making sequential simulation even slower. Existing parallel simulation techniques either scale poorly due to excessive synchronization, or sacrifice accuracy by allowing event reordering and using simplistic contention models. As a result, most researchers use sequential simulators and model small-scale systems with 16-32 cores. With 100-core chips already available, developing simulators that scale to thousands of cores is crucial. We present three novel techniques that, together, make thousand-core simulation practical. First, we speed up detailed core models (including OOO cores) with instruction-driven timing models that leverage dynamic binary translation. Second, we introduce bound-weave, a two-phase parallelization technique that scales parallel simulation on multicore hosts efficiently with minimal loss of accuracy. Third, we implement lightweight user-level virtualization to support complex workloads, including multiprogrammed, client-server, and managed-runtime applications, without the need for full-system simulation, sidestepping the lack of scalable OSs and ISAs that support thousands of cores. We use these techniques to build zsim, a fast, scalable, and accurate simulator. On a 16-core host, zsim models a 1024-core chip at speeds of up to 1,500 MIPS using simple cores and up to 300 MIPS using detailed OOO cores, 2-3 orders of magnitude faster than existing parallel simulators. Simulator performance scales well with both the number of modeled cores and the number of host cores. We validate zsim against a real Westmere system on a wide variety of workloads, and find performance and microarchitectural events to be within a narrow range of the real system.
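To make the bound-weave idea concrete, here is a heavily simplified Python sketch of two-phase interval simulation. The `core.step`, `core.cid`, and `contention.apply` interfaces are hypothetical stand-ins; the real zsim is a C++ system with far more machinery.

```python
from concurrent.futures import ThreadPoolExecutor

INTERVAL = 1000  # cycles per parallelization interval (illustrative)

def bound_phase(core, start):
    """Phase 1 (bound): run one core for an interval assuming uncontended
    memory latencies. Cores do not interact here, so this runs in parallel."""
    events = []
    for cycle in range(start, start + INTERVAL):
        ev = core.step(cycle)  # hypothetical core model: simulate one cycle
        if ev is not None:     # ev is a memory-access record
            events.append((cycle, core.cid, ev))
    return events

def weave_phase(traces, contention):
    """Phase 2 (weave): merge per-core events in timestamp order and replay
    them through a detailed contention model to correct optimistic timings."""
    merged = sorted((e for t in traces for e in t), key=lambda e: (e[0], e[1]))
    for cycle, cid, ev in merged:
        contention.apply(cycle, cid, ev)  # hypothetical contention model

def simulate_interval(cores, start, contention):
    with ThreadPoolExecutor() as pool:
        traces = list(pool.map(lambda c: bound_phase(c, start), cores))
    weave_phase(traces, contention)
```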
Citations: 514
Whare-map: heterogeneity in "homogeneous" warehouse-scale computers
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485975
Jason Mars, Lingjia Tang
Modern "warehouse scale computers" (WSCs) continue to be embraced as homogeneous computing platforms. However, due to frequent machine replacements and upgrades, modern WSCs are in fact composed of diverse commodity microarchitectures and machine configurations. Yet, current WSCs are architected with the assumption of homogeneity, leaving a potentially significant performance opportunity unexplored. In this paper, we expose and quantify the performance impact of the "homogeneity assumption" for modern production WSCs using industry-strength large-scale web-service workloads. In addition, we argue for, and evaluate the benefits of, a heterogeneity-aware WSC using commercial web-service production workloads including Google's web-search. We also identify key factors impacting the available performance opportunity when exploiting heterogeneity and introduce a new metric, opportunity factor, to quantify an application's sensitivity to the heterogeneity in a given WSC. To exploit heterogeneity in "homogeneous" WSCs, we propose "Whare-Map," the WSC Heterogeneity Aware Mapper that leverages already in-place continuous profiling subsystems found in production environments. When employing "Whare-Map", we observe a cluster-wide performance improvement of 15% on average over heterogeneity--oblivious job placement and up to an 80% improvement for web-service applications that are particularly sensitive to heterogeneity.
Citations: 158
Reducing memory access latency with asymmetric DRAM bank organizations
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485955
Y. Son, O. Seongil, Yuhwan Ro, Jae W. Lee, Jung Ho Ahn
DRAM has been a de facto standard for main memory, and advances in process technology have led to a rapid increase in its capacity and bandwidth. In contrast, its random access latency has remained relatively stagnant, as it is still around 100 CPU clock cycles. Modern computer systems rely on caches or other latency tolerance techniques to lower the average access latency. However, not all applications have ample parallelism or locality that would help hide or reduce the latency. Moreover, applications' demands for memory space continue to grow, while the capacity gap between last-level caches and main memory is unlikely to shrink. Consequently, reducing the main-memory latency is important for application performance. Unfortunately, previous proposals have not adequately addressed this problem, as they have focused only on improving the bandwidth and capacity or reduced the latency at the cost of significant area overhead. We propose asymmetric DRAM bank organizations to reduce the average main-memory access latency. We first analyze the access and cycle times of a modern DRAM device to identify key delay components for latency reduction. Then we reorganize a subset of DRAM banks to reduce their access and cycle times by half with low area overhead. By synergistically combining these reorganized DRAM banks with support for non-uniform bank accesses, we introduce a novel DRAM bank organization with center high-aspect-ratio mats called CHARM. Experiments on a simulated chip-multiprocessor system show that CHARM improves both the instructions per cycle and system-wide energy-delay product up to 21% and 32%, respectively, with only a 3% increase in die area.
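A back-of-the-envelope model illustrates why halving the latency of only a subset of banks pays off: average latency is a weighted blend of fast-bank and slow-bank access times. The sketch assumes invented latency values and an invented fraction of accesses steered to fast banks; it is not CHARM's evaluated model.

```python
def avg_access_latency(t_slow_ns, fast_access_fraction, speedup=2.0):
    """Average latency when a fraction of accesses hit banks whose access
    time is cut by `speedup` (the paper halves it); the rest pay full cost."""
    t_fast_ns = t_slow_ns / speedup
    return fast_access_fraction * t_fast_ns + (1 - fast_access_fraction) * t_slow_ns

# Illustrative numbers only: with a 50 ns bank and 60% of accesses steered
# to the fast (half-latency) banks, the average drops to 35 ns, a 30% cut.
print(avg_access_latency(50.0, 0.60))  # 35.0
```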
Citations: 120
CPU transparent protection of OS kernel and hypervisor integrity with programmable DRAM
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485956
Ziyi Liu, Jong-Hyuk Lee, Junyuan Zeng, Y. Wen, Zhiqiang Lin, W. Shi
Increasingly, cyber attacks (e.g., kernel rootkits) target the inner rings of a computer system, and they have seriously undermined the integrity of entire computer systems. To eliminate these threats, it is imperative to develop innovative solutions running below the attack surface. This paper presents MGuard, a new innermost-ring solution for inspecting system integrity that is directly integrated with the DRAM DIMM devices. More specifically, we design a programmable guard that is integrated with the advanced memory buffer of FB-DIMM to continuously monitor all the memory traffic and detect system integrity violations. Unlike the existing approaches that are either snapshot-based or lack compatibility and flexibility, MGuard continuously monitors the integrity of all the outer rings including both the OS kernel and the hypervisor of interest, with greater extensibility enabled by a programmable interface. It offers a hardware drop-in solution transparent to the host CPU and memory controller. Moreover, MGuard is isolated from the host software and hardware, leading to strong security against remote attackers. Our simulation-based experimental results show that MGuard introduces no speed overhead, and is able to detect nearly all the OS-kernel and hypervisor control-data-related rootkits we tested.
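The sketch below illustrates the flavor of check a DIMM-side guard can apply to observed write traffic. The protected regions and addresses are invented, and the real design is programmable logic in the FB-DIMM advanced memory buffer rather than host software.

```python
# Region addresses and names are invented for illustration; a real guard
# would be configured with the actual layout of immutable kernel state.
PROTECTED_REGIONS = [
    (0x0100_0000, 0x0190_0000, "kernel .text"),
    (0x01A0_0000, 0x01A1_0000, "interrupt descriptor table"),
]

def check_write(phys_addr, value):
    """Flag any bus write that lands in a region the OS or hypervisor should
    never modify at runtime -- the signature of code-patching rootkits."""
    for lo, hi, name in PROTECTED_REGIONS:
        if lo <= phys_addr < hi:
            return f"ALERT: write of {value:#x} to {name} at {phys_addr:#x}"
    return None  # write falls outside all protected regions

print(check_write(0x0100_0040, 0x90909090))  # ALERT: ... kernel .text ...
print(check_write(0x7F00_0000, 0x1))         # None: ordinary memory
```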
Citations: 37
Non-race concurrency bug detection through order-sensitive critical sections
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485978
Ruirui C. Huang, Erik Halberg, G. Suh
This paper introduces a new heuristic condition for non-race concurrency bugs, named order-sensitive critical sections, and proposes a run-time bug detection scheme based on the condition. Order-sensitive critical sections are defined as a pair of critical sections that can lead to non-deterministic shared memory state depending on the order in which they execute. In a sense, order-sensitive critical sections can be seen as extending the intuition of using data races as a potential bug condition to capture non-race bugs. Experiments show that the proposed scheme provides good coverage of multiple types of non-race bugs, with a small number of false positives. For example, the scheme detected all 9 real-world non-race bugs that were tested as well as over 90% of injected non-race bugs. Additionally, this paper presents an efficient hardware architecture that supports the proposed scheme with minor hardware changes and a small amount of additional state - a 9-KB buffer per core and a 1-bit tag per data cache block. The hardware-based scheme could still detect all 9 real-world bugs that were tested and more than 84% of the injected non-race bugs. Moreover, the hardware-supported scheme has a negligible impact on performance, with a 0.23% slowdown on average.
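A minimal software sketch of the heuristic, not the proposed hardware: record the read and write sets of completed critical sections, then flag cross-thread pairs under the same lock where one section writes a location the other touches, since the resulting shared state depends on execution order. Such a condition can also flag benign, commutative updates, consistent with the small number of false positives the paper reports.

```python
from collections import namedtuple

CS = namedtuple("CS", "thread lock reads writes")
cs_log = []

def record_cs(thread, lock, reads, writes):
    """Record the shared locations a completed critical section touched."""
    cs_log.append(CS(thread, lock, frozenset(reads), frozenset(writes)))

def order_sensitive_pairs():
    """Flag pairs from different threads, under the same lock, where one
    writes a location the other reads or writes: the final shared state
    then depends on which section executed first."""
    flagged = []
    for i, a in enumerate(cs_log):
        for j in range(i + 1, len(cs_log)):
            b = cs_log[j]
            if (a.thread != b.thread and a.lock == b.lock
                    and (a.writes & (b.reads | b.writes)
                         or b.writes & a.reads)):
                flagged.append((i, j))
    return flagged

# Two threads both read-modify-write `balance`: the final value depends on
# which section runs first, so the pair is flagged for inspection.
record_cs("T1", "acct_lock", reads={"balance"}, writes={"balance"})
record_cs("T2", "acct_lock", reads={"balance"}, writes={"balance"})
print(order_sensitive_pairs())  # [(0, 1)]
```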
Citations: 8
QuickSAN: a storage area network for fast, distributed, solid state disks
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485962
Adrian M. Caulfield, S. Swanson
Solid State Disks (SSDs) based on flash and other non-volatile memory technologies reduce storage latencies from 10s of milliseconds to 10s or 100s of microseconds, transforming previously inconsequential storage overheads into performance bottlenecks. This problem is especially acute in storage area network (SAN) environments where complex hardware and software layers (distributed file systems, block servers, network stacks, etc.) lie between applications and remote data. These layers can add hundreds of microseconds to requests, obscuring the performance of both flash memory and faster, emerging non-volatile memory technologies. We describe QuickSAN, a SAN prototype that eliminates most software overheads and significantly reduces hardware overheads in SANs. QuickSAN integrates a network adapter into SSDs, so the SSDs can communicate directly with one another to service storage accesses as quickly as possible. QuickSAN can also give applications direct access to both local and remote data without operating system intervention, further reducing software costs. Our evaluation of QuickSAN demonstrates remote access latencies of 20 μs for 4 KB requests, bandwidth improvements of as much as 163x for small accesses compared with an equivalent iSCSI implementation, and 2.3-3.0x application level speedup for distributed sorting. We also show that QuickSAN improves energy efficiency by up to 96% and that QuickSAN's networking connectivity allows for improved cluster-level energy efficiency under varying load.
Citations: 55
A new perspective for efficient virtual-cache coherence
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485968
S. Kaxiras, Alberto Ros
Coherent shared virtual memory (cSVM) is highly coveted for heterogeneous architectures as it will simplify programming across different cores and manycore accelerators. In this context, virtual L1 caches can be used to great advantage, e.g., saving energy consumption by eliminating address translation for hits. Unfortunately, multicore virtual-cache coherence is complex and costly because it requires reverse translation for any coherence request directed towards a virtual L1. The reason is the ambiguity of the virtual address due to the possibility of synonyms. In this paper, we take a radically different approach from all prior work, which focuses on reverse translation. We examine the problem from the perspective of the coherence protocol. We show that if a coherence protocol adheres to certain conditions, it operates effortlessly with virtual caches, without requiring reverse translations even in the presence of synonyms. We show that these conditions hold in a new class of simple and efficient request-response protocols that use both self-invalidation and self-downgrade. This results in a new solution for virtual-cache coherence, significantly less complex and more efficient than prior proposals. We study design choices for TLB placement under our proposal and compare them against those under a directory-MESI protocol. Our approach allows for particularly effective choices, such as combining all per-core TLBs into a single logical TLB in front of the last-level cache. Significant area, energy, and performance benefits ensue as a result of simplifying the entire multicore memory organization.
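To illustrate the protocol class, here is a toy Python model of self-downgrade at release and self-invalidation at acquire for data-race-free code. Because no directory ever sends requests to an L1, no reverse (physical-to-virtual) translation is needed even with virtual L1 caches. This is a conceptual sketch, not the paper's protocol.

```python
class L1:
    """Toy private cache: no directory, no invalidation messages. Coherence
    for data-race-free code comes from self-downgrade at release and
    self-invalidation at acquire."""
    def __init__(self, shared_llc):
        self.llc = shared_llc   # dict standing in for the shared LLC
        self.lines = {}         # addr -> (value, dirty)

    def read(self, addr):
        if addr not in self.lines:
            self.lines[addr] = (self.llc[addr], False)  # fetch clean copy
        return self.lines[addr][0]

    def write(self, addr, value):
        self.lines[addr] = (value, True)

    def release(self):
        """Self-downgrade: publish this core's dirty data to the LLC."""
        for addr, (value, dirty) in self.lines.items():
            if dirty:
                self.llc[addr] = value
                self.lines[addr] = (value, False)

    def acquire(self):
        """Self-invalidate: drop clean copies so later reads refetch."""
        self.lines = {a: vd for a, vd in self.lines.items() if vd[1]}

llc = {"flag": 0, "data": 0}
producer, consumer = L1(llc), L1(llc)
producer.write("data", 42)
producer.release()            # downgrade before signaling the consumer
consumer.acquire()            # invalidate stale copies after synchronizing
print(consumer.read("data"))  # 42, with no coherence request sent to any L1
```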
Citations: 60
Cooperative boosting: needy versus greedy power management
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485947
Indrani Paul, Srilatha Manne, Manish Arora, W. Bircher, S. Yalamanchili
This paper examines the interaction between thermal management techniques and power boosting in a state-of-the-art heterogeneous processor consisting of a set of CPU and GPU cores. We show that for classes of applications that utilize both the CPU and the GPU, modern boost algorithms that greedily seek to convert thermal headroom into performance can interact with thermal coupling effects between the CPU and the GPU to degrade performance. We first examine the causes of this behavior and explain the interaction between thermal coupling, performance coupling, and workload behavior. Then we propose a dynamic power-management approach called cooperative boosting (CB) to allocate power dynamically between CPU and GPU in a manner that balances thermal coupling against the needs of performance coupling to optimize performance under a given thermal constraint. Through real hardware-based measurements, we evaluate CB against a state-of-the-practice boost algorithm and show that overall application performance and power savings increase by 10% and 8% (up to 52% and 34%), respectively, resulting in average energy efficiency improvement of 25% (up to 76%) over a wide range of benchmarks.
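The contrast between greedy and cooperative policies can be sketched as a simple control loop that shifts power toward the performance-coupled bottleneck while respecting the shared thermal limit. All constants and sensor inputs below are invented; the real algorithm runs in firmware against measured thermal-coupling behavior.

```python
TDP_WATTS = 45.0   # shared package power budget (illustrative)
STEP_WATTS = 1.0

def rebalance(cpu_w, gpu_w, temp_c, temp_limit_c, cpu_is_bottleneck):
    """One step of a needy-not-greedy policy: below the thermal limit, grant
    headroom to the unit the workload is waiting on; at the limit, throttle
    the larger heat contributor instead of boosting further."""
    if temp_c >= temp_limit_c:
        if cpu_w >= gpu_w:
            cpu_w -= STEP_WATTS    # thermal coupling: cool the hot spot
        else:
            gpu_w -= STEP_WATTS
    elif cpu_w + gpu_w + STEP_WATTS <= TDP_WATTS:
        if cpu_is_bottleneck:
            cpu_w += STEP_WATTS    # performance coupling: feed the needy unit
        else:
            gpu_w += STEP_WATTS
    return cpu_w, gpu_w

# One tick of the loop for a GPU-bound phase with thermal headroom:
print(rebalance(15.0, 20.0, 68.0, 80.0, cpu_is_bottleneck=False))  # (15.0, 21.0)
```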
Citations: 53
DNA-based molecular architecture with spatially localized components
Pub Date: 2013-06-23 DOI: 10.1145/2485922.2485938
Richard A. Muscat, K. Strauss, L. Ceze, Georg Seelig
Performing computation inside living cells offers life-changing applications, from improved medical diagnostics to better cancer therapy to intelligent drugs. Due to its bio-compatibility and ease of engineering, one promising approach for performing in-vivo computation is DNA strand displacement. This paper introduces computer architects to DNA strand displacement "circuits", discusses associated architectural challenges, and proposes a new organization that provides practical composability. In particular, prior approaches rely mostly on stochastic interaction of freely diffusing components. This paper proposes practical spatial isolation of components, leading to more easily designed DNA-based circuits. DNA nanotechnology is currently at a turning point, with many proposed applications being realized [20, 9]. We believe that it is time for the computer architecture community to take notice and contribute.
Citations: 42