
2015 International Conference on Parallel Architecture and Compilation (PACT): Latest Publications

Load Balancing in Decoupled Look-ahead: A Do-It-Yourself (DIY) Approach
Raj Parihar, Michael C. Huang
Despite the proliferation of multi-core and multi-threaded architectures, exploiting implicit parallelism for a single semantic thread is still a crucial component in achieving high performance. Lookahead is a "tried-and-true" strategy in uncovering implicit parallelism. However, a conventional, monolithic out-of-order core quickly becomes resource-inefficient when looking beyond a small distance. One general approach to mitigate the impact of branch mispredictions and cache misses is to enable deep look-ahead. A particular approach that is both flexible and effective is to use an independent, decoupled look-ahead thread on a separate thread context, guided by a program slice known as a skeleton. While capable of generating significant performance gains, the look-ahead agent often becomes the new speed limit. We propose to accelerate the look-ahead thread by skipping branch-based, side-effect-free code modules that do not contribute to the effectiveness of look-ahead. We call them Do-It-Yourself (DIY) branches: for these, the main thread gets no help from the look-ahead thread and instead relies on its own branch predictor and prefetcher. By skipping DIY branches, the look-ahead thread propels ahead and provides performance-critical assistance downstream, improving the performance of the decoupled look-ahead system by up to 15%.
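To make the DIY classification concrete, the sketch below shows how a skeleton-construction pass might mark a branch as DIY when its backward slice is side-effect free, feeds no prefetches, and the main core already predicts it accurately. The BranchInfo fields and the accuracy threshold are illustrative assumptions, not the paper's actual profiling infrastructure.

    #include <cstdint>
    #include <iostream>
    #include <vector>

    // Hypothetical per-branch profile gathered while constructing the skeleton.
    struct BranchInfo {
        uint64_t pc;
        bool     slice_has_side_effects;    // stores/calls reachable from the slice
        bool     slice_feeds_prefetch;      // slice computes addresses of delinquent loads
        double   main_thread_pred_accuracy; // accuracy of the main core's own predictor
    };

    // A branch is "DIY" when the look-ahead thread adds nothing the main thread
    // cannot do for itself: its slice is side-effect free, it feeds no prefetches,
    // and the main core already predicts it well.
    bool is_diy(const BranchInfo& b, double accuracy_threshold = 0.95) {
        return !b.slice_has_side_effects &&
               !b.slice_feeds_prefetch &&
               b.main_thread_pred_accuracy >= accuracy_threshold;
    }

    int main() {
        std::vector<BranchInfo> profile = {
            {0x400a10, false, false, 0.99},  // candidate: skip from the skeleton
            {0x400b24, false, true,  0.99},  // feeds prefetches: keep
            {0x400c38, true,  false, 0.80},  // has side effects: keep
        };
        for (const auto& b : profile)
            std::cout << std::hex << b.pc << (is_diy(b) ? " -> DIY (skip)\n" : " -> keep\n");
    }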
DOI: 10.1109/PACT.2015.55
Citations: 0
Fine Grain Cache Partitioning Using Per-Instruction Working Blocks
Jason Jong Kyu Park, Yongjun Park, S. Mahlke
A traditional least-recently used (LRU) cache replacement policy fails to achieve the performance of the optimal replacement policy when cache blocks with diverse reuse characteristics interfere with each other. When multiple applications share a cache, it is often partitioned among the applications because cache blocks show similar reuse characteristics within each application. In this paper, we extend the idea to a single application by viewing a cache as a shared resource between individual memory instructions. To that end, we propose Instruction-based LRU (ILRU), a fine-grain cache partitioning scheme that way-partitions individual cache sets based on per-instruction working blocks, which are the cache blocks required by an instruction to satisfy all the reuses within a set. In ILRU, a memory instruction steals a block from another only when it requires more blocks than it currently has. Otherwise, a memory instruction selects a victim among the cache blocks it inserted itself. Experiments show that ILRU can improve cache performance at all levels of the hierarchy, reducing the number of misses by an average of 7.0% for L1, 9.1% for L2, and 8.7% for L3, which results in a geometric mean performance improvement of 5.3%. ILRU for a three-level cache hierarchy imposes a modest 1.3% storage overhead over the total cache size.
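The victim-selection rule described above can be sketched for a single cache set as follows; the structures and the working-block count passed in are simplified stand-ins for the per-instruction state the paper maintains in hardware.

    #include <cstdint>
    #include <vector>

    // One block in a set: which instruction (PC) inserted it and its LRU age.
    struct Block { uint64_t owner_pc; unsigned lru_age; };

    // Illustrative ILRU victim selection for one set. `need` is the missing
    // instruction's working-block count, i.e. how many blocks it needs resident
    // to capture all of its reuse in this set.
    int choose_victim(const std::vector<Block>& set, uint64_t pc, unsigned need) {
        unsigned owned = 0;
        for (const auto& b : set)
            if (b.owner_pc == pc) ++owned;

        int victim = -1;
        unsigned oldest = 0;
        if (owned >= need) {
            // The instruction already holds enough ways: victimize among its own blocks.
            for (int i = 0; i < (int)set.size(); ++i)
                if (set[i].owner_pc == pc && set[i].lru_age >= oldest) {
                    oldest = set[i].lru_age; victim = i;
                }
        }
        if (victim < 0) {
            // Otherwise it may steal the globally least-recently-used block.
            for (int i = 0; i < (int)set.size(); ++i)
                if (set[i].lru_age >= oldest) { oldest = set[i].lru_age; victim = i; }
        }
        return victim;
    }

    int main() {
        std::vector<Block> set = { {0xA, 3}, {0xA, 1}, {0xB, 2}, {0xB, 0} };
        // PC 0xA misses but needs only 2 blocks: it evicts its own oldest block (index 0).
        return choose_victim(set, 0xA, 2) == 0 ? 0 : 1;
    }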
DOI: 10.1109/PACT.2015.11
Citations: 4
Brain-Inspired Computing
D. Modha
Summary form only given. I will describe a decade-long, multi-disciplinary, multi-institutional effort spanning neuroscience, supercomputing, and nanotechnology to build and demonstrate a brain-inspired computer and describe the architecture, programming model, and applications. I will also describe future efforts to build, literally, "brain-in-a-box". For more information, see: modha.org.
DOI: 10.1109/PACT.2015.49
Citations: 3
NVMMU: A Non-volatile Memory Management Unit for Heterogeneous GPU-SSD Architectures
Jie Zhang, D. Donofrio, J. Shalf, M. Kandemir, Myoungsoo Jung
Thanks to massive parallelism in modern Graphics Processing Units (GPUs), emerging data processing applications in GPU computing exhibit ten-fold speedups compared to CPU-only systems. However, this GPU-based acceleration is limited in many cases by the significant data movement overheads and inefficient memory management for host-side storage accesses. To address these shortcomings, this paper proposes a non-volatile memory management unit (NVMMU) that reduces the file data movement overheads by directly connecting the Solid State Disk (SSD) to the GPU. We implemented our proposed NVMMU on real hardware with commercially available GPU and SSD devices, considering different types of storage interfaces and configurations. In this work, NVMMU unifies two discrete software stacks (one for the SSD and the other for the GPU) in two major ways. While a new interface provided by our NVMMU directly forwards file data between the GPU runtime library and the I/O runtime library, it supports non-volatile direct memory access (NDMA) that pairs those GPU and SSD devices via physically shared system memory blocks. This unification in turn can eliminate unnecessary user/kernel-mode switching, improve memory management, and remove data copy overheads. Our evaluation results demonstrate that NVMMU can reduce the overheads of file data movement by 95% on average, improving overall system performance by 78% compared to a conventional IOMMU approach.
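The two data paths being contrasted can be sketched as follows. The helper functions are placeholders that only model the flow of data (they are not a real driver or runtime API); the point is that the NDMA path moves file data once, SSD to GPU, instead of bouncing through host DRAM.

    #include <cstddef>
    #include <cstdio>
    #include <vector>

    // Stand-in buffers; in a real system these would be SSD blocks, host DRAM,
    // and GPU device memory.
    using Buffer = std::vector<char>;

    // --- Stubs that model the data paths (placeholders, not a real API) ---
    void ssd_read_to_host(Buffer& host, size_t bytes)   { host.assign(bytes, 0); }
    void host_to_gpu_copy(Buffer& gpu, const Buffer& h) { gpu = h; }
    void ndma_ssd_to_gpu(Buffer& gpu, size_t bytes)     { gpu.assign(bytes, 0); }

    // Conventional path: two transfers and a host-side bounce buffer.
    void load_conventional(Buffer& gpu, size_t bytes) {
        Buffer bounce;
        ssd_read_to_host(bounce, bytes);   // SSD -> host DRAM (file system + driver)
        host_to_gpu_copy(gpu, bounce);     // host DRAM -> GPU memory
    }

    // NVMMU-style path: the SSD and GPU are paired over shared memory blocks,
    // so file data moves once and the host bounce buffer disappears.
    void load_ndma(Buffer& gpu, size_t bytes) {
        ndma_ssd_to_gpu(gpu, bytes);       // SSD -> GPU via non-volatile DMA
    }

    int main() {
        Buffer gpu;
        load_conventional(gpu, 1 << 20);
        load_ndma(gpu, 1 << 20);
        std::printf("moved %zu bytes per path\n", gpu.size());
    }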
DOI: 10.1109/PACT.2015.43
Citations: 38
Parallel Methods for Verifying the Consistency of Weakly-Ordered Architectures
Adam McLaughlin, D. Merrill, M. Garland, David A. Bader
Contemporary microprocessors use relaxed memory consistency models to allow for aggressive optimizations in hardware. This enhancement in performance comes at the cost of design complexity and verification effort. In particular, verifying an execution of a program against its system's memory consistency model is an NP-complete problem. Several graph-based approximations to this problem based on carefully constructed randomized test programs have been proposed in the literature; however, such approaches are sequential and execute slowly on large graphs of interest. Unfortunately, the ability to execute larger tests is tremendously important, since such tests enable one to expose bugs more quickly. Successfully executing more tests per unit time is also desirable, since it allows one to check for a greater variety of errors in the memory subsystem by utilizing a more diverse set of tests. This paper improves upon existing work by introducing an algorithm that not only reduces the time complexity of the verification process, but also facilitates the development of parallel algorithms for solving these problems. We first show performance improvements from a sequential approach and gain further performance from parallel implementations in OpenMP and CUDA. For large tests of interest, our GPU implementation achieves an average application speedup of 26.36x over existing techniques in use at NVIDIA.
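The graph-based check the paper builds on reduces to cycle detection over a constraint graph of memory operations; a cycle means the observed execution cannot be ordered under the model. A minimal sequential sketch is shown below (the paper's contribution is restructuring and parallelizing this kind of check with OpenMP and CUDA, which is not reproduced here).

    #include <iostream>
    #include <queue>
    #include <utility>
    #include <vector>

    // The constraint graph collects ordering edges over memory operations
    // (program order, reads-from, coherence order, ...). If the graph is
    // acyclic, the observed execution can be explained by the model.
    bool is_acyclic(int n, const std::vector<std::pair<int,int>>& edges) {
        std::vector<std::vector<int>> adj(n);
        std::vector<int> indeg(n, 0);
        for (auto [u, v] : edges) { adj[u].push_back(v); ++indeg[v]; }

        std::queue<int> ready;                    // Kahn's topological sort
        for (int v = 0; v < n; ++v)
            if (indeg[v] == 0) ready.push(v);

        int visited = 0;
        while (!ready.empty()) {
            int u = ready.front(); ready.pop(); ++visited;
            for (int v : adj[u])
                if (--indeg[v] == 0) ready.push(v);
        }
        return visited == n;                      // all ordered -> no cycle
    }

    int main() {
        // Four operations whose ordering constraints form a cycle: a violation.
        std::vector<std::pair<int,int>> edges = { {0,1}, {1,2}, {2,3}, {3,0} };
        std::cout << (is_acyclic(4, edges) ? "consistent\n" : "violation: cycle found\n");
    }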
DOI: 10.1109/PACT.2015.18
Citations: 4
A Software-Managed Approach to Die-Stacked DRAM
M. Oskin, G. Loh
Advances in die-stacking (3D) technology have enabled the tight integration of significant quantities of DRAM with high-performance computation logic. How to integrate this technology into the overall architecture of a computing system is an open question. While much recent effort has focused on hardware-based techniques for using die-stacked memory (e.g., caching), in this paper we explore what it takes for a software-driven approach to be effective. First, we consider exposing die-stacked DRAM directly to applications, relying on the static partitioning of allocations between fast on-chip and slow off-chip DRAM. We see only marginal benefits from this approach (9% speedup). Next, we explore OS-based page caches that dynamically partition application memory, but we find such approaches to be worse than not having stacked DRAM at all! We analyze the performance bottlenecks in OS page caches and propose two simple techniques that make the OS approach viable. The first is a hardware-assisted TLB shoot-down, a more general mechanism that is valuable beyond stacked DRAM, which enables OS-managed page caches to achieve a 27% speedup; the second is a software-implemented prefetcher that extends classic hardware prefetching algorithms to the page level, leading to a 39% speedup. With these simple and lightweight components, the OS page cache can provide 70% of the performance benefit that would be achievable with an ideal and unrealistic system where all of main memory is die-stacked. However, we also found that applications with poor locality (e.g., graph analyses) are not amenable to any page-caching schemes -- whether hardware or software -- and therefore we recommend that the system still provide APIs to the application layers to explicitly control die-stacked DRAM allocations.
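A toy version of the software-managed policy might look like the following: the OS counts touches to off-chip pages, migrates hot pages into the stacked region, and issues a page-granularity sequential prefetch in the spirit of the paper's software prefetcher. The capacity, hotness threshold, and prefetch degree are made-up parameters for illustration only.

    #include <cstdint>
    #include <iostream>
    #include <unordered_map>
    #include <unordered_set>

    // Toy model of an OS-managed page cache for die-stacked DRAM.
    class StackedPageCache {
        std::unordered_set<uint64_t> on_chip;        // pages resident in fast DRAM
        std::unordered_map<uint64_t, int> touches;   // per-page access counts
        size_t capacity;
        int hot_threshold;
    public:
        StackedPageCache(size_t cap, int thr) : capacity(cap), hot_threshold(thr) {}

        bool access(uint64_t page) {
            bool hit = on_chip.count(page) > 0;
            if (!hit && ++touches[page] >= hot_threshold && on_chip.size() < capacity) {
                on_chip.insert(page);                // migrate: slow -> fast DRAM
                if (on_chip.size() < capacity)
                    on_chip.insert(page + 1);        // software page-level prefetch
            }
            return hit;
        }
    };

    int main() {
        StackedPageCache cache(/*cap=*/1024, /*thr=*/2);
        int hits = 0;
        for (int pass = 0; pass < 3; ++pass)
            for (uint64_t p = 0; p < 64; ++p)
                hits += cache.access(p);
        std::cout << "hits after warm-up: " << hits << "\n";
    }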
DOI: 10.1109/PACT.2015.30
Citations: 57
MeToo: Stochastic Modeling of Memory Traffic Timing Behavior
Yipeng Wang, Ganesh Balakrishnan, Yan Solihin
The memory subsystem (memory controller, bus, and DRAM) is becoming a bottleneck in computer system performance. Optimizing the design of the multicore memory subsystem requires a good understanding of the representative workload. A common practice in designing the memory subsystem is to rely on trace simulation. However, the conventional method of relying on traditional traces faces two major challenges. First, many software users are apprehensive about sharing their code (source or binaries) due to the proprietary nature of the code or secrecy of data, so representative traces are sometimes not available. Second, there is a feedback loop where memory performance affects processor performance, which in turn alters the timing of memory requests that reach the bus. Such a feedback loop is difficult to capture with traces. In this paper, we present MeToo, a framework for generating synthetic memory traffic for memory subsystem design exploration. MeToo uses a small set of statistics that summarizes the performance behavior of the original applications, and generates synthetic traces or executables stochastically, allowing applications to remain proprietary. MeToo uses novel methods for mimicking the memory feedback loop. We validate MeToo clones and show a very good fit with the original applications' behavior, with an average error of only 4.2%, which is a small fraction of the errors obtained using geometric inter-arrival (commonly used in queueing models) and uniform inter-arrival.
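The core idea of driving a synthetic trace from application-specific statistics rather than a generic distribution can be illustrated as below; the histogram values are invented, and the feedback between memory latency and request timing that MeToo models is deliberately omitted.

    #include <iostream>
    #include <random>
    #include <vector>

    // Minimal illustration of trace synthesis from summary statistics: draw
    // request inter-arrival times from an empirical histogram of the original
    // application instead of a geometric or uniform model.
    int main() {
        // Bucketed inter-arrival histogram (cycles -> observed frequency); the
        // kind of compact statistic that can be shared without code or binaries.
        std::vector<int>    bucket_cycles = {1, 4, 16, 64, 256};
        std::vector<double> frequency     = {0.40, 0.25, 0.20, 0.10, 0.05};

        std::mt19937 rng(42);
        std::discrete_distribution<int> pick(frequency.begin(), frequency.end());

        long long cycle = 0;
        for (int i = 0; i < 10; ++i) {               // emit a short synthetic trace
            cycle += bucket_cycles[pick(rng)];
            std::cout << "request " << i << " issued at cycle " << cycle << "\n";
        }
    }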
DOI: 10.1109/PACT.2015.36
Citations: 10
An Efficient, Self-Contained, On-chip Directory: DIR1-SISD
Mahdad Davari, Alberto Ros, Erik Hagersten, S. Kaxiras
Directory-based cache coherence is the de-facto standard for scalable shared-memory multi/many-cores and significant effort is invested in reducing its overhead. However, directory area and complexity optimizations are often antithetical to each other. Novel directory-less coherence schemes have been introduced to remove the complexity and cost associated with directories in their entirety. However, such schemes introduce new challenges by transferring some of the directory complexity and functionality to the OS and using the page table and the TLBs to store data classification information. In this work we bridge the gap between directory-based and directory-less coherence schemes and propose a hybrid scheme called DIR1-SISD which employs self-invalidation and self-downgrade as directory policies for the shared entries. DIR1-SISD allows simultaneous optimizations in area and complexity without relying on the OS. DIR1-SISD keeps track of a single -- private -- owner, or allows multiple-readers-multiple-writers to exist simultaneously by transferring the responsibility for their coherence to the corresponding cores. A DIR1-SISD self-contained directory cache has a unique ability to minimize eviction-induced complexities by allowing directory entries to be evicted without maintaining inclusion with the cached data (thus avoiding the complexities of broadcasts) and without the need to have a backing store. Using simulation we show that a small, self-contained, DIR1-SISD cache outperforms a traditional DIR16-NB MESI protocol with a directory cache embedded in the LLC (8% in execution time and 15% in traffic) and, further, outperforms a SISD protocol that relies on the OS to provide a persistent page-based directory (4% in execution time and 20% in traffic).
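A rough sketch of the hybrid directory entry is given below: one private owner is tracked exactly, and as soon as a second core touches the line the entry falls back to a shared mode in which coherence relies on the cores' self-invalidation and self-downgrade. The state machine is a simplification for illustration, not the protocol's full specification.

    #include <iostream>
    #include <optional>

    // Illustrative DIR1-SISD entry: either one private owner is tracked exactly,
    // or the line is marked shared/SISD and coherence becomes the cores' own job
    // (self-invalidate stale copies and self-downgrade dirty data at
    // synchronization points), so no sharer vector is needed.
    struct Dir1SisdEntry {
        enum class State { Uncached, Private, SharedSISD } state = State::Uncached;
        std::optional<int> owner;                    // valid only in Private

        void on_request(int core) {
            switch (state) {
            case State::Uncached:
                state = State::Private; owner = core;  // first requester becomes owner
                break;
            case State::Private:
                if (owner != core) {                   // a second core arrives:
                    state = State::SharedSISD;         // stop tracking sharers and
                    owner.reset();                     // rely on self-invalidation
                }
                break;
            case State::SharedSISD:
                break;                                 // nothing further to record
            }
        }
    };

    int main() {
        Dir1SisdEntry e;
        e.on_request(0);                               // core 0: private owner
        e.on_request(0);                               // same core: still private
        e.on_request(3);                               // core 3: fall back to SISD
        std::cout << (e.state == Dir1SisdEntry::State::SharedSISD ? "SISD\n" : "private\n");
    }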
DOI: 10.1109/PACT.2015.23
Citations: 14
Tardis: Time Traveling Coherence Algorithm for Distributed Shared Memory
Xiangyao Yu, S. Devadas
A new memory coherence protocol, Tardis, is proposed. Tardis uses timestamp counters representing logical time as well as physical time to order memory operations and enforce sequential consistency in any type of shared memory system. Tardis is unique in that, compared to the widely adopted directory coherence protocol and its variants, it completely avoids multicasting and requires only O(log N) storage per cache block for an N-core system rather than O(N) sharer information. Tardis is simpler and easier to reason about, yet achieves similar performance to directory protocols on a wide range of benchmarks run on 16, 64 and 256 cores.
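The timestamp mechanics can be sketched for a single block as follows, under the usual simplifications: a read extends the block's lease, and a write advances the writer's logical time past every outstanding lease instead of sending invalidations. The lease length and the structures are illustrative, not the protocol's full definition.

    #include <algorithm>
    #include <cstdint>
    #include <iostream>

    // Simplified single-block model of Tardis-style timestamp coherence.
    // Each block keeps a write timestamp (wts) and a read timestamp / lease end
    // (rts); each core keeps a program timestamp (pts). Ordering is enforced in
    // logical time, so a writer never multicasts invalidations; it simply jumps
    // past the latest lease, and the per-block state is O(log N) timestamp bits
    // rather than an O(N) sharer vector.
    struct Block { uint64_t wts = 0, rts = 0; };
    struct Core  { uint64_t pts = 0; };

    constexpr uint64_t kLease = 10;                   // lease length (logical ticks)

    void do_read(Core& c, Block& b) {
        c.pts = std::max(c.pts, b.wts);               // observe the write's time
        b.rts = std::max(b.rts, c.pts + kLease);      // extend the read lease
    }

    void do_write(Core& c, Block& b) {
        c.pts = std::max(c.pts, b.rts + 1);           // move past every reader's lease
        b.wts = b.rts = c.pts;                        // the write defines a new epoch
    }

    int main() {
        Block x;
        Core c0, c1;
        do_read(c0, x);                               // c0 leases x until t=10
        do_write(c1, x);                              // c1 writes at t=11, no invalidation sent
        std::cout << "c0.pts=" << c0.pts << " c1.pts=" << c1.pts
                  << " x.wts=" << x.wts << " x.rts=" << x.rts << "\n";
    }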
DOI: 10.1109/PACT.2015.12
Citations: 19
Decoupled Direct Memory Access: Isolating CPU and IO Traffic by Leveraging a Dual-Data-Port DRAM
Donghyuk Lee, Lavanya Subramanian, Rachata Ausavarungnirun, Jongmoo Choi, O. Mutlu
Memory channel contention is a critical performance bottleneck in modern systems that have highly parallelized processing units operating on large data sets. The memory channel is contended not only by requests from different user applications (CPU access) but also by system requests for peripheral data (IO access), usually controlled by Direct Memory Access (DMA) engines. Our goal, in this work, is to improve system performance by eliminating memory channel contention between CPU accesses and IO accesses. To this end, we propose a hardware-software cooperative data transfer mechanism, Decoupled DMA (DDMA), that provides a specialized low-cost memory channel for IO accesses. In our DDMA design, main memory has two independent data channels, of which one is connected to the processor (CPU channel) and the other to the IO devices (IO channel), enabling CPU and IO accesses to be served on different channels. System software or the compiler identifies which requests should be handled on the IO channel and communicates this to the DDMA engine, which then initiates the transfers on the IO channel. By doing so, our proposal increases the effective memory channel bandwidth, thereby either accelerating data transfers between system components, or providing opportunities to employ IO performance enhancement techniques (e.g., aggressive IO prefetching) without interfering with CPU accesses. We demonstrate the effectiveness of our DDMA framework in two scenarios: (i) CPU-GPU communication and (ii) in-memory communication (bulk data copy/initialization within the main memory). By effectively decoupling accesses for CPU-GPU communication and in-memory communication from CPU accesses, our DDMA-based design achieves significant performance improvement across a wide variety of system configurations (e.g., 20% average performance improvement on a typical 2-channel 2-rank memory system).
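A minimal model of the software-directed channel split is sketched below; the descriptor format and engine are hypothetical stand-ins, intended only to show how software marks a transfer for the CPU port or the IO port so that bulk IO traffic stays off the CPU channel.

    #include <cstdint>
    #include <cstdio>

    // Toy model of channel selection under DDMA: system software (or the
    // compiler) marks each transfer for the CPU data port or the IO data port
    // of the dual-port DRAM, so bulk IO traffic never contends with CPU demand
    // accesses. The descriptor layout and engine are placeholders, not the
    // paper's hardware interface.
    enum class Port { CPU, IO };

    struct Descriptor {
        uint64_t src, dst, bytes;
        Port     port;                                // which DRAM data port to use
    };

    struct DdmaEngine {
        uint64_t cpu_bytes = 0, io_bytes = 0;
        void issue(const Descriptor& d) {
            (d.port == Port::CPU ? cpu_bytes : io_bytes) += d.bytes;
        }
    };

    int main() {
        DdmaEngine engine;
        // A demand miss fill stays on the CPU channel...
        engine.issue({0x1000, 0x2000, 64, Port::CPU});
        // ...while a bulk SSD-to-memory prefetch is routed to the IO channel.
        engine.issue({0x0, 0x100000, 1 << 20, Port::IO});
        std::printf("CPU channel: %llu B, IO channel: %llu B\n",
                    (unsigned long long)engine.cpu_bytes,
                    (unsigned long long)engine.io_bytes);
    }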
DOI: 10.1109/PACT.2015.51
Citations: 114