Exploiting Staleness for Approximating Loads on CMPs

2015 International Conference on Parallel Architecture and Compilation (PACT). Pub Date: 2015-10-18. DOI: 10.1109/PACT.2015.27

Prasanna Venkatesh Rengasamy, A. Sivasubramaniam, M. Kandemir, C. Das

Citations: 10

Abstract

Coherence misses are an important factor limiting the scalability of multi-threaded shared-memory applications on chip multiprocessors (CMPs), which are expected to contain dozens of cores in the near future. This paper proposes a novel approach to this problem that leverages the increasingly important paradigm of approximate computing. Many applications either tolerate slight errors in their output or, where the output requirements are stringent, have built-in resiliency that absorbs some errors during execution. The approximate computing paradigm suggests relaxing the conventional requirement of strict correctness in hardware, allowing more flexibility in the performance-power-reliability design space. Examining the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all of them have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version of the data at the time of the invalidation. We show that returning such values from invalidated lines still present in d-L1 offers only limited scope for improvement, since those lines are evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8-line) Stale Victim Cache (SVC) to hold such lines upon d-L1 eviction. While this offers significant improvement, data can become very stale in such a structure, making it highly sensitive to the choice of which data to keep and for how long. To address these concerns, we propose a mechanism called SVC+TB, which times out lines from the SVC to limit their staleness. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup of 10% to 15% across the entire suite, which is comparable to an ideal execution that incurs no coherence misses. Further, the resulting approximations have little impact on correctness, and all applications run to completion. Because of inherent application resilience, eleven applications showed no errors, and the maximum error across the entire suite was 0.08%.
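
The SVC+TB idea described above can be illustrated with a small software model. This is only a sketch under stated assumptions, not the authors' hardware design: the abstract gives the SVC size (8 lines) and says lines are timed out, but it does not specify the replacement policy, the time-out length, or how staleness is measured. The FIFO replacement, the `max_age` value, the cycle-based clock, and all class and method names below are hypothetical.

```python
from collections import OrderedDict
from dataclasses import dataclass


@dataclass
class StaleLine:
    value: int           # data as it was when the line was invalidated
    invalidated_at: int  # cycle at which the invalidation arrived (assumed clock)


class StaleVictimCache:
    """Toy model of a tiny buffer of invalidated lines with a staleness bound."""

    def __init__(self, capacity=8, max_age=1000):
        self.capacity = capacity  # 8 lines, as stated in the abstract
        self.max_age = max_age    # hypothetical time-out, in cycles
        self.lines = OrderedDict()  # addr -> StaleLine, oldest insertion first

    def insert(self, addr, value, invalidated_at):
        """Store an invalidated line when it is evicted from d-L1."""
        if addr in self.lines:
            del self.lines[addr]
        elif len(self.lines) >= self.capacity:
            self.lines.popitem(last=False)  # FIFO replacement (assumption)
        self.lines[addr] = StaleLine(value, invalidated_at)

    def lookup(self, addr, now):
        """Return a stale value for a coherence load miss, or None.

        None means the load must be served by a normal coherent fetch.
        """
        line = self.lines.get(addr)
        if line is None:
            return None
        if now - line.invalidated_at > self.max_age:
            del self.lines[addr]  # timed out: too stale to return
            return None
        return line.value
```

In this model, a load that would otherwise incur a coherence miss first calls `lookup`; a hit returns the approximate (stale) value immediately, while a miss or a timed-out entry falls back to the exact path. Lowering `max_age` trades fewer approximated loads for tighter bounds on staleness.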

