Prasanna Venkatesh Rengasamy, A. Sivasubramaniam, M. Kandemir, C. Das
{"title":"Exploiting Staleness for Approximating Loads on CMPs","authors":"Prasanna Venkatesh Rengasamy, A. Sivasubramaniam, M. Kandemir, C. Das","doi":"10.1109/PACT.2015.27","DOIUrl":null,"url":null,"abstract":"Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.","PeriodicalId":385398,"journal":{"name":"2015 International Conference on Parallel Architecture and Compilation (PACT)","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2015-10-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 International Conference on Parallel Architecture and Compilation (PACT)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/PACT.2015.27","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10
Abstract
Coherence misses are an important factor in limiting the scalability of multi-threaded shared memory applications on chip multiprocessors (CMPs) that are envisaged to contain dozens of cores in the imminent future. This paper proposes a novel approach to tackling this problem by leveraging the growingly important paradigm of approximate computing. Many applications are either tolerant to slight errors in the output or if stringent, have in-built resiliency to tolerate some errors in the execution. The approximate computing paradigm suggests breaking conventional barriers of mandating stringent correctness on the hardware, allowing more flexibility in the performance-power-reliability design space. Taking the multi-threaded applications in the SPLASH-2 benchmark suite, we note that nearly all these applications have such inherent resiliency and/or tolerance to slight errors in the output. Based on this observation, we propose to approximate coherence-related load misses by returning stale values, i.e., the version at the time of the invalidation. We show that returning such values from the invalidated lines already present in d-L1 offers only limited scope for improvement since those lines get evicted fairly soon due to the high pressure on d-L1. Instead, we propose a very small (8 lines) Stale Victim Cache (SVC), to hold such lines upon d-L1 eviction. While this does offer significant improvement, there is the possibility of data getting very stale in such a structure, making it highly sensitive to the choice of what data to keep, and for how long. To address these concerns, we propose to time-out these lines from the SVC to limit their staleness in a mechanism called SVC+TB. We show that SVC+TB provides as much as 28.6% speedup in some SPLASH-2 applications, with an average speedup between 10-15% across the entire suite, becoming comparable to an ideal execution that does not incur coherence misses. Further, the consequent approximations have little impact on the correctness, allowing all of them to complete. There were no errors, because of inherent application resilience, in eleven applications, and the maximum error was at most 0.08% across the entire suite.