Lluc Alvarez, L. Vilanova, Miquel Moretó, Marc Casas, Marc González, X. Martorell, N. Navarro, E. Ayguadé, M. Valero
{"title":"Coherence protocol for transparent management of scratchpad memories in shared memory manycore architectures","authors":"Lluc Alvarez, L. Vilanova, Miquel Moretó, Marc Casas, Marc González, X. Martorell, N. Navarro, E. Ayguadé, M. Valero","doi":"10.1145/2872887.2750411","DOIUrl":null,"url":null,"abstract":"The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties to the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.","PeriodicalId":6878,"journal":{"name":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","volume":"125 1","pages":"720-732"},"PeriodicalIF":0.0000,"publicationDate":"2015-06-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"32","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2015 ACM/IEEE 42nd Annual International Symposium on Computer Architecture (ISCA)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2872887.2750411","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 32
Abstract
The increasing number of cores in manycore architectures causes important power and scalability problems in the memory subsystem. One solution is to introduce scratchpad memories alongside the cache hierarchy, forming a hybrid memory system. Scratchpad memories are more power-efficient than caches and they do not generate coherence traffic, but they suffer from poor programmability. A good way to hide the programmability difficulties to the programmer is to give the compiler the responsibility of generating code to manage the scratchpad memories. Unfortunately, compilers do not succeed in generating this code in the presence of random memory accesses with unknown aliasing hazards. This paper proposes a coherence protocol for the hybrid memory system that allows the compiler to always generate code to manage the scratchpad memories. In coordination with the compiler, memory accesses that may access stale copies of data are identified and diverted to the valid copy of the data. The proposal allows the architecture to be exposed to the programmer as a shared memory manycore, maintaining the programming simplicity of shared memory models and preserving backwards compatibility. In a 64-core manycore, the coherence protocol adds overheads of 4% in performance, 8% in network traffic and 9% in energy consumption to enable the usage of the hybrid memory system that, compared to a cache-based system, achieves a speedup of 1.14x and reduces on-chip network traffic and energy consumption by 29% and 17%, respectively.