A. Srivatsa, Sven Rheindt, Thomas Wild, A. Herkersdorf
{"title":"Region based cache coherence for tiled MPSoCs","authors":"A. Srivatsa, Sven Rheindt, Thomas Wild, A. Herkersdorf","doi":"10.1109/SOCC.2017.8226059","DOIUrl":null,"url":null,"abstract":"The need for faster and more energy efficient computing has led us to the multicore era with distributed shared memory hierarchies. The primary goal is to distribute parallel tasks onto multiple processing elements to collectively achieve shorter execution times at lower frequencies and supply voltages when compared to a single-core architecture. Major challenges of this approach are how to achieve local, low latency memory accesses and low overheads for coherence and synchronization management. We believe that enabling global coherence in tiled many-core architectures does not scale in a cost efficient manner and isn't even required for applications with limited degrees of parallelism. In this paper, we propose a novel region based cache coherence scheme, where coherence is provided by hardware directories within a flexibly sized but confined set of compute and memory tiles. We also show that data placement and task mapping have a huge impact on the application performance, and hence should be considered in conjunction with region based coherence. The approach is evaluated by means of a high level simulation model using workloads from PARSEC. Experiments demonstrate that our region based approach with multiple compute tiles increases performance by a factor of up to 2.5 compared to a single tile structure with nominally identical computing and memory resources. Thus the independent local memory accesses, which are effectively increasing the memory bandwidth, usually outweigh the penalties of inter-tile remote memory accesses. Our approach also reduces the directory structures significantly compared to traditional schemes, making it scalable for large MPSoCs (eg. by 41.4% for a 16 tile system with 4 tiles per region). Considering data-to-task-placement, our investigations show that it can lead to performance variations up to a factor of 12.7.","PeriodicalId":366264,"journal":{"name":"2017 30th IEEE International System-on-Chip Conference (SOCC)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2017-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2017 30th IEEE International System-on-Chip Conference (SOCC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SOCC.2017.8226059","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 6
Abstract
The need for faster and more energy efficient computing has led us to the multicore era with distributed shared memory hierarchies. The primary goal is to distribute parallel tasks onto multiple processing elements to collectively achieve shorter execution times at lower frequencies and supply voltages when compared to a single-core architecture. Major challenges of this approach are how to achieve local, low latency memory accesses and low overheads for coherence and synchronization management. We believe that enabling global coherence in tiled many-core architectures does not scale in a cost efficient manner and isn't even required for applications with limited degrees of parallelism. In this paper, we propose a novel region based cache coherence scheme, where coherence is provided by hardware directories within a flexibly sized but confined set of compute and memory tiles. We also show that data placement and task mapping have a huge impact on the application performance, and hence should be considered in conjunction with region based coherence. The approach is evaluated by means of a high level simulation model using workloads from PARSEC. Experiments demonstrate that our region based approach with multiple compute tiles increases performance by a factor of up to 2.5 compared to a single tile structure with nominally identical computing and memory resources. Thus the independent local memory accesses, which are effectively increasing the memory bandwidth, usually outweigh the penalties of inter-tile remote memory accesses. Our approach also reduces the directory structures significantly compared to traditional schemes, making it scalable for large MPSoCs (eg. by 41.4% for a 16 tile system with 4 tiles per region). Considering data-to-task-placement, our investigations show that it can lead to performance variations up to a factor of 12.7.