Shared-memory multiprocessors have dominated all platforms from high-end to desktop computers. On such platforms, it is well known that the interconnect between the processors and the main memory has become a major bottleneck. The bandwidth-aware job scheduling is an effective and relatively easy-to-implement way to relieve the bandwidth contention. Previous policies understood that bandwidth saturation hurt the throughput of parallel jobs so they scheduled the jobs to let the total bandwidth requirement equal to the system peak bandwidth. However, we found that intra-quantum fine-grained bandwidth contention still happened due to a program's irregular fluctuation in memory access intensity, which is mostly ignored in previous policies. In this paper, we quantify the impact of bandwidth contention on overall performance. We found that concurrent jobs could achieve a higher memory bandwidth utilization at the expense of super-linear performance degradation. Based on such an observation, we proposed a new workload scheduling policy. Its basic idea is that interference due to bandwidth contention could be minimized when bandwidth utilization is maintained at the level of average bandwidth requirement of the workload. Our evaluation is based on both SPEC 2006 and NPB workloads. The evaluation results on randomly generated workloads show that our policy could improve the system throughput by 4.1% on average over the native OS scheduler, and up to 11.7% improvement has been observed.
{"title":"On mitigating memory bandwidth contention through bandwidth-aware scheduling","authors":"Di Xu, Chenggang Wu, P. Yew","doi":"10.1145/1854273.1854306","DOIUrl":"https://doi.org/10.1145/1854273.1854306","url":null,"abstract":"Shared-memory multiprocessors have dominated all platforms from high-end to desktop computers. On such platforms, it is well known that the interconnect between the processors and the main memory has become a major bottleneck. The bandwidth-aware job scheduling is an effective and relatively easy-to-implement way to relieve the bandwidth contention. Previous policies understood that bandwidth saturation hurt the throughput of parallel jobs so they scheduled the jobs to let the total bandwidth requirement equal to the system peak bandwidth. However, we found that intra-quantum fine-grained bandwidth contention still happened due to a program's irregular fluctuation in memory access intensity, which is mostly ignored in previous policies. In this paper, we quantify the impact of bandwidth contention on overall performance. We found that concurrent jobs could achieve a higher memory bandwidth utilization at the expense of super-linear performance degradation. Based on such an observation, we proposed a new workload scheduling policy. Its basic idea is that interference due to bandwidth contention could be minimized when bandwidth utilization is maintained at the level of average bandwidth requirement of the workload. Our evaluation is based on both SPEC 2006 and NPB workloads. The evaluation results on randomly generated workloads show that our policy could improve the system throughput by 4.1% on average over the native OS scheduler, and up to 11.7% improvement has been observed.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"47 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124969246","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exemplified by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful.
{"title":"The paralax infrastructure: Automatic parallelization with a helping hand","authors":"H. Vandierendonck, S. Rul, K. D. Bosschere","doi":"10.1145/1854273.1854322","DOIUrl":"https://doi.org/10.1145/1854273.1854322","url":null,"abstract":"Speeding up sequential programs on multicores is a challenging problem that is in urgent need of a solution. Automatic parallelization of irregular pointer-intensive codes, exemplified by the SPECint codes, is a very hard problem. This paper shows that, with a helping hand, such auto-parallelization is possible and fruitful.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125025466","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper a new architecture, Speculative-Aware Execution (SAE) is presented that employs speculative-awareness as a means of mitigating the drawbacks of speculative execution which are: useless work (uses speculative values so it produces incorrect results or is done on the wrong path) and redundant work (produces results previously obtained). In order to achieve this, SAE tries to partition the dynamic instruction stream into two disjoint parallel threads: A speculative thread that is partially speculative-aware (p-thread) as it records its speculative state and uses it to avoid useless work (using speculative values) but have no account for its control-flow violations; and a fully speculative-aware thread (f-thread) that has full record of p-thread's speculations, and so can steer p-thread away from incorrect control-flow paths and can accurately identify p-thread's correct work and avoid it, otherwise it would be redundant. By eliminating useless and redundant works, SAE outperforms existing architectures that share similar high-level micro-architecture while incurring only minor hardware additions/changes. Detailed experimental results confirm that SAE indeed reduces the number of useless and redundant computations. We also report an average performance improvement of 18% for the SPEC_INT2000 benchmarks.
{"title":"Speculative-Aware Execution: A simple and efficient technique for utilizing multi-cores to improve single-thread performance","authors":"R. Mameesh, M. Franklin","doi":"10.1145/1854273.1854326","DOIUrl":"https://doi.org/10.1145/1854273.1854326","url":null,"abstract":"In this paper a new architecture, Speculative-Aware Execution (SAE) is presented that employs speculative-awareness as a means of mitigating the drawbacks of speculative execution which are: useless work (uses speculative values so it produces incorrect results or is done on the wrong path) and redundant work (produces results previously obtained). In order to achieve this, SAE tries to partition the dynamic instruction stream into two disjoint parallel threads: A speculative thread that is partially speculative-aware (p-thread) as it records its speculative state and uses it to avoid useless work (using speculative values) but have no account for its control-flow violations; and a fully speculative-aware thread (f-thread) that has full record of p-thread's speculations, and so can steer p-thread away from incorrect control-flow paths and can accurately identify p-thread's correct work and avoid it, otherwise it would be redundant. By eliminating useless and redundant works, SAE outperforms existing architectures that share similar high-level micro-architecture while incurring only minor hardware additions/changes. Detailed experimental results confirm that SAE indeed reduces the number of useless and redundant computations. We also report an average performance improvement of 18% for the SPEC_INT2000 benchmarks.","PeriodicalId":422461,"journal":{"name":"2010 19th International Conference on Parallel Architectures and Compilation Techniques (PACT)","volume":"21 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2010-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126982333","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}