Smoothing on Dynamic Concurrency Throttling

Janaina Schwarzrock, Hiago Rocha, A. Lorenzon, A. C. S. Beck
{"title":"Smoothing on Dynamic Concurrency Throttling","authors":"Janaina Schwarzrock, Hiago Rocha, A. Lorenzon, A. C. S. Beck","doi":"10.1109/IPDPSW55747.2022.00154","DOIUrl":null,"url":null,"abstract":"Technology scaling has been allowing a growing number of cores in processors to satisfy the increasing demand of new applications, which need to process huge amounts of data in High-Performance Computing (HPC). However, considering that many parallel applications have limited scalability, not always activating the maximum number of available cores to execute an application will provide the best outcome in energy and performance (represented by the Energy-Delay Product, or EDP). Because of that, many works have already proposed different Dynamic Concurrency Throttling (DCT) techniques to adapt thread count as the application executes. However, dynamically tuning (i.e. increasing or decreasing) thread count implies in overheads related to caches warm-up and thread reallocation across the cores. This overhead may become very significant when thread count changes often during execution, which may overcome the benefits brought by DCT. This problem is further aggravated in Non-Uniform Memory Access (NUMA) systems since some cores are more distant, in terms of latency, than others. With that in mind, in this paper, we propose a smoothing-based strategy to minimize the thread count changes and, consequently, mitigate the aforementioned overhead. Our proposal is generic and aims further to improve the optimization results of any DCT technique. As case-study, we performed experiments on two multicore systems with nine well-known benchmarks, showing that our smoothing technique improves EDP results of offline and online state-of-the-art DCT techniques by up to 93% and 89% (both 22% on mean), respectively.","PeriodicalId":286968,"journal":{"name":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 IEEE International Parallel and Distributed Processing Symposium Workshops (IPDPSW)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/IPDPSW55747.2022.00154","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1

Abstract

Technology scaling has allowed a growing number of cores per processor to satisfy the increasing demands of new applications, which must process huge amounts of data in High-Performance Computing (HPC). However, because many parallel applications have limited scalability, activating the maximum number of available cores to execute an application does not always provide the best outcome in energy and performance (represented by the Energy-Delay Product, or EDP). For that reason, many works have proposed Dynamic Concurrency Throttling (DCT) techniques that adapt the thread count as the application executes. However, dynamically tuning (i.e., increasing or decreasing) the thread count incurs overheads related to cache warm-up and thread reallocation across cores. This overhead may become very significant when the thread count changes often during execution and may outweigh the benefits brought by DCT. The problem is further aggravated in Non-Uniform Memory Access (NUMA) systems, since some cores are more distant from each other, in terms of latency, than others. With that in mind, in this paper we propose a smoothing-based strategy to minimize thread count changes and, consequently, mitigate the aforementioned overhead. Our proposal is generic and aims to further improve the optimization results of any DCT technique. As a case study, we performed experiments on two multicore systems with nine well-known benchmarks, showing that our smoothing technique improves the EDP results of state-of-the-art offline and online DCT techniques by up to 93% and 89% (both 22% on average), respectively.
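The abstract does not detail the smoothing filter itself, but the general idea of damping a DCT policy's per-region thread-count decisions can be sketched as follows. This is a minimal illustration, not the authors' implementation: dct_choose_threads is a hypothetical placeholder for any offline or online DCT policy, and the EMA weight ALPHA and hysteresis margin MARGIN are illustrative values, not parameters taken from the paper.

```c
/* Sketch: smoothing the raw decisions of a DCT policy before applying them,
 * so that the thread count is reconfigured less often. */
#include <math.h>
#include <stdio.h>
#include <stdlib.h>
#include <omp.h>

#define ALPHA  0.4   /* EMA weight: smaller values smooth more aggressively (illustrative) */
#define MARGIN 3     /* reconfigure only when the smoothed target moves at least this far  */

/* Toy stand-in for a DCT policy: oscillates between two thread counts to
 * mimic a noisy per-region decision. A real policy would come from the
 * offline or online DCT technique being smoothed. */
static int dct_choose_threads(int region_id)
{
    return (region_id % 2 == 0) ? 4 : 12;
}

int main(void)
{
    const int max_threads = omp_get_max_threads();
    double smoothed = max_threads;   /* filtered thread-count estimate   */
    int    active   = max_threads;   /* thread count currently in force  */

    for (int r = 0; r < 16; r++) {
        int raw = dct_choose_threads(r);                    /* raw DCT decision */
        smoothed = ALPHA * raw + (1.0 - ALPHA) * smoothed;  /* EMA filter       */

        int target = (int)lround(smoothed);
        /* Hysteresis: ignore small oscillations so caches stay warm and
         * threads are not reshuffled across NUMA nodes for marginal gains. */
        if (abs(target - active) >= MARGIN) {
            active = target;
            omp_set_num_threads(active);
        }
        printf("region %2d: raw=%2d smoothed=%5.2f active=%2d\n",
               r, raw, smoothed, active);

        #pragma omp parallel
        {
            /* ... parallel region body ... */
        }
    }
    return 0;
}
```

Compile with gcc -fopenmp -lm. The filter plus hysteresis prevents omp_set_num_threads from being called for every small oscillation of the raw policy, which is exactly the reconfiguration overhead the paper targets, and which is costliest on NUMA machines where reallocation moves threads across distant cores.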