
Latest publications: 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

EASE: Energy Efficiency and Proportionality Aware Virtual Machine Scheduling
Congfeng Jiang, Yumei Wang, Dongyang Ou, Yeliang Qiu, Youhuizi Li, Jian Wan, Bing Luo, Weisong Shi, C. Cérin
Servers have different energy efficiency and energy proportionality (EP) due to their hardware configuration (i.e., CPU generation and installed memory) and workload. However, current virtual machine (VM) scheduling in virtualized environments saturates servers without considering their energy efficiency and EP differences. This article discusses EASE, an energy efficiency and proportionality aware VM scheduling approach. EASE first executes customized computing-intensive, memory-intensive, and hybrid benchmarks to calculate a server's energy efficiency and EP. It then schedules VMs to servers so as to keep them working at their peak energy efficiency point (or within their optimal working range). This step improves the overall energy efficiency of the cluster and the data center. To guarantee performance, EASE migrates VMs away from servers under high contention. Experimental results on real clusters show that power consumption can be reduced by 37.07%–49.98% in a homogeneous cluster, while the average completion time of computing-intensive VMs increases by only 0.31%–8.49%. On heterogeneous nodes, the power consumption of computing-intensive VMs can be reduced by 44.22%, and job completion time by 53.80%.
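The paper's exact scheduling algorithm is not given in the abstract; as a hedged illustration of the core idea, the hypothetical sketch below places a new VM on the server whose post-placement utilization stays closest to its measured peak-efficiency point, skipping servers the VM would saturate. All names and the placement rule are assumptions for illustration only.

```python
# Hypothetical sketch of efficiency-aware VM placement (not the authors'
# actual algorithm): each server has a measured peak-efficiency utilization
# point; a new VM goes to the server whose resulting utilization lands
# closest to that point.

def place_vm(servers, vm_load):
    """servers: list of dicts with 'util' and 'peak_eff_util' (both in [0, 1]).
    Returns the index of the chosen server, or None if no server can host the VM."""
    best, best_dist = None, float("inf")
    for i, s in enumerate(servers):
        new_util = s["util"] + vm_load
        if new_util > 1.0:          # the VM would saturate this server; skip it
            continue
        dist = abs(new_util - s["peak_eff_util"])
        if dist < best_dist:
            best, best_dist = i, dist
    return best

servers = [
    {"util": 0.70, "peak_eff_util": 0.75},  # close to its sweet spot
    {"util": 0.20, "peak_eff_util": 0.80},  # far below its sweet spot
]
print(place_vm(servers, 0.10))  # -> 0 (server 0 lands exactly on 0.75)
```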
DOI: 10.1109/CAHPC.2018.8645948
Citations: 5
A Scalability and Sensitivity Study of Parallel Geometric Algorithms for Graph Partitioning
Shad Kirmani, Hongyang Sun, P. Raghavan
Graph partitioning arises in many computational simulation workloads, including those that involve finite difference or finite element methods, where partitioning enables efficient parallel processing of the entire simulation. We focus on parallel geometric algorithms for partitioning large graphs whose vertices are associated with coordinates in two- or three-dimensional space on multi-core processors. Compared with other types of partitioning algorithms, geometric schemes generally show better scalability on a large number of processors or cores. This paper studies the scalability and sensitivity of two parallel algorithms, namely recursive coordinate bisection (pRCB) and geometric mesh partitioning (pGMP), in terms of their robustness to several key factors that affect partition quality, including coordinate perturbation, approximate embedding, mesh quality, and graph planarity. Our results indicate that partition quality, as measured by the size of the edge separator (or cutsize), remains consistently better for pGMP than for pRCB. On average over our test suite, pGMP yields cutsizes 25% smaller than pRCB's on the original embedding, and across all perturbations its cutsizes are smaller by at least 8% and by as much as 50%. Not surprisingly, higher quality cuts are obtained at the expense of longer execution times; on a single core, pGMP is on average almost 10 times slower than pRCB, but it scales better and catches up at 32 cores, where it is slower by less than 20%. With core counts per chip continuing to increase, these results suggest that pGMP is an attractive solution when a modest number of cores can be deployed to reduce execution times while providing high quality partitions.
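The idea behind recursive coordinate bisection (the sequential core of pRCB) can be sketched in a few lines: split the point set at the median of its widest coordinate axis, then recurse on each half. This is a minimal sequential sketch, not the paper's parallel implementation.

```python
# Minimal sequential sketch of recursive coordinate bisection (RCB):
# split the points at the median of the axis with the largest spread,
# then recurse until the requested number of parts is reached.

def rcb(points, parts):
    """points: list of (x, y) tuples; parts: a power of two.
    Returns a list of point lists, one per partition."""
    if parts == 1:
        return [points]
    # Pick the axis with the largest coordinate spread.
    spans = [max(p[d] for p in points) - min(p[d] for p in points) for d in (0, 1)]
    axis = 0 if spans[0] >= spans[1] else 1
    ordered = sorted(points, key=lambda p: p[axis])
    mid = len(ordered) // 2
    return rcb(ordered[:mid], parts // 2) + rcb(ordered[mid:], parts // 2)

pts = [(0, 0), (1, 0), (2, 0), (3, 0), (0, 1), (1, 1), (2, 1), (3, 1)]
print([len(h) for h in rcb(pts, 2)])  # -> [4, 4], split along the wider x axis
```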
DOI: 10.1109/CAHPC.2018.8645916
Citations: 0
2018 30th International Symposium on Computer Architecture and High Performance Computing
DOI: 10.1109/cahpc.2018.8645958
Citations: 0
Polyhedral Dataflow Programming: A Case Study
Romain Fontaine, L. Gonnord, L. Morel
Dataflow languages expose an application's potential parallelism naturally and have thus been studied and developed over the past thirty years as a solution for harnessing increasing hardware parallelism. However, when generating code for parallel processors, current dataflow compilers only take into consideration the overall dataflow network of the application. This leaves out the potential parallelism that could be extracted from the internals of agents, typically when those include loop nests, for instance, but also potential applications of intra-agent pipelining, or task splitting and rescheduling. In this work, we study the benefits of jointly using polyhedral compilation with dataflow languages. More precisely, we propose to extend the parallelization of dataflow programs by taking into account the parallelism exposed by loop nests describing the internal behavior of the program's agents. This approach is validated through the development of a prototype toolchain based on an extended version of the ΣC language. We demonstrate the benefit of this approach and the potential for further improvements on relevant case studies.
DOI: 10.1109/CAHPC.2018.8645947
Citations: 4
Exploring Power Budget Scheduling Opportunities and Tradeoffs for AMR-Based Applications
Yubo Qin, I. Rodero, P. Subedi, M. Parashar, S. Rigo
Computational demand has brought major changes to Advanced Cyber-Infrastructure (ACI) architectures. It is now possible to run scientific simulations faster and obtain more accurate results. However, power and energy have become critical concerns. Also, the current roadmap toward the new generation of ACI includes power budget as one of the main constraints. Current research efforts have studied power and performance tradeoffs and how to balance these (e.g., using Dynamic Voltage and Frequency Scaling (DVFS) and power capping for meeting power constraints, which can impact performance). However, applications may not tolerate degradation in performance, and other tradeoffs need to be explored to meet power budgets (e.g., involving the application in making energy-performance-quality tradeoff decisions). This paper proposes using the properties of AMR-based algorithms (e.g., dynamically adjusting the resolution of a simulation in combination with power capping techniques) to schedule or re-distribute the power budget. It specifically explores the opportunities to realize such an approach using checkpointing as a proof-of-concept use case and provides a characterization of a representative set of applications that use Adaptive Mesh Refinement (AMR) methods, including a Low-Mach-Number Combustion (LMC) application. It also explores the potential of utilizing power capping to understand power-quality tradeoffs via simulation.
DOI: 10.1109/CAHPC.2018.8645941
Citations: 0
Predicting the Performance Impact of Increasing Memory Bandwidth for Scientific Workflows
N. Gonzalez, J. Brunheroto, F. Artico, Yoonho Park, T. Carvalho, C. Miers, M. A. Pillon, G. Koslovski
The disparity between the bandwidth provided by modern processors and by main memory has led to the issue known as the memory wall, in which application performance becomes completely bound by memory speed. Newer technologies are trying to increase memory bandwidth to address this issue, but the effects of increased bandwidth on application performance remain underexplored. This paper investigates these effects for scientific workflows, focusing on the definition of a performance model and on experiments that validate the rationale behind the model. The main contribution is based on two observations: memory-bound applications benefit more from an increase in memory bandwidth, and the gains from additional bandwidth for a particular application gradually diminish as bandwidth increases.
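The paper's own performance model is not reproduced in the abstract; as a hedged illustration of why memory-bound codes gain from extra bandwidth while compute-bound codes do not, a roofline-style bound (a standard model, assumed here rather than taken from the paper) makes the two observations concrete.

```python
# Roofline-style bound: attainable throughput is capped either by the
# machine's peak compute rate or by (arithmetic intensity x bandwidth),
# whichever is lower.

def attainable_gflops(arith_intensity, peak_gflops, bw_gbs):
    """arith_intensity: flops per byte; bw_gbs: memory bandwidth in GB/s."""
    return min(peak_gflops, arith_intensity * bw_gbs)

# A memory-bound kernel (0.25 flop/byte) vs. a compute-bound one
# (50 flop/byte) on a machine with a 1000 GFLOP/s peak, before and
# after doubling memory bandwidth from 100 to 200 GB/s.
for ai in (0.25, 50):
    before = attainable_gflops(ai, 1000, 100)
    after = attainable_gflops(ai, 1000, 200)
    print(ai, before, after)
# The memory-bound kernel's bound doubles; the compute-bound kernel's
# bound does not move, matching the paper's first observation.
```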
DOI: 10.1109/CAHPC.2018.8645886
Citations: 0
A Batch Task Migration Approach for Decentralized Global Rescheduling
Vinicius Freitas, A. Santana, M. Castro, L. Pilla
Effectively mapping tasks of High Performance Computing (HPC) applications onto parallel systems is crucial to ensure substantial performance gains. As platforms and applications grow, load imbalance becomes a priority issue. Even though centralized rescheduling has been a viable solution to mitigate this problem, its efficiency cannot keep up with the increasing size of shared memory platforms. To solve load imbalance efficiently today, and in the years to come, we should prioritize decentralized strategies developed for large scale platforms. In this paper, we propose a Batch Task Migration approach to improve decentralized global rescheduling, ultimately reducing communication costs and preserving task locality. We implemented and evaluated our approach on two different parallel platforms, using both synthetic workloads and a molecular dynamics (MD) benchmark. Our solution achieved speedups in rescheduling time of up to 3.75 and 1.15 compared to other centralized and distributed approaches, respectively. Moreover, it improved the execution time of MD by factors of up to 1.34 and 1.22 compared to a scenario without load balancing on two different platforms.
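The abstract does not spell out the migration algorithm; as a hypothetical sketch of the batching idea (not the authors' method), the most-loaded node below hands off a whole batch of its smallest tasks to the least-loaded node in one step, rather than negotiating one task at a time. Function and parameter names are assumptions.

```python
# Hypothetical greedy sketch of batched task migration: move up to
# batch_size tasks at once from the most- to the least-loaded node.

def migrate_batch(loads, tasks, batch_size):
    """loads: per-node total load; tasks: per-node list of task loads.
    Mutates both in place and returns the updated loads."""
    src = max(range(len(loads)), key=loads.__getitem__)
    dst = min(range(len(loads)), key=loads.__getitem__)
    if src == dst:                      # already balanced; nothing to move
        return loads
    batch = sorted(tasks[src])[:batch_size]   # smallest tasks first
    for t in batch:
        tasks[src].remove(t)
        tasks[dst].append(t)
        loads[src] -= t
        loads[dst] += t
    return loads

print(migrate_batch([10, 2], [[4, 3, 2, 1], [2]], 2))  # -> [7, 5]
```

Migrating a batch per exchange amortizes the communication cost of each negotiation round, which is the intuition the paper's approach builds on.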
DOI: 10.1109/CAHPC.2018.8645953
Citations: 7
A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs
H. Anzt, J. Dongarra
Jaccard weights are a popular metric for identifying communities in social network analytics. In this paper we present a kernel for efficiently computing the Jaccard weight matrix on GPUs. The kernel design is guided by fine-grained parallelism and the independent thread scheduling supported by NVIDIA's Volta architecture. This technology makes it possible to interleave the execution of divergent branches for enhanced data reuse and a higher instructions-per-cycle rate for memory-bound algorithms. In a performance evaluation using a set of publicly available social networks, we report kernel execution times and analyze the built-in hardware counters on different GPU architectures. The findings have implications beyond this specific algorithm and suggest a reformulation of other data-sparse algorithms.
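For reference, the Jaccard weight of an edge (u, v) is the standard Jaccard similarity of the two endpoints' neighbor sets. The minimal sequential version below states the formula the GPU kernel computes; the paper's contribution is the kernel itself, not this definition.

```python
# Jaccard weight of an edge (u, v): |N(u) ∩ N(v)| / |N(u) ∪ N(v)|,
# where N(x) is the neighbor set of vertex x.

def jaccard_weight(adj, u, v):
    """adj: dict mapping each vertex to its set of neighbors."""
    inter = len(adj[u] & adj[v])
    union = len(adj[u] | adj[v])
    return inter / union if union else 0.0

adj = {
    0: {1, 2, 3},
    1: {0, 2},
    2: {0, 1, 3},
    3: {0, 2},
}
print(jaccard_weight(adj, 1, 3))  # -> 1.0 (identical neighborhoods {0, 2})
```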
DOI: 10.1109/CAHPC.2018.8645946
Citations: 2
A Novel Broker-Based Hierarchical Authentication Scheme in Proxy Mobile IPv6 Networks
Su-Hwan Jang, Jongpil Jeong, Byungjun Park
Most current research on Proxy Mobile IPv6 (PMIPv6) focuses on how to optimize the interactions between the PMIPv6 and AAA (Authentication, Authorization, and Accounting) protocols. This paper describes a cost-effective hierarchical authentication scheme that focuses on minimizing authentication latency in AAA processing. In this scheme, a hierarchical AAA architecture is proposed, in which the AAA servers are deployed on the Local Mobility Anchor (LMA), and the Root AAA server manages several Leaf AAA servers and Brokers on behalf of the AAA server in the home domain. Simulation results show that the proposed scheme clearly reduces handoff and authentication latency compared to the previous combined authentication model.
DOI: 10.1109/CAHPC.2018.8645943
Citations: 2
Design Space Exploration of Energy Efficient NoC- and Cache-Based Many-Core Architecture
M. Souza, H. Freitas, J. Méhaut
The performance of parallel scientific applications on many-core processor architectures is a growing challenge, especially where energy efficiency is concerned. Addressing it requires exploring architectures with high processing power built around a network-on-chip (NoC) that integrates many processing cores and other components. In this context, this paper presents a design space exploration of NoC-based many-core processor architectures with distributed and shared caches, using full-system simulations. We evaluate bottlenecks in such architectures with regard to energy efficiency, using different parallel scientific applications and jointly considering aspects of caches and NoCs. Five applications from the NAS Parallel Benchmarks were executed on the proposed architectures, which vary in number of cores, in L2 cache size, and across 12 types of NoC topologies. A clustered topology was set up, with which we obtain performance gains of up to 30.56% and reductions in energy consumption of up to 38.53% compared to a traditional one.
DOI: 10.1109/CAHPC.2018.8645930
Citations: 0
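The exploration methodology in the abstract above — sweep core counts, L2 cache sizes, and NoC topologies, then pick the most energy-efficient point — can be sketched as follows. The configuration lists and the throughput/power model are invented placeholders for illustration, not numbers or models from the paper.

```python
# Illustrative sketch, not the paper's simulator: exhaustively score a tiny
# many-core design space and select the configuration with the best
# throughput per watt. All model coefficients here are made up.

from itertools import product

CORE_COUNTS = [16, 32, 64]
L2_SIZES_KB = [256, 512, 1024]
TOPOLOGIES = {"mesh": 1.00, "torus": 1.08, "clustered": 1.25}  # assumed speedups

def score(cores, l2_kb, topo_speedup):
    """Toy model: throughput scales sub-linearly with cores and cache,
    while power grows with both; returns throughput per watt."""
    throughput = (cores ** 0.8) * (l2_kb ** 0.1) * topo_speedup
    power_w = 2.0 * cores + 0.02 * l2_kb
    return throughput / power_w

best = max(
    product(CORE_COUNTS, L2_SIZES_KB, TOPOLOGIES),
    key=lambda cfg: score(cfg[0], cfg[1], TOPOLOGIES[cfg[2]]),
)
print(best)  # → (16, 256, 'clustered') under this toy model
```

In a real study each `score` call would be a full-system simulation rather than a formula, which is why pruning the design space matters; the selection logic, however, has exactly this shape.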
Journal: 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)