Title: EASE: Energy Efficiency and Proportionality Aware Virtual Machine Scheduling
Authors: Congfeng Jiang, Yumei Wang, Dongyang Ou, Yeliang Qiu, Youhuizi Li, Jian Wan, Bing Luo, Weisong Shi, C. Cérin
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645948
Abstract: Servers differ in energy efficiency and energy proportionality (EP) because of their hardware configuration (e.g., CPU generation and installed memory) and workload. However, current virtual machine (VM) scheduling in virtualized environments saturates servers without accounting for these differences. This article presents EASE, an energy efficiency and proportionality aware VM scheduling approach. EASE first runs customized compute-intensive, memory-intensive, and hybrid benchmarks to measure a server's energy efficiency and EP. It then schedules VMs so that servers operate at their peak energy efficiency point (or within their optimal working range), improving the overall energy efficiency of the cluster and the data center. To guarantee performance, EASE migrates VMs away from servers under high contention. Experimental results on real clusters show that power consumption is reduced by 37.07%-49.98% on a homogeneous cluster, while the average completion time of compute-intensive VMs increases by only 0.31%-8.49%. On heterogeneous nodes, the power consumption of compute-intensive VMs is reduced by 44.22% and job completion time by 53.80%.
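The scheduling target described in the abstract, keeping each server near its peak energy efficiency point, can be illustrated with a minimal sketch. This is not EASE's actual code; the function name, the (utilization, throughput, power) sample format, and all numbers are invented for illustration:

```python
# Hedged sketch: find a server's peak energy efficiency point from
# benchmark measurements, as EASE's profiling step is described to do.
# All measurement values below are made up.

def peak_efficiency_point(samples):
    """samples: list of (utilization, throughput, power_watts) tuples.
    Returns (utilization, efficiency) where energy efficiency
    (throughput per watt) is highest."""
    best = max(samples, key=lambda s: s[1] / s[2])
    util, throughput, power = best
    return util, throughput / power

# Illustrative measurements for one server: on hardware that is not
# energy proportional, efficiency often peaks below full saturation.
measurements = [
    (0.2, 200.0, 120.0),   # lightly loaded: idle power dominates
    (0.5, 520.0, 170.0),
    (0.7, 700.0, 205.0),
    (0.9, 810.0, 260.0),   # near saturation: power grows faster than throughput
]

util, eff = peak_efficiency_point(measurements)  # peak is at 70% utilization here
```

In this toy data the best throughput-per-watt occurs at 70% utilization, which is the kind of operating point a proportionality-aware scheduler would steer VMs toward rather than saturating the server.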
Title: A Scalability and Sensitivity Study of Parallel Geometric Algorithms for Graph Partitioning
Authors: Shad Kirmani, Hongyang Sun, P. Raghavan
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645916
Abstract: Graph partitioning arises in many computational simulation workloads, including those using finite difference or finite element methods, where partitioning enables efficient parallel processing of the entire simulation. We focus on parallel geometric algorithms for partitioning large graphs whose vertices carry coordinates in two- or three-dimensional space on multi-core processors. Compared with other partitioning algorithms, geometric schemes generally scale better to large numbers of processors or cores. This paper studies the scalability and sensitivity of two parallel algorithms, recursive coordinate bisection (pRCB) and geometric mesh partitioning (pGMP), in terms of their robustness to several key factors that affect partition quality: coordinate perturbation, approximate embedding, mesh quality, and graph planarity. Our results indicate that partition quality, measured by the size of the edge separator (cutsize), is consistently better for pGMP than for pRCB. On average across our test suite, pGMP yields cutsizes 25% smaller than pRCB on the original embedding, and across all perturbations its cutsizes are smaller by at least 8% and by as much as 50%. Not surprisingly, higher-quality cuts come at the expense of longer execution times: on a single core, pGMP is on average almost 10 times slower than pRCB, but it scales better and at 32 cores is slower by less than 20%. With per-chip core counts continuing to increase, these results suggest that pGMP is an attractive option when a modest number of cores can be deployed to reduce execution times while providing high-quality partitions.
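The recursive coordinate bisection idea behind pRCB is simple to state: repeatedly split the point set at the median of its widest coordinate axis. A minimal sequential sketch (not the paper's parallel implementation; function name and parameters are illustrative):

```python
# Hedged sketch of plain recursive coordinate bisection (RCB), the
# geometric idea underlying pRCB. Sequential, for illustration only.

def rcb(points, indices=None, depth=0, max_depth=3):
    """Recursively split points (tuples of coordinates) at the median
    along the axis of largest spread. Returns a list of index lists,
    one per partition (2**max_depth partitions for well-sized inputs)."""
    if indices is None:
        indices = list(range(len(points)))
    if depth == max_depth or len(indices) <= 1:
        return [indices]
    dims = len(points[0])
    # Choose the coordinate axis with the largest extent.
    spans = [max(points[i][d] for i in indices) - min(points[i][d] for i in indices)
             for d in range(dims)]
    axis = spans.index(max(spans))
    # Split the points at the median along that axis.
    indices = sorted(indices, key=lambda i: points[i][axis])
    mid = len(indices) // 2
    return (rcb(points, indices[:mid], depth + 1, max_depth)
            + rcb(points, indices[mid:], depth + 1, max_depth))

# Eight collinear points: three levels of bisection yield eight parts.
pts = [(float(i), 0.0) for i in range(8)]
parts = rcb(pts)
```

The parallel variant studied in the paper distributes this recursion across cores; the geometric mesh partitioning alternative (pGMP) uses a more expensive sphere-based separator construction, which is what buys its smaller cutsizes.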
Title: 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)
Pub Date: 2018-09-01  DOI: 10.1109/cahpc.2018.8645958
Title: Polyhedral Dataflow Programming: A Case Study
Authors: Romain Fontaine, L. Gonnord, L. Morel
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645947
Abstract: Dataflow languages expose an application's potential parallelism naturally and have thus been studied and developed for the past thirty years as a way to harness increasing hardware parallelism. However, when generating code for parallel processors, current dataflow compilers consider only the application's overall dataflow network. This leaves out the parallelism that could be extracted from the internals of agents, typically when those contain loop nests, as well as potential intra-agent pipelining and task splitting and rescheduling. In this work, we study the benefits of combining polyhedral compilation with dataflow languages. More precisely, we propose to extend the parallelization of dataflow programs by also exploiting the parallelism exposed by the loop nests that describe the internal behavior of the program's agents. We validate this approach with a prototype toolchain based on an extended version of the ΣC language, and demonstrate its benefits and the potential for further improvements on relevant case studies.
Title: Exploring Power Budget Scheduling Opportunities and Tradeoffs for AMR-Based Applications
Authors: Yubo Qin, I. Rodero, P. Subedi, M. Parashar, S. Rigo
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645941
Abstract: Computational demand has brought major changes to Advanced Cyber-Infrastructure (ACI) architectures. It is now possible to run scientific simulations faster and obtain more accurate results; however, power and energy have become critical concerns, and the current roadmap toward the next generation of ACI includes the power budget as one of its main constraints. Existing research has studied power-performance tradeoffs and how to balance them (e.g., using Dynamic Voltage and Frequency Scaling (DVFS) and power capping to meet power constraints, which can impact performance). However, applications may not tolerate performance degradation, and other tradeoffs must be explored to meet power budgets (e.g., involving the application in energy-performance-quality tradeoff decisions). This paper proposes using the properties of AMR-based algorithms (e.g., dynamically adjusting a simulation's resolution in combination with power capping techniques) to schedule or redistribute the power budget. It explores how to realize this approach using checkpointing as a proof-of-concept use case, characterizes a representative set of applications that use Adaptive Mesh Refinement (AMR) methods, including a Low-Mach-Number Combustion (LMC) application, and uses simulation to study the potential of power capping for understanding power-quality tradeoffs.
Title: Predicting the Performance Impact of Increasing Memory Bandwidth for Scientific Workflows
Authors: N. Gonzalez, J. Brunheroto, F. Artico, Yoonho Park, T. Carvalho, C. Miers, M. A. Pillon, G. Koslovski
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645886
Abstract: The disparity between the bandwidth provided by modern processors and that of main memory leads to the problem known as the memory wall, in which application performance becomes bound entirely by memory speed. Newer technologies attempt to increase memory bandwidth to address this issue, but the effects of increased bandwidth on application performance remain underexplored. This paper investigates these effects for scientific workflows, focusing on the definition of a performance model and on experiments that validate the model's rationale. The main contribution rests on two observations: memory-bound applications benefit more from an increase in memory bandwidth, and for a given application the benefit of additional bandwidth gradually diminishes as bandwidth increases.
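Both observations in this abstract fall out of a roofline-style bound, where execution time is limited by either compute or memory traffic. A minimal sketch, not the authors' actual model, with all machine numbers invented:

```python
# Hedged sketch: a roofline-style time bound illustrating why bandwidth
# gains show diminishing returns. Not the paper's model; numbers invented.

def predicted_time(flops, bytes_moved, peak_flops, bandwidth):
    """Execution time is bounded below by compute time and by memory
    transfer time; the larger of the two dominates."""
    return max(flops / peak_flops, bytes_moved / bandwidth)

# A memory-bound kernel: 1 GFLOP of work, 8 GB of traffic,
# on a machine with 100 GFLOP/s of peak compute.
t_slow = predicted_time(1e9, 8e9, 100e9, 10e9)    # 10 GB/s  -> 0.8 s
t_fast = predicted_time(1e9, 8e9, 100e9, 100e9)   # 10x BW   -> 0.08 s (full 10x gain)
t_huge = predicted_time(1e9, 8e9, 100e9, 1000e9)  # 100x BW  -> 0.01 s (compute-bound)
```

The first 10x bandwidth increase yields a full 10x speedup because the kernel is memory bound; the next 10x yields only 8x more, because the kernel has crossed over to being compute bound, matching the diminishing-returns observation.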
Title: A Batch Task Migration Approach for Decentralized Global Rescheduling
Authors: Vinicius Freitas, A. Santana, M. Castro, L. Pilla
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645953
Abstract: Effectively mapping the tasks of High Performance Computing (HPC) applications onto parallel systems is crucial for achieving substantial performance gains. As platforms and applications grow, load imbalance becomes a priority issue. Although centralized rescheduling has been a viable way to mitigate this problem, its efficiency cannot keep up with the increasing size of shared-memory platforms. To solve load imbalance efficiently, now and in the years to come, decentralized strategies designed for large-scale platforms should be prioritized. In this paper, we propose a Batch Task Migration approach to improve decentralized global rescheduling, reducing communication costs while preserving task locality. We implemented and evaluated our approach on two different parallel platforms, using both synthetic workloads and a molecular dynamics (MD) benchmark. Our solution achieved speedups in rescheduling time of up to 3.75 and 1.15 compared to other centralized and distributed approaches, respectively. Moreover, it improved the execution time of MD by factors of up to 1.34 and 1.22 compared to a scenario without load balancing on the two platforms.
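The core of the batch idea is that an overloaded node sheds tasks in groups rather than one message per task. A toy sketch under invented assumptions (this is not the authors' protocol; the greedy selection, task format, and batch size are all illustrative):

```python
# Hedged sketch: group tasks selected for migration into batches so an
# overloaded node can offload them in few messages. Illustrative only;
# not the paper's Batch Task Migration algorithm.

def make_migration_batches(tasks, target_load, batch_size):
    """tasks: list of (task_id, load). Greedily pick the heaviest tasks
    until the node's remaining load drops to target_load, then chunk
    the picked tasks into batches of batch_size."""
    tasks = sorted(tasks, key=lambda t: t[1], reverse=True)
    load = sum(l for _, l in tasks)
    picked = []
    for tid, l in tasks:
        if load <= target_load:
            break
        picked.append((tid, l))
        load -= l
    # One message per batch instead of one per task.
    return [picked[i:i + batch_size] for i in range(0, len(picked), batch_size)]

# Node with load 10 wants to come down to 5: sheds the two heaviest
# tasks, which fit in a single batch of size 2.
batches = make_migration_batches([(0, 4), (1, 3), (2, 2), (3, 1)], 5, 2)
```

Sending batches amortizes per-message overhead, which is the communication-cost reduction the abstract refers to; the paper additionally considers task locality when choosing what to migrate, which this sketch ignores.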
Title: A Jaccard Weights Kernel Leveraging Independent Thread Scheduling on GPUs
Authors: H. Anzt, J. Dongarra
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645946
Abstract: Jaccard weights are a popular metric for identifying communities in social network analytics. In this paper we present a kernel for efficiently computing the Jaccard weight matrix on GPUs. The kernel design is guided by fine-grained parallelism and the independent thread scheduling supported by NVIDIA's Volta architecture, which makes it possible to interleave the execution of divergent branches for enhanced data reuse and a higher instructions-per-cycle rate in memory-bound algorithms. In a performance evaluation on a set of publicly available social networks, we report kernel execution times and analyze the built-in hardware counters on different GPU architectures. The findings have implications beyond this specific algorithm and suggest reformulating other data-sparse algorithms.
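For reference, one common definition of the Jaccard weight of an edge (u, v) is the overlap of the two endpoints' neighborhoods, |N(u) ∩ N(v)| / |N(u) ∪ N(v)|. A plain CPU sketch of that definition, which does not attempt to reproduce the paper's GPU kernel:

```python
# Hedged sketch: CPU reference for the Jaccard weight of a vertex pair,
# under one common neighborhood-overlap definition. The paper's
# contribution is an efficient GPU kernel, not shown here.

def jaccard_weight(adj, u, v):
    """adj maps each vertex to the set of its neighbors."""
    nu, nv = adj[u], adj[v]
    union = len(nu | nv)
    return len(nu & nv) / union if union else 0.0

# Small undirected graph with edges 0-1, 0-2, 1-2, 1-3.
adj = {
    0: {1, 2},
    1: {0, 2, 3},
    2: {0, 1},
    3: {1},
}
w = jaccard_weight(adj, 0, 1)  # N(0)={1,2}, N(1)={0,2,3}: 1 shared of 4 total
```

On a GPU, vertices have widely varying degrees, so threads computing these set intersections diverge heavily; Volta's independent thread scheduling is what lets the paper's kernel interleave those divergent paths productively.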
Title: A Novel Broker-Based Hierarchical Authentication Scheme in Proxy Mobile IPv6 Networks
Authors: Su-Hwan Jang, Jongpil Jeong, Byungjun Park
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645943
Abstract: Most current research on Proxy Mobile IPv6 (PMIPv6) focuses on optimizing the interactions between PMIPv6 and the AAA (Authentication, Authorization, and Accounting) protocol. This paper describes a cost-effective hierarchical authentication scheme that focuses on minimizing the authentication latency of AAA processing. The scheme proposes a hierarchical AAA architecture in which AAA servers are deployed on the Local Mobility Anchor (LMA), and a Root AAA server manages several Leaf AAA servers and Brokers on behalf of the AAA server in the home domain. Simulation results show that the proposed scheme clearly reduces handoff and authentication latency compared to traditional combined authentication models.
Title: Design Space Exploration of Energy Efficient NoC- and Cache-Based Many-Core Architecture
Authors: M. Souza, H. Freitas, J. Méhaut
Pub Date: 2018-09-01  DOI: 10.1109/CAHPC.2018.8645930
Abstract: The performance of parallel scientific applications on many-core processor architectures is an ever-growing challenge, especially when energy efficiency is a concern. Meeting it requires exploring architectures with high processing power built around a network-on-chip (NoC) that integrates many processing cores and other components. In this context, this paper presents a design space exploration of NoC-based many-core processor architectures with distributed and shared caches, using full-system simulations. We evaluate energy-efficiency bottlenecks in such architectures using different parallel scientific applications and considering caches and NoCs jointly. Five applications from the NAS Parallel Benchmarks were executed on the proposed architectures, which vary in the number of cores, in L2 cache size, and across 12 NoC topologies. With a clustered topology, we obtain performance gains of up to 30.56% and energy consumption reductions of up to 38.53% compared to a traditional one.