Isolation is a desirable property for applications executing in multi-tenant computing systems. On the performance side, hardware resource isolation via partitioning mechanisms is commonly applied to achieve QoS, a necessary property for many noise-sensitive parallel workloads. Conversely, on the software side, partitioning is used, usually in the form of virtual machines, to provide secure environments with smaller attack surfaces than those present in shared software stacks. In this paper, we identify a further benefit of isolation, one that is currently less appreciated in most parallel computing settings: isolation of system software stacks, including OS kernels, can lead to significant performance benefits through a reduction in variability. To highlight the existing problem in shared software stacks, we first develop a new systematic approach to measure and characterize latent sources of variability in the Linux kernel. Using this approach, we find that hardware VMs are effective substrates for limiting the kernel-level interference that otherwise occurs in monolithic kernel systems. Furthermore, by enabling reductions in variability, virtualized environments often have superior worst-case performance characteristics compared to native or containerized environments. Finally, we demonstrate that, due to their isolated software contexts, most virtualized applications consistently outperform their bare-metal counterparts when executing on 64 nodes of a multi-tenant, kernel-intensive cloud system.
{"title":"Reducing Kernel Surface Areas for Isolation and Scalability","authors":"Daniel Zahka, Brian Kocoloski, Katarzyna Keahey","doi":"10.1145/3337821.3337900","DOIUrl":"https://doi.org/10.1145/3337821.3337900","url":null,"abstract":"Isolation is a desirable property for applications executing in multi-tenant computing systems. On the performance side, hardware resource isolation via partitioning mechanisms is commonly applied to achieve QoS, a necessary property for many noise-sensitive parallel workloads. Conversely, on the software side, partitioning is used, usually in the form of virtual machines, to provide secure environments with smaller attack surfaces than those present in shared software stacks. In this paper, we identify a further benefit from isolation, one that is currently less appreciated in most parallel computing settings: isolation of system software stacks, including OS kernels, can lead to significant performance benefits through a reduction in variability. To highlight the existing problem in shared software stacks, we first developed a new systematic approach to measure and characterize latent sources of variability in the Linux kernel. Using this approach, we find that hardware VMs are effective substrates for limiting kernel-level interference that otherwise occurs in monolithic kernel systems. Furthermore, by enabling reductions in variability, we find that virtualized environments often have superior worst-case performance characteristics than native or containerized environments. Finally, we demonstrate that due to their isolated software contexts, most virtualized applications consistently outperform their bare-metal counterparts when executing on 64-nodes of a multi-tenant, kernel-intensive cloud system.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132982007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Marquita Ellis, Giulia Guidi, A. Buluç, L. Oliker, K. Yelick
We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from "third generation" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different analysis approaches than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed-memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and analyze high-level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as a commercial cloud architecture, highlighting the advantages of state-of-the-art network technologies.
{"title":"diBELLA: Distributed Long Read to Long Read Alignment","authors":"Marquita Ellis, Giulia Guidi, A. Buluç, L. Oliker, K. Yelick","doi":"10.1145/3337821.3337919","DOIUrl":"https://doi.org/10.1145/3337821.3337919","url":null,"abstract":"We present a parallel algorithm and scalable implementation for genome analysis, specifically the problem of finding overlaps and alignments for data from \"third generation\" long read sequencers [29]. While long sequences of DNA offer enormous advantages for biological analysis and insight, current long read sequencing instruments have high error rates and therefore require different approaches to analysis than their short read counterparts. Our work focuses on an efficient distributed-memory parallelization of an accurate single-node algorithm for overlapping and aligning long reads. We achieve scalability of this irregular algorithm by addressing the competing issues of increasing parallelism, minimizing communication, constraining the memory footprint, and ensuring good load balance. The resulting application, diBELLA, is the first distributed memory overlapper and aligner specifically designed for long reads and parallel scalability. We describe and present analyses for high level design trade-offs and conduct an extensive empirical analysis that compares performance characteristics across state-of-the-art HPC systems as well as a commercial cloud architectures, highlighting the advantages of state-of-the-art network technologies.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"68 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114258890","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In data-intensive parallel computing clusters, it is important to provide deadline-guaranteed service to jobs while minimizing resource usage (e.g., network bandwidth and energy). Under the current computing framework (which first allocates data and then schedules jobs), in a busy cluster with many jobs, it is difficult to achieve these objectives simultaneously. We model the problem of simultaneously achieving these objectives using integer programming, and propose a heuristic Cooperative job Scheduling and data Allocation method (CSA). A key novelty of CSA is that it reverses the order of data allocation and job scheduling in the current computing framework, i.e., it changes data-first-job-second to job-first-data-second. This enables CSA to proactively consolidate tasks that request common data onto the same server when conducting deadline-aware scheduling, and to consolidate tasks onto as few servers as possible to maximize energy savings. It also facilitates the subsequent data allocation step, which allocates a data block to the server that hosts most of that block's requester tasks, thus maximizing data locality and reducing bandwidth consumption. CSA further uses a recursive schedule refinement process to adjust the job and data allocation schedules, improving system performance with respect to the three objectives and achieving a tradeoff between data locality and energy savings with specified weights. We implemented CSA and a number of previous job schedulers on Apache Hadoop on a real supercomputing cluster. Trace-driven experiments in simulation and on the real cluster show that CSA outperforms the other schedulers in supplying deadline-guaranteed and resource-efficient services.
{"title":"Cooperative Job Scheduling and Data Allocation for Busy Data-Intensive Parallel Computing Clusters","authors":"Guoxin Liu, Haiying Shen, Haoyu Wang","doi":"10.1145/3337821.3337864","DOIUrl":"https://doi.org/10.1145/3337821.3337864","url":null,"abstract":"In data-intensive parallel computing clusters, it is important to provide deadline-guaranteed service to jobs while minimizing resource usage (e.g., network bandwidth and energy). Under the current computing framework (that first allocates data and then schedules jobs), in a busy cluster with many jobs, it is difficult to achieve these objectives simultaneously. We model the problem to simultaneously achieve the objectives using integer programming, and propose a heuristic Cooperative job Scheduling and data Allocation method (CSA). CSA novelly reverses the order of data allocation and job scheduling in the current computing framework, i.e., changing data-first-job-second to job-first-data-second. It enables CSA to proactively consolidate tasks with more common requested data to the same server when conducting deadline-aware scheduling, and also consolidate the tasks to as few servers as possible to maximize energy savings. This facilitates the subsequent data allocation step to allocate a data block to the server that hosts most of this data's requester tasks, thus maximally enhancing data locality and reduce bandwidth consumption. CSA also has a recursive schedule refinement process to adjust the job and data allocation schedules to improve system performance regarding the three objectives and achieve the tradeoff between data locality and energy savings with specified weights. We implemented CSA and a number of previous job schedulers on Apache Hadoop on a real supercomputing cluster. Trace-driven experiments in the simulation and the real cluster show that CSA outperforms other schedulers in supplying deadline-guarantee and resource-efficient services.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"281 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123720947","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
E. Ates, Yijia Zhang, Burak Aksar, Jim Brandt, V. Leung, Manuel Egele, A. Coskun
Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can exhibit more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing the "anomalies" that cause it. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance-variability-inducing synthetic anomalies, so researchers rely on ad hoc methods for reproducing performance variability. This paper addresses the lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of analytics methods, as well as of the performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also analyzes the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.
{"title":"HPAS","authors":"E. Ates, Yijia Zhang, Burak Aksar, Jim Brandt, V. Leung, Manuel Egele, A. Coskun","doi":"10.1145/3337821.3337907","DOIUrl":"https://doi.org/10.1145/3337821.3337907","url":null,"abstract":"Modern high performance computing (HPC) systems, including supercomputers, routinely suffer from substantial performance variations. The same application with the same input can have more than 100% performance variation, and such variations cause reduced efficiency and wasted resources. There have been recent studies on performance variability and on designing automated methods for diagnosing \"anomalies\" that cause performance variability. These studies either observe data collected from HPC systems, or they rely on synthetic reproduction of performance variability scenarios. However, there is no standardized way of creating performance variability inducing synthetic anomalies; so, researchers rely on designing ad-hoc methods for reproducing performance variability. This paper addresses this lack of a common method for creating relevant performance anomalies by introducing HPAS, an HPC Performance Anomaly Suite, consisting of anomaly generators for the major subsystems in HPC systems. These easy-to-use synthetic anomaly generators facilitate low-effort evaluation and comparison of various analytics methods as well as performance or resilience of applications, middleware, or systems under realistic performance variability scenarios. The paper also provides an analysis of the behavior of the anomaly generators and demonstrates several use cases: (1) performance anomaly diagnosis using HPAS, (2) evaluation of resource management policies under performance variations, and (3) design of applications that are resilient to performance variability.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114455563","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
J. Navaridas, Joshua Lant, J. A. Pascual, M. Luján, J. Goodacre
Interconnection networks are one of the main limiting factors when it comes to scaling out computing systems. In this paper, we explore the role that hybridization of topologies plays in the design of a state-of-the-art, exascale-capable computing system. More precisely, we compare several hybrid topologies against common single-topology networks under large-scale, application-like traffic. In addition, we explore how different aspects of the hybrid topology affect the overall performance of the system. In particular, we found that hybrid topologies can outperform state-of-the-art torus and fat-tree networks as long as the density of connections is high enough--one connection every two or four nodes seems to be the sweet spot--and the size of the subtori is limited to a few nodes per dimension. Moreover, we explored two different alternatives for the upper tiers of the interconnect, a fat-tree and a generalised hypercube, and found little difference between the two, with results mostly depending on the workload being executed.
{"title":"Design Exploration of Multi-tier Interconnection Networks for Exascale Systems","authors":"J. Navaridas, Joshua Lant, J. A. Pascual, M. Luján, J. Goodacre","doi":"10.1145/3337821.3337903","DOIUrl":"https://doi.org/10.1145/3337821.3337903","url":null,"abstract":"Interconnection networks are one of the main limiting factors when it comes to scale out computing systems. In this paper, we explore what role the hybridization of topologies has on the design of an state-of-the-art exascale-capable computing system. More precisely we compare several hybrid topologies and compare with common single-topology ones when dealing with large-scale applicationlike traffic. In addition we explore how different aspects of the hybrid topology can affect the overall performance of the system. In particular, we found that hybrid topologies can outperform state-of-the-art torus and fattree networks as long as the density of connections is high enough--one connection every two or four nodes seems to be the sweet spot--and the size of the subtori is limited to a few nodes per dimension. Moreover, we explored two different alternatives to use in the upper tiers of the interconnect, a fattree and a generalised hypercube, and found little difference between the topologies, mostly depending on the workload to be executed.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129740513","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fahim Chowdhury, Yue Zhu, T. Heer, Saul Paredes, A. Moody, R. Goldstone, K. Mohror, Weikuan Yu
Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage, and scalable performance. Emerging Deep Learning (DL) applications impose new I/O and storage requirements on HPC systems, with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has attracted attention from both research and industry because of its performance, scalability, and ease of use. Emphasizing a systematic performance analysis of BeeGFS, in this paper we present its architectural and system features and perform an experimental evaluation using cutting-edge I/O, metadata, and DL application benchmarks. In particular, we use the AlexNet and ResNet-50 models for classification of the ImageNet dataset with the Livermore Big Artificial Neural Network Toolkit (LBANN), and an ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides useful documentation on how to leverage BeeGFS for emerging DL applications.
{"title":"I/O Characterization and Performance Evaluation of BeeGFS for Deep Learning","authors":"Fahim Chowdhury, Yue Zhu, T. Heer, Saul Paredes, A. Moody, R. Goldstone, K. Mohror, Weikuan Yu","doi":"10.1145/3337821.3337902","DOIUrl":"https://doi.org/10.1145/3337821.3337902","url":null,"abstract":"Parallel File Systems (PFSs) are frequently deployed on leadership High Performance Computing (HPC) systems to ensure efficient I/O, persistent storage and scalable performance. Emerging Deep Learning (DL) applications incur new I/O and storage requirements to HPC systems with batched input of small random files. This mandates PFSs to have commensurate features that can meet the needs of DL applications. BeeGFS is a recently emerging PFS that has grabbed the attention of the research and industry world because of its performance, scalability and ease of use. While emphasizing a systematic performance analysis of BeeGFS, in this paper, we present the architectural and system features of BeeGFS, and perform an experimental evaluation using cutting-edge I/O, Metadata and DL application benchmarks. Particularly, we have utilized AlexNet and ResNet-50 models for the classification of ImageNet dataset using the Livermore Big Artificial Neural Network Toolkit (LBANN), and ImageNet data reader pipeline atop TensorFlow and Horovod. Through extensive performance characterization of BeeGFS, our study provides a useful documentation on how to leverage BeeGFS for the emerging DL applications.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125847361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Software Defined Networking (SDN) enables flexible flow control by deploying fine-grained rules in OpenFlow switches. Modern commodity switches usually use TCAM to store these rules and perform high-speed parallel lookups. Though efficient, TCAM capacity is limited because TCAM is costly and power-hungry. The explosive growth in the number of rules has exacerbated this limitation. There have been considerable efforts to implement hybrid flow tables with both TCAM and RAM, where the high-speed TCAM is used as a cache that stores the most popular rules and the cheap RAM is used to handle cache misses. The primary challenges in designing hybrid TCAM/RAM flow tables lie in how to improve the cache hit rate and how to handle wildcard rule dependency when allocating rules between TCAM and RAM. In this paper, we present the design and evaluation of CuCa, a practical and efficient rule caching scheme for hybrid switches. Different from existing schemes, CuCa offers both offline and online algorithms for rule caching, corresponding to the proactive and reactive approaches to OpenFlow rule installation. By designing a two-stage cache architecture in TCAM, CuCa can handle rule dependency efficiently and provide remarkable performance improvements. Simulation and real-world experiment results reveal that CuCa improves the average TCAM hit rate by 38.7% compared to state-of-the-art schemes and by over 33% compared to the default caching algorithm of a commodity OpenFlow switch.
{"title":"A Tale of Two (Flow) Tables: Demystifying Rule Caching in OpenFlow Switches","authors":"Rui Li, Yu Pang, Jin Zhao, Xin Wang","doi":"10.1145/3337821.3337896","DOIUrl":"https://doi.org/10.1145/3337821.3337896","url":null,"abstract":"Software Defined Networking (SDN) enables flexible flow control by deploying fine-grained rules in OpenFlow switches. Modern commodity switches usually use TCAM to store these rules and perform high-speed parallel lookups. Though efficient, the TCAM capacity is limited because TCAM is expensive in cost and power-hungry. The explosive growth in the number of rules has exacerbated the limitation of TCAM. There have been considerable efforts in implementing hybrid flow tables with both TCAM and RAM, where the high-speed TCAM is regarded as a cache to store the most popular rules and the cheap RAM is used to handle cache miss. The primary challenges for designing hybrid TCAM/RAM flow tables lie in how to improve cache hit rate and how to handle wildcard rule dependency when allocating rules between TCAM and RAM. In this paper, we present the design and evaluation of CuCa, a practical and efficient rule caching scheme for hybrid switches. Different from existing schemes, CuCa offers both offline and online algorithms for rule caching, corresponding to the proactive and reactive approaches to OpenFlow rule installation. By designing a two-stage-cache architecture in TCAM, CuCa can handle rule dependency efficiently and provide remarkable performance improvements. Simulation and real-world experiment results reveal that CuCa improves average TCAM hit rate by 38.7% compared to state-of-the-art schemes and by over 33% compared to the default caching algorithm of a commodity OpenFlow switch.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123893794","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Auto-vectorization techniques have been adopted by compilers for decades to exploit data-level parallelism. However, as processor architectures have continued to add new features that improve vector/SIMD performance, legacy application binaries fail to fully exploit the vector/SIMD capabilities of modern architectures. For example, legacy ARMv7 binaries cannot benefit from the ARMv8 SIMD double-precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) for converting non-vectorized loops to vector/SIMD forms, in order to achieve the greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from the application binaries so that vectorization can be carried out at runtime. Experimental results show that our approach achieves an average speedup of 1.42x compared to native ARMv7 execution across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.
{"title":"Exploiting Vector Processing in Dynamic Binary Translation","authors":"Chih-Min Lin, Sheng-Yu Fu, Ding-Yong Hong, Yu-Ping Liu, Jan-Jan Wu, W. Hsu","doi":"10.1145/3337821.3337844","DOIUrl":"https://doi.org/10.1145/3337821.3337844","url":null,"abstract":"Auto vectorization techniques have been adopted by compilers to exploit data-level parallelism in parallel processing for decades. However, since processor architectures have kept enhancing with new features to improve vector/SIMD performance, legacy application binaries failed to fully exploit new vector/SIMD capabilities in modern architectures. For example, legacy ARMv7 binaries cannot benefit from ARMv8 SIMD double precision capability, and legacy x86 binaries cannot enjoy the power of AVX-512 extensions. In this paper, we study the fundamental issues involved in cross-ISA Dynamic Binary Translation (DBT) to convert non-vectorized loops to vector/SIMD forms to achieve greater computation throughput available in newer processor architectures. The key idea is to recover critical loop information from those application binaries in order to carry out vectorization at runtime. Experiment results show that our approach achieves an average speedup of 1.42x compared to ARMv7 native run across various benchmarks in an ARMv7-to-ARMv8 dynamic binary translation system.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123972308","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Malony, Srinivasan Ramesh, K. Huck, Nicholas Chaimov, S. Shende
Several robust performance systems have been created for parallel machines, with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective of supporting measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and the runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, the requirements for performance observation and awareness are expected to change. For instance, heterogeneous machines introduce new types of performance data to capture and new performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policy-based control. The problem is that an existing performance system architecture may be constrained in its ability to evolve to meet these new requirements. This paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model both to capture existing capabilities in TAU and to extend its functionality in ways not necessarily conceived in its original design. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments highlight that efficiency and robustness can be maintained, while new flexibility and programmability are offered that leverage the power of the core TAU system and allow significant and compelling extensions to be realized.
{"title":"A Plugin Architecture for the TAU Performance System","authors":"A. Malony, Srinivasan Ramesh, K. Huck, Nicholas Chaimov, S. Shende","doi":"10.1145/3337821.3337916","DOIUrl":"https://doi.org/10.1145/3337821.3337916","url":null,"abstract":"Several robust performance systems have been created for parallel machines with the ability to observe diverse aspects of application execution on different hardware platforms. All of these are designed with the objective to support measurement methods that are efficient, portable, and scalable. For these reasons, the performance measurement infrastructure is tightly embedded with the application code and runtime execution environment. As parallel software and systems evolve, especially towards more heterogeneous, asynchronous, and dynamic operation, it is expected that the requirements for performance observation and awareness will change. For instance, heterogeneous machines introduce new types of performance data to capture and performance behaviors to characterize. Furthermore, there is a growing interest in interacting with the performance infrastructure for in situ analytics and policy-based control. The problem is that an existing performance system architecture could be constrained in its ability to evolve to meet these new requirements. The paper reports our research efforts to address this concern in the context of the TAU Performance System. In particular, we consider the use of a powerful plugin model to both capture existing capabilities in TAU and to extend its functionality in ways it was not necessarily conceived originally. The TAU plugin architecture supports three types of plugin paradigms: EVENT, TRIGGER, and AGENT. We demonstrate how each operates under several different scenarios. Results from larger-scale experiments are shown to highlight the fact that efficiency and robustness can be maintained, while new flexibility and programmability can be offered that leverages the power of the core TAU system while allowing significant and compelling extensions to be realized.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117283700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We develop and evaluate parallel algorithms for a fundamental problem in numerical computing, namely the evaluation of a polynomial of a matrix. The algorithm consists of many building blocks that can be assembled in several ways. We investigate parallelism in the individual building blocks, develop parallel implementations, and assemble them into an overall parallel algorithm. We analyze the effects of both the dimension of the matrix and the degree of the polynomial on arithmetic complexity and on parallelism, and we consequently propose which variants to use in different cases. Our theoretical results indicate that one variant of the algorithm, based on applying the Paterson-Stockmeyer method to the entire matrix, parallelizes very effectively on virtually any matrix dimension and polynomial degree. However, it is not the most efficient from the arithmetic-complexity viewpoint. Another algorithm, based on the Davies-Higham block recurrence, is much more efficient from that viewpoint, but one of its building blocks is serial. Experimental results on a dual-socket 28-core server show that the first algorithm can effectively use all the cores, but that on high-degree polynomials the second algorithm is often faster, in spite of its sequential phase. This indicates that our parallel algorithms for the other phases are indeed effective.
{"title":"Parallel Algorithms for Evaluating Matrix Polynomials","authors":"Sivan Toledo, Amit Waisel","doi":"10.1145/3337821.3337871","DOIUrl":"https://doi.org/10.1145/3337821.3337871","url":null,"abstract":"We develop and evaluate parallel algorithms for a fundamental problem in numerical computing, namely the evaluation of a polynomial of a matrix. The algorithm consists of many building blocks that can be assembled in several ways. We investigate parallelism in individual building blocks, develop parallel implemenations, and assemble them into an overall parallel algorithm. We analyze the effects of both the dimension of the matrix and the degree of the polynomial on both arithmetic complexity and on parallelism, and we consequently propose which variants use in different cases. Our theoretical results indicate that one variant of the algorithm, based on applying the Paterson-Stockmeyer method to the entire matrix, parallelizes very effectively on virtually any matrix dimension and polynomial degree. However, it is not the most efficient from the arithmetic complexity viewpoint. Another algorithm, based on the Davies-Higham block recurrence is much more efficient from the arithmetic complexity viewpoint, but one of its building blocks is serial. Experimental results on a dual-socket 28-core server show that the first algorithm can effectively use all the cores, but that on high-degree polynomials the second algorithm is often faster, in spite of the sequential phase. This indicates that our parallel algorithms for the other phases are indeed effective.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117324691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}