A Deep Recurrent Neural Network Based Predictive Control Framework for Reliable Distributed Stream Data Processing
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00036
Jielong Xu, Jian Tang, Zhiyuan Xu, Chengxiang Yin, K. Kwiat, C. Kamhoua
In this paper, we present the design, implementation, and evaluation of a novel predictive control framework for reliable distributed stream data processing, which features a Deep Recurrent Neural Network (DRNN) model for performance prediction and dynamic grouping for flexible control. Specifically, we present a novel DRNN model that makes accurate performance predictions from multilevel runtime statistics while carefully accounting for interference among co-located worker processes. Moreover, we design a new grouping method, dynamic grouping, which can distribute or re-distribute data tuples to downstream tasks according to any given split ratio on the fly; it can therefore be used to redirect data tuples around misbehaving workers. We implemented the proposed framework on Storm, a widely used Distributed Stream Data Processing System (DSDPS). For validation and performance evaluation, we developed two representative stream data processing applications: Windowed URL Count and Continuous Queries. Extensive experimental results show that: 1) the proposed DRNN model outperforms widely used baselines, ARIMA and SVR, in prediction accuracy; 2) dynamic grouping works as expected; and 3) the proposed framework enhances reliability, incurring only minor performance degradation in the presence of misbehaving workers.
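The abstract describes dynamic grouping only at a high level; a minimal Python sketch of the split-ratio idea, with hypothetical worker names and an illustrative API (not the paper's actual interface), might look like this:

```python
import random

class DynamicGrouping:
    """Route tuples to downstream tasks according to a split ratio that
    can be changed on the fly (names and API are illustrative only)."""

    def __init__(self, tasks, ratios):
        self.set_ratios(tasks, ratios)

    def set_ratios(self, tasks, ratios):
        # Normalize so the ratios sum to 1; a misbehaving worker can be
        # bypassed by setting its ratio to 0 and re-normalizing.
        total = sum(ratios)
        self.tasks = list(tasks)
        self.weights = [r / total for r in ratios]

    def choose(self, tuple_):
        # The tuple content is not inspected; routing is randomized so
        # that the split ratio holds in expectation over the stream.
        return random.choices(self.tasks, weights=self.weights, k=1)[0]

# Usage: start with an even split over three workers...
g = DynamicGrouping(["worker-0", "worker-1", "worker-2"], [1, 1, 1])
# ...then redirect traffic away from a worker predicted to misbehave.
g.set_ratios(["worker-0", "worker-1", "worker-2"], [0.5, 0.5, 0.0])
```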
{"title":"A Deep Recurrent Neural Network Based Predictive Control Framework for Reliable Distributed Stream Data Processing","authors":"Jielong Xu, Jian Tang, Zhiyuan Xu, Chengxiang Yin, K. Kwiat, C. Kamhoua","doi":"10.1109/IPDPS.2019.00036","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00036","url":null,"abstract":"In this paper, we present design, implementation and evaluation of a novel predictive control framework to enable reliable distributed stream data processing, which features a Deep Recurrent Neural Network (DRNN) model for performance prediction, and dynamic grouping for flexible control. Specifically, we present a novel DRNN model, which makes accurate performance prediction with careful consideration for interference of co-located worker processes, according to multilevel runtime statistics. Moreover, we design a new grouping method, dynamic grouping, which can distribute/re-distribute data tuples to downstream tasks according to any given split ratio on the fly. So it can be used to re-direct data tuples to bypass misbehaving workers. We implemented the proposed framework based on a widely used Distributed Stream Data Processing System (DSDPS), Storm. For validation and performance evaluation, we developed two representative stream data processing applications: Windowed URL Count and Continuous Queries. Extensive experimental results show: 1) The proposed DRNN model outperforms widely used baseline solutions, ARIMA and SVR, in terms of prediction accuracy; 2) dynamic grouping works as expected; and 3) the proposed framework enhances reliability by offering minor performance degradation with misbehaving workers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121988347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00045
J. Hashmi, S. Chakraborty, Mohammadreza Bayatpour, H. Subramoni, D. Panda
Derived datatypes are commonly used in MPI applications to exchange non-contiguous data among processes. However, state-of-the-art MPI libraries do not offer efficient processing of derived datatypes and often rely on packing and unpacking the data at the sender and receiver processes. This approach incurs the cost of extra copies and increases overall communication latency. While zero-copy communication schemes have been proposed for contiguous data, applying such techniques to non-contiguous data transfers raises several new challenges. In this work, we address these challenges and propose FALCON — Fast and Low-overhead Communication designs for intra-node processing of MPI derived datatypes. We show that the memory-layout translation of derived datatypes introduces significant overheads in the communication path, and we propose novel solutions to mitigate these bottlenecks. We also find that the current MPI datatype routines cannot fully exploit zero-copy mechanisms, and we propose enhancements to the MPI standard to address these limitations. Experimental evaluations show that our proposed designs improve intra-node communication latency and bandwidth by up to 3x over state-of-the-art MPI libraries. We also evaluate our designs with communication kernels of popular scientific applications such as MILC, WRF, NAS MG, and 3D-Stencil on three different multi-/many-core architectures and show up to 5.5x improvement over the state-of-the-art designs employed by production MPI libraries.
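For readers unfamiliar with derived datatypes, the following mpi4py sketch shows the kind of non-contiguous transfer the paper targets: sending a matrix column described by a vector datatype instead of packing it into a temporary buffer. It illustrates the standard MPI API, not FALCON's zero-copy design:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rows, cols = 4, 8
# A column of a row-major matrix is non-contiguous: `rows` doubles,
# each `cols` doubles apart. A vector datatype describes that layout.
column = MPI.DOUBLE.Create_vector(rows, 1, cols).Commit()

if rank == 0:
    a = np.arange(rows * cols, dtype="d").reshape(rows, cols)
    # Send column 0 directly from the matrix; without the derived
    # datatype, the library (or the user) would first pack it into a
    # contiguous buffer -- exactly the extra copy FALCON targets.
    comm.Send([a, 1, column], dest=1, tag=0)
elif rank == 1:
    col = np.empty(rows, dtype="d")
    comm.Recv([col, rows, MPI.DOUBLE], source=0, tag=0)
    print(col)  # [ 0.  8. 16. 24.]

column.Free()
```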
{"title":"FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures","authors":"J. Hashmi, S. Chakraborty, Mohammadreza Bayatpour, H. Subramoni, D. Panda","doi":"10.1109/IPDPS.2019.00045","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00045","url":null,"abstract":"Derived datatypes are commonly used in MPI applications to exchange non-contiguous data among processes. However, state-of-the-art MPI libraries do not offer efficient processing of derived datatypes and often rely on packing and unpacking the data at the sender and the receiver processes. This approach incurs the cost of extra copies and increases overall communication latency. While zero-copy communication schemes have been proposed for contiguous data, applying such techniques to non-contiguous data transfers bring forth several new challenges. In this work, we address these challenges and propose FALCON — Fast and Low-overhead Communication designs for intra-node MPI derived datatypes processing. We show that the memory layouts translation of derived datatypes introduce significant overheads in the communication path and propose novel solutions to mitigate such bottlenecks. We also find that the current MPI datatype routines cannot fully take advantage of the zero-copy mechanisms, and propose enhancements to the MPI standard to address these limitations. The experimental evaluations show that our proposed designs achieve up to 3 times improved intra-node communication latency and bandwidth over state-of-the-art MPI libraries. We also evaluate our designs with communication kernels of popular scientific applications such as MILC, WRF, NAS MG, and 3D-Stencil on three different multi-/many-core architectures and show up to 5.5 times improvement over state-of-the-art designs employed by production MPI libraries.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125613152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAFIRE: Scalable and Accurate Fault Injection for Parallel Multithreaded Applications
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00097
G. Georgakoudis, I. Laguna, H. Vandierendonck, Dimitrios S. Nikolopoulos, M. Schulz
Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique for understanding the impact of faults on scientific applications. However, injecting faults into parallel applications has been prohibitively slow, inaccurate, and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can differ significantly depending on whether the fault occurs in the application itself or in the parallel runtime system. In an experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is several times faster than, and equally accurate to, state-of-the-art dynamic binary instrumentation tools for fault injection.
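The abstract does not detail the fault model; a common one in soft-error studies is a single bit flip in an instruction's result. A minimal Python sketch of that model (illustrative only, not SAFIRE's compiler-level instrumentation) follows:

```python
import random

def flip_random_bit(value, bits=64):
    """Emulate a single-event upset: flip one uniformly chosen bit of
    an integer value. This only illustrates the bit-flip fault model,
    not SAFIRE's actual instrumentation."""
    return value ^ (1 << random.randrange(bits))

def maybe_inject(value, p_fault=1e-6):
    # Instrumentation conceptually wraps each selected instruction's
    # result with a check like this one.
    if random.random() < p_fault:
        return flip_random_bit(value)
    return value
```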
{"title":"SAFIRE: Scalable and Accurate Fault Injection for Parallel Multithreaded Applications","authors":"G. Georgakoudis, I. Laguna, H. Vandierendonck, Dimitrios S. Nikolopoulos, M. Schulz","doi":"10.1109/IPDPS.2019.00097","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00097","url":null,"abstract":"Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique to understand the impact of faults on scientific applications. However, injecting faults in parallel applications has been prohibitively slow, inaccurate and hard to implement. In this paper, we present, the first fast and accurate fault injection framework for parallel, multi-threaded applications. uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using, we show that fault manifestations can be significantly different depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that is multiple factors faster and equally accurate in comparison with state-of-the-art dynamic binary instrumentation tools for fault injection.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114234216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00012
A. Azad, A. Buluç
Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various models of parallel computation. This paper presents a parallel connected-components algorithm that runs on distributed-memory computers. Our algorithm uses linear-algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach. We show that the resulting algorithm, named LACC for Linear Algebraic Connected Components, outperforms competitors by a factor of up to 12x on small- to medium-scale graphs. For large graphs with more than 50B edges, LACC scales to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperforms previous algorithms by a significant margin. This performance is achieved by (1) exploiting sparsity that was not present in the original PRAM formulation, (2) using high-performance primitives of the Combinatorial BLAS, and (3) identifying hot spots and optimizing them away using algorithmic insights.
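As a rough illustration of the linear-algebraic formulation, the serial Python sketch below finds components by min-label propagation with shortcutting, the hooking/pointer-jumping pattern behind Awerbuch-Shiloach; LACC's distributed CombBLAS implementation is, of course, far more involved:

```python
import numpy as np
import scipy.sparse as sp

def connected_components(A):
    """Min-label propagation with shortcutting over a sparse adjacency
    matrix: a serial sketch of the linear-algebraic idea, not LACC's
    distributed implementation."""
    coo = sp.coo_matrix(A)
    labels = np.arange(A.shape[0])
    while True:
        # Hooking: every vertex adopts the smallest label on any edge.
        cand = labels.copy()
        np.minimum.at(cand, coo.row, labels[coo.col])
        # Shortcutting (pointer jumping): compress label chains.
        cand = np.minimum(cand, cand[cand])
        if np.array_equal(cand, labels):
            return labels
        labels = cand

# Example: two triangles -> two components.
row, col = [0, 1, 2, 3, 4, 5], [1, 2, 0, 4, 5, 3]
A = sp.coo_matrix((np.ones(6), (row, col)), shape=(6, 6))
A = A + A.T  # undirected graph
print(connected_components(A))  # [0 0 0 3 3 3]
```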
{"title":"LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory","authors":"A. Azad, A. Buluç","doi":"10.1109/IPDPS.2019.00012","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00012","url":null,"abstract":"Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various different models of parallel computation. This paper presents a parallel connected-components algorithm that can run on distributed-memory computers. Our algorithm uses linear algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach. We show that the resulting algorithm, named LACC for Linear Algebraic Connected Components, outperforms competitors by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC scales to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperforms previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124531620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
mmWave Wireless Backhaul Scheduling of Stochastic Packet Arrivals
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00079
P. Garncarek, T. Jurdzinski, D. Kowalski, Miguel A. Mosteiro
Millimeter wave communication (mmWave) allows high-speed access to the radio channel. Given the highly directional nature of mmWave, dense deployments can be implemented with a macro base station serving many micro base stations, rather than connecting micro base stations directly to the core network as in legacy cellular systems. Moreover, micro base stations may cooperate in relaying packets to other micro base stations. Relays and spatial reuse speed up communication, but increase the complexity of scheduling. In this work, we study the mmWave wireless backhaul scheduling problem in the described architecture, assuming stochastic arrival of packets at the macro base station to be delivered to micro base stations. We present various results concerning system stability, defined as bounded expected queue sizes at the macro and micro base stations, under different patterns of random traffic. In particular, we show that almost all admissible arrival patterns can be handled by some universally stable algorithm, while non-admissible arrival patterns do not allow stability for any algorithm.
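To make the stability notion concrete, here is a toy single-queue simulation under an assumed Bernoulli arrival model (the paper's relay and spatial-reuse model, and its admissibility conditions, are far richer): the queue stays bounded exactly when the total arrival rate is below the service rate.

```python
import random

def peak_queue(rounds=200_000, n_flows=4, p=0.2):
    """The macro base station relays one packet per round; each of
    n_flows micro stations receives a packet with probability p per
    round, so the total arrival rate is n_flows * p. Illustrative only."""
    queue = peak = 0
    for _ in range(rounds):
        queue += sum(random.random() < p for _ in range(n_flows))
        if queue:
            queue -= 1  # serve one packet per round
        peak = max(peak, queue)
    return peak

print(peak_queue(p=0.2))  # rate 0.8 < 1: queue stays small (stable)
print(peak_queue(p=0.3))  # rate 1.2 > 1: queue grows with time
```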
{"title":"mmWave Wireless Backhaul Scheduling of Stochastic Packet Arrivals","authors":"P. Garncarek, T. Jurdzinski, D. Kowalski, Miguel A. Mosteiro","doi":"10.1109/IPDPS.2019.00079","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00079","url":null,"abstract":"Millimeter wave communication (mmWave) allows high-speed access to the radio channel. Given the highly-directional nature of mmWave, dense deployments can be implemented with a macro base station serving many micro base stations, rather than connecting micro base stations directly to the core network as in legacy cellular systems. Moreover, micro base stations may cooperate in relaying packets to other micro base stations. Relays and spatial reuse speed up communication, but increase the complexity of scheduling. In this work, we study the mmWave wireless backhaul scheduling problem in the described architecture, assuming stochastic arrival of packets at the macro base station to be delivered to micro base stations. We present various results concerning system stability, defined as a bounded expected queue sizes of macro base station and micro base stations, under different patterns of random traffic. In particular, that almost all admissible arrival patterns could be handled by some universally stable algorithms, while non-admissible arrival patterns do not allow stability for any algorithm.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126421067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two Elementary Instructions Make Compare-and-Swap
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00046
P. Khanchandani, Roger Wattenhofer
The consensus number of an object is the maximum number of processes among which binary consensus can be solved using any number of instances of the object and read-write registers. Herlihy [1] showed in his seminal work that if an object has a consensus number of n, then its instances can be used to implement any non-trivial object or data structure shared among n processes, such that the implementation is wait-free and linearizable. Thus, an object such as compare-and-set, with an infinite consensus number, is "advanced" because its instances can be used to implement any non-trivial concurrent object shared among any number of processes. On the other hand, objects such as fetch-and-add or fetch-and-multiply have a consensus number of two and are "elementary". An important consequence of Herlihy's result was that any number of reasonable elementary objects is provably insufficient to implement an advanced object like compare-and-set. However, Ellen et al. [2] recently observed that real multiprocessors do not compute using objects but using instructions applied to memory locations. Using this observation, they showed that a couple of elementary instructions on the same memory location can implement an advanced one, and consequently any non-trivial object or data structure. However, this result only establishes possibility: it uses a generic universal construction as a black box, which is not how objects are implemented in practice, as the generic construction is quite inefficient in the number of steps taken by a process and the number of shared objects used in the worst case. Instead, efficient implementations are built upon the widely supported compare-and-set instruction, and one cannot conclude from the previous result whether elementary instructions can yield implementations as efficient as those based on compare-and-set, or whether they are fundamentally limited in this respect. In this paper, we answer this question by giving a wait-free and linearizable implementation of compare-and-set using just two elementary instructions, half-max and max-write. The implementation takes O(1) steps per process and uses O(1) shared objects per process. Thus, any known or unknown compare-and-set-based implementation can also be realized using only these two elementary instructions without any loss in efficiency. An interesting aspect of these elementary instructions is that, depending on the underlying system, their throughput in a highly concurrent setting is larger than that of the compare-and-set instruction by a factor proportional to n.
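For reference, the sequential specification of the compare-and-set object being implemented can be pinned down as below. This lock-based Python version only fixes the semantics: it is blocking, not the paper's wait-free construction, and the abstract does not specify the semantics of half-max and max-write, so no attempt is made to reproduce them here.

```python
import threading

class CompareAndSet:
    """Sequential specification of compare-and-set: atomically install
    `new` iff the current value equals `expected`. A lock stands in for
    atomicity; the paper builds this wait-free from two elementary
    instructions instead."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False
```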
{"title":"Two Elementary Instructions Make Compare-and-Swap","authors":"P. Khanchandani, Roger Wattenhofer","doi":"10.1109/IPDPS.2019.00046","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00046","url":null,"abstract":"The consensus number of an object is the maximum number of processes among which binary consensus can be solved using any number of instances of the object and read-write registers. Herlihy [1] showed in his seminal work that if an object has a consensus number of n, then its instances can be used to implement any non-trivial object or data structure that is shared among n processes, so that the implementation is wait-free and linearizable. Thus, an object such as compare-and-set with an infinite consensus number is \"advanced\" because its instances can be used to implement any non-trivial concurrent object shared among any number of processes. On the other hand, objects such as fetch-and-add or fetch-and-multiply have a consensus number of two and are \"elementary\". An important consequence of Herlihy's result was that any number of reasonable elementary objects are provably insufficient to implement an advanced object like compare-and-set. However, Ellen et al. [2] observed recently that real multiprocessors do not compute using objects but using instructions that are applied on memory locations. Using this observation, they show that it is possible to use a couple of elementary instructions on the same memory location to implement an advanced one, and consequently any non-trivial object or data structure. However, the above result is only a possibility and uses a generic universal construction as a black-box, which is not how we implement objects in practice, as the generic construction is quite inefficient with respect to the number of steps taken by a process and the number of shared objects used in the worst case. Instead, the efficient implementations are built upon the widely supported compare-and-set instruction and one cannot conclude from the previous result whether the elementary instructions can also produce equally efficient implementations like compare-and-set does or they are fundamentally limited in this respect. In this paper, we answer this question by giving a wait-free and linearizable implementation of compare-and-set using just two elementary instructions, half-max and max-write. The implementation takes O(1) steps per process and uses O(1) shared objects per process. Thus, any known or unknown compare-and-set based implementation can also be done using only two elementary instructions without any loss in efficiency. An interesting aspect of these elementary instructions is that depending on the underlying system, their throughput in a highly concurrent setting is larger than that of the compare-and-set instructions by a factor proportional to n.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116474686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Drowsy-DC: Data Center Power Management System
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00091
Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont, Baptiste Lepers, W. Zwaenepoel
In a modern data center (DC), the large majority of costs arises from energy consumption. The most popular technique for mitigating this issue in a virtualized DC is virtual machine (VM) consolidation. Although consolidation may increase server utilization by about 5-10%, server loads greater than 50% are rarely observed in practice. By analyzing traces from our cloud provider partner, and as confirmed by previous research, we have identified that some VMs have sporadic periods of data computation followed by long intervals of idleness. These VMs often prevent the consolidation system from further increasing the energy efficiency of the DC. In this paper we propose a novel DC power management system called Drowsy-DC, which identifies VMs that have similar periods of idleness. These VMs are then colocated on the same server so that their shared idle periods can be exploited to put the server into a low-power mode (suspend to RAM) until data computation is again required. With negligible overhead, our system can improve any VM consolidation system (by up to 81% for OpenStack Neat).
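One way to make "similar periods of idleness" concrete is an overlap score over observed idle intervals; the sketch below is purely illustrative and not the paper's actual detection or placement algorithm:

```python
def idle_overlap(idle_a, idle_b, horizon):
    """Fraction of the horizon during which both VMs are idle -- a
    simple affinity score a placer could use to colocate VMs whose
    idle periods align, so the shared idleness lets the host suspend
    to RAM. Idle periods are (start, end) tuples; illustrative only."""
    both = 0
    for t in range(horizon):
        in_a = any(s <= t < e for s, e in idle_a)
        in_b = any(s <= t < e for s, e in idle_b)
        both += in_a and in_b
    return both / horizon

# Two VMs idle overnight (minutes of a day): high overlap makes them
# good candidates for the same host.
vm1 = [(0, 360), (1200, 1440)]
vm2 = [(0, 300), (1260, 1440)]
print(idle_overlap(vm1, vm2, 1440))  # 480/1440 = 0.333...
```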
{"title":"Drowsy-DC: Data Center Power Management System","authors":"Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont, Baptiste Lepers, W. Zwaenepoel","doi":"10.1109/IPDPS.2019.00091","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00091","url":null,"abstract":"In a modern data center (DC), the large majority of costs arise from the energy consumption. The most popular technique used to mitigate this issue in a virtualized DC is the virtual machine (VM) consolidation. Although the latter may increase server utilization by about 5-10%, it is difficult to actually notice server loads greater than 50%. By analyzing the traces from our cloud provider partner, confirmed by previous research work, we have identified that some VMs have sporadic periods of data computation followed by large intervals of idleness. These VMs often hinder the consolidation system to further increase the energy efficiency of the DC. In this paper we propose a novel DC power management system called Drowsy-DC, which is able to identify the aforementioned VMs that have similar periods of idleness. Further, these VMs are colocated on the same server so that their idle periods are exploited to put the server to a low power mode (suspend to RAM) until some data computation is required. By introducing a negligible overhead, our system is able to improve any VM consolidation system (up to 81% for OpenStack Neat).","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121624240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power and Performance Tradeoffs for Visualization Algorithms
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00042
Stephanie Labasan, Matthew Larsen, H. Childs, B. Rountree
One of the biggest challenges for leading-edge supercomputers is power usage. Looking forward, power is expected to become an increasingly limited resource, so it is critical to understand the runtime behaviors of applications in this constrained environment in order to use power wisely. Within this context, we explore the tradeoffs between power and performance specifically for visualization algorithms. With respect to execution behavior under a power limit, visualization algorithms differ from traditional HPC applications, like scientific simulations, because visualization is more data intensive. This data intensive characteristic lends itself to alternative strategies regarding power usage. In this study, we focus on a representative set of visualization algorithms, and explore their power and performance characteristics as a power bound is applied. The result is a study that identifies how future research efforts can exploit the execution characteristics of visualization applications in order to optimize performance under a power bound.
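As a concrete example of the kind of measurement such a study rests on, the sketch below reads the Linux powercap/RAPL energy counter around a run. Paths and permissions vary by machine, and this is an assumed setup rather than the authors' harness:

```python
import time
from pathlib import Path

# powercap/RAPL sysfs node for CPU package 0; the exact path and its
# availability vary by machine, and reading may require permissions.
RAPL = Path("/sys/class/powercap/intel-rapl:0")

def read_energy_uj():
    # Cumulative package energy in microjoules.
    return int((RAPL / "energy_uj").read_text())

def average_watts(fn):
    """Average package power while running fn -- the kind of measurement
    behind power/performance curves. Sketch only: it ignores counter
    wraparound, and *enforcing* a power bound would instead write
    constraint_0_power_limit_uw, which requires root."""
    e0, t0 = read_energy_uj(), time.monotonic()
    fn()
    e1, t1 = read_energy_uj(), time.monotonic()
    return (e1 - e0) / 1e6 / (t1 - t0)

# Example: power draw of a CPU-heavy stand-in for a visualization kernel.
print(average_watts(lambda: sum(i * i for i in range(10**7))))
```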
{"title":"Power and Performance Tradeoffs for Visualization Algorithms","authors":"Stephanie Labasan, Matthew Larsen, H. Childs, B. Rountree","doi":"10.1109/IPDPS.2019.00042","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00042","url":null,"abstract":"One of the biggest challenges for leading-edge supercomputers is power usage. Looking forward, power is expected to become an increasingly limited resource, so it is critical to understand the runtime behaviors of applications in this constrained environment in order to use power wisely. Within this context, we explore the tradeoffs between power and performance specifically for visualization algorithms. With respect to execution behavior under a power limit, visualization algorithms differ from traditional HPC applications, like scientific simulations, because visualization is more data intensive. This data intensive characteristic lends itself to alternative strategies regarding power usage. In this study, we focus on a representative set of visualization algorithms, and explore their power and performance characteristics as a power bound is applied. The result is a study that identifies how future research efforts can exploit the execution characteristics of visualization applications in order to optimize performance under a power bound.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127666015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPDPS 2019 Organization
Pub Date: 2019-05-01 | DOI: 10.1109/ipdps.2019.00009
Vinod E. F. Rebello, A. Melo
WORKSHOPS COMMITTEE: Olivier Beaumont (Inria Bordeaux Sud-Ouest, France); Sunita Chandrasekaran (University of Delaware, USA); Ananth Kalyanaraman (Washington State University, USA); Cynthia A. Philips (Sandia National Laboratories, USA); Sivasankaran Rajamanickam (Sandia National Laboratories, USA); Min Si (Argonne National Laboratory, USA); Alan Sussman (University of Maryland, USA); Bora Ucar (CNRS, France)
{"title":"IPDPS 2019 Organization","authors":"Vinod E. F. Rebello, A. Melo","doi":"10.1109/ipdps.2019.00009","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00009","url":null,"abstract":"WORKSHOPS COMMITTEE Olivier Beaumont (Inria Bordeaux Sud-Ouest, France) Sunita Chandrasekaran (University of Delaware, USA) Ananth Kalyanaraman (Washington State University, USA) Cynthia A. Philips (Sandia National Laboratories, USA) Sivasankaran Rajamanickam (Sandia National Laboratories, USA) Min Si (Argonne National Laboratory, USA) Alan Sussman (University of Maryland, USA) Bora Ucar (CNRS, France)","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"35 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131151585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00061
Woong Shin, Christopher Brumgard, Bing Xie, Sudharshan S. Vazhkudai, D. Ghoshal, S. Oral, L. Ramakrishnan
We present the design and implementation of Data Jockey, a data management system for HPC multi-tiered storage systems. As a centralized data management control plane, Data Jockey automates bulk data movement and placement for scientific workflows and integrates into existing HPC storage infrastructures. Data Jockey simplifies data management by eliminating the human effort of programming complex data movements and of placing datasets across multiple storage tiers to support complex workflows, which in turn increases the usability of the multi-tiered storage systems emerging in modern HPC data centers. Specifically, Data Jockey introduces a new data management scheme called "goal driven data management," which automatically infers low-level bulk data movement plans from declarative high-level goal statements produced over the lifetime of iterative runs of scientific workflows. In doing so, Data Jockey aims to minimize data wait times by taking responsibility for datasets that are unused or about to be used, and by aggressively utilizing the capacity of the upper, higher-performance storage tiers. We evaluated a prototype implementation of Data Jockey under a synthetic workload based on a year's worth of operational logs from the Oak Ridge Leadership Computing Facility (OLCF). Our evaluations suggest that Data Jockey leads to higher utilization of the upper storage tiers while requiring less data-movement programming effort than human-driven, per-domain ad hoc data management scripts.
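The abstract's "goal driven data management" suggests declarative statements compiled into movement plans. The toy sketch below invents a goal format, tier names, and a planning step purely for illustration; none of it is Data Jockey's actual syntax:

```python
# Hypothetical declarative goal: where a dataset must live before and
# after a workflow step. All field names and tiers are made up.
goal = {
    "dataset": "/proj/sim/run-042/checkpoints",
    "needed_by": "workflow step 3",
    "tier": "burst-buffer",           # stage to the fast upper tier
    "after_use": "campaign-storage",  # demote when the step finishes
}

def plan(goal, current_tier):
    """Infer bulk movements from the declarative goal: stage the
    dataset up before use, demote it afterwards."""
    moves = []
    if current_tier != goal["tier"]:
        moves.append(("stage", goal["dataset"], current_tier, goal["tier"]))
    moves.append(("demote", goal["dataset"], goal["tier"], goal["after_use"]))
    return moves

print(plan(goal, current_tier="campaign-storage"))
```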
{"title":"Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems","authors":"Woong Shin, Christopher Brumgard, Bing Xie, Sudharshan S. Vazhkudai, D. Ghoshal, S. Oral, L. Ramakrishnan","doi":"10.1109/IPDPS.2019.00061","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00061","url":null,"abstract":"We present the design and implementation of Data Jockey, a data management system for HPC multi-tiered storage systems. As a centralized data management control plane, Data Jockey automates bulk data movement and placement for scientific workflows and integrates into existing HPC storage infrastructures. Data Jockey simplifies data management by eliminating human effort in programming complex data movements, laying datasets across multiple storage tiers when supporting complex workflows, which in turn increases the usability of multi-tiered storage systems emerging in modern HPC data centers. Specifically, Data Jockey presents a new data management scheme called \"goal driven data management\" that can automatically infer low-level bulk data movement plans from declarative high-level goal statements that come from the lifetime of iterative runs of scientific workflows. While doing so, Data Jockey aims to minimize data wait times by taking responsibility for datasets that are unused or to be used, and aggressively utilizing the capacity of the upper, higher performant storage tiers. We evaluated a prototype implementation of Data Jockey under a synthetic workload based on a year's worth of Oak Ridge Leadership Computing Facility's (OLCF) operational logs. Our evaluations suggest that Data Jockey leads to higher utilization of the upper storage tiers while minimizing the programming effort of data movement compared to human involved, per-domain ad-hoc data management scripts.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123435781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}