A Deep Recurrent Neural Network Based Predictive Control Framework for Reliable Distributed Stream Data Processing
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00036
Jielong Xu, Jian Tang, Zhiyuan Xu, Chengxiang Yin, K. Kwiat, C. Kamhoua
In this paper, we present the design, implementation, and evaluation of a novel predictive control framework for reliable distributed stream data processing, which features a Deep Recurrent Neural Network (DRNN) model for performance prediction and dynamic grouping for flexible control. Specifically, we present a novel DRNN model that makes accurate performance predictions from multilevel runtime statistics while carefully accounting for interference among co-located worker processes. Moreover, we design a new grouping method, dynamic grouping, which can distribute or re-distribute data tuples to downstream tasks according to any given split ratio on the fly; it can therefore be used to redirect data tuples around misbehaving workers. We implemented the proposed framework on Storm, a widely used Distributed Stream Data Processing System (DSDPS). For validation and performance evaluation, we developed two representative stream data processing applications: Windowed URL Count and Continuous Queries. Extensive experimental results show that: 1) the proposed DRNN model outperforms widely used baselines, ARIMA and SVR, in prediction accuracy; 2) dynamic grouping works as expected; and 3) the proposed framework enhances reliability, incurring only minor performance degradation in the presence of misbehaving workers.
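The abstract describes dynamic grouping only at a high level; a minimal Python sketch of the split-ratio idea, with hypothetical worker names and an illustrative API (not the paper's actual interface), might look like this:

```python
import random

class DynamicGrouping:
    """Route tuples to downstream tasks according to a split ratio that
    can be changed on the fly (names and API are illustrative only)."""

    def __init__(self, tasks, ratios):
        self.set_ratios(tasks, ratios)

    def set_ratios(self, tasks, ratios):
        # Normalize so the ratios sum to 1; a misbehaving worker can be
        # bypassed by setting its ratio to 0 and re-normalizing.
        total = sum(ratios)
        self.tasks = list(tasks)
        self.weights = [r / total for r in ratios]

    def choose(self, tuple_):
        # The tuple content is not inspected; routing is randomized so
        # that the split ratio holds in expectation over the stream.
        return random.choices(self.tasks, weights=self.weights, k=1)[0]

# Usage: start with an even split over three workers...
g = DynamicGrouping(["worker-0", "worker-1", "worker-2"], [1, 1, 1])
# ...then redirect traffic away from a worker predicted to misbehave.
g.set_ratios(["worker-0", "worker-1", "worker-2"], [0.5, 0.5, 0.0])
```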
{"title":"A Deep Recurrent Neural Network Based Predictive Control Framework for Reliable Distributed Stream Data Processing","authors":"Jielong Xu, Jian Tang, Zhiyuan Xu, Chengxiang Yin, K. Kwiat, C. Kamhoua","doi":"10.1109/IPDPS.2019.00036","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00036","url":null,"abstract":"In this paper, we present design, implementation and evaluation of a novel predictive control framework to enable reliable distributed stream data processing, which features a Deep Recurrent Neural Network (DRNN) model for performance prediction, and dynamic grouping for flexible control. Specifically, we present a novel DRNN model, which makes accurate performance prediction with careful consideration for interference of co-located worker processes, according to multilevel runtime statistics. Moreover, we design a new grouping method, dynamic grouping, which can distribute/re-distribute data tuples to downstream tasks according to any given split ratio on the fly. So it can be used to re-direct data tuples to bypass misbehaving workers. We implemented the proposed framework based on a widely used Distributed Stream Data Processing System (DSDPS), Storm. For validation and performance evaluation, we developed two representative stream data processing applications: Windowed URL Count and Continuous Queries. Extensive experimental results show: 1) The proposed DRNN model outperforms widely used baseline solutions, ARIMA and SVR, in terms of prediction accuracy; 2) dynamic grouping works as expected; and 3) the proposed framework enhances reliability by offering minor performance degradation with misbehaving workers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121988347","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00045
J. Hashmi, S. Chakraborty, Mohammadreza Bayatpour, H. Subramoni, D. Panda
Derived datatypes are commonly used in MPI applications to exchange non-contiguous data among processes. However, state-of-the-art MPI libraries do not offer efficient processing of derived datatypes and often rely on packing and unpacking the data at the sender and receiver processes. This approach incurs the cost of extra copies and increases overall communication latency. While zero-copy communication schemes have been proposed for contiguous data, applying such techniques to non-contiguous data transfers raises several new challenges. In this work, we address these challenges and propose FALCON — Fast and Low-overhead Communication designs for intra-node processing of MPI derived datatypes. We show that the memory-layout translation of derived datatypes introduces significant overheads in the communication path, and we propose novel solutions to mitigate these bottlenecks. We also find that the current MPI datatype routines cannot fully exploit zero-copy mechanisms, and we propose enhancements to the MPI standard to address these limitations. Experimental evaluations show that our proposed designs improve intra-node communication latency and bandwidth by up to 3x over state-of-the-art MPI libraries. We also evaluate our designs with communication kernels of popular scientific applications such as MILC, WRF, NAS MG, and 3D-Stencil on three different multi-/many-core architectures and show up to 5.5x improvement over the state-of-the-art designs employed by production MPI libraries.
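For readers unfamiliar with derived datatypes, the following mpi4py sketch shows the kind of non-contiguous transfer the paper targets: sending a matrix column described by a vector datatype instead of packing it into a temporary buffer. It illustrates the standard MPI API, not FALCON's zero-copy design:

```python
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank = comm.Get_rank()

rows, cols = 4, 8
# A column of a row-major matrix is non-contiguous: `rows` doubles,
# each `cols` doubles apart. A vector datatype describes that layout.
column = MPI.DOUBLE.Create_vector(rows, 1, cols).Commit()

if rank == 0:
    a = np.arange(rows * cols, dtype="d").reshape(rows, cols)
    # Send column 0 directly from the matrix; without the derived
    # datatype, the library (or the user) would first pack it into a
    # contiguous buffer -- exactly the extra copy FALCON targets.
    comm.Send([a, 1, column], dest=1, tag=0)
elif rank == 1:
    col = np.empty(rows, dtype="d")
    comm.Recv([col, rows, MPI.DOUBLE], source=0, tag=0)
    print(col)  # [ 0.  8. 16. 24.]

column.Free()
```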
{"title":"FALCON: Efficient Designs for Zero-Copy MPI Datatype Processing on Emerging Architectures","authors":"J. Hashmi, S. Chakraborty, Mohammadreza Bayatpour, H. Subramoni, D. Panda","doi":"10.1109/IPDPS.2019.00045","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00045","url":null,"abstract":"Derived datatypes are commonly used in MPI applications to exchange non-contiguous data among processes. However, state-of-the-art MPI libraries do not offer efficient processing of derived datatypes and often rely on packing and unpacking the data at the sender and the receiver processes. This approach incurs the cost of extra copies and increases overall communication latency. While zero-copy communication schemes have been proposed for contiguous data, applying such techniques to non-contiguous data transfers bring forth several new challenges. In this work, we address these challenges and propose FALCON — Fast and Low-overhead Communication designs for intra-node MPI derived datatypes processing. We show that the memory layouts translation of derived datatypes introduce significant overheads in the communication path and propose novel solutions to mitigate such bottlenecks. We also find that the current MPI datatype routines cannot fully take advantage of the zero-copy mechanisms, and propose enhancements to the MPI standard to address these limitations. The experimental evaluations show that our proposed designs achieve up to 3 times improved intra-node communication latency and bandwidth over state-of-the-art MPI libraries. We also evaluate our designs with communication kernels of popular scientific applications such as MILC, WRF, NAS MG, and 3D-Stencil on three different multi-/many-core architectures and show up to 5.5 times improvement over state-of-the-art designs employed by production MPI libraries.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"102 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125613152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SAFIRE: Scalable and Accurate Fault Injection for Parallel Multithreaded Applications
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00097
G. Georgakoudis, I. Laguna, H. Vandierendonck, Dimitrios S. Nikolopoulos, M. Schulz
Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique for understanding the impact of faults on scientific applications. However, injecting faults into parallel applications has been prohibitively slow, inaccurate, and hard to implement. In this paper, we present SAFIRE, the first fast and accurate fault injection framework for parallel, multi-threaded applications. SAFIRE uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using SAFIRE, we show that fault manifestations can differ significantly depending on whether the fault occurs in the application itself or in the parallel runtime system. In an experimental evaluation on 15 HPC parallel programs, we show that SAFIRE is several times faster than, and equally accurate to, state-of-the-art dynamic binary instrumentation tools for fault injection.
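The abstract does not detail the fault model; a common one in soft-error studies is a single bit flip in an instruction's result. A minimal Python sketch of that model (illustrative only, not SAFIRE's compiler-level instrumentation) follows:

```python
import random

def flip_random_bit(value, bits=64):
    """Emulate a single-event upset: flip one uniformly chosen bit of
    an integer value. This only illustrates the bit-flip fault model,
    not SAFIRE's actual instrumentation."""
    return value ^ (1 << random.randrange(bits))

def maybe_inject(value, p_fault=1e-6):
    # Instrumentation conceptually wraps each selected instruction's
    # result with a check like this one.
    if random.random() < p_fault:
        return flip_random_bit(value)
    return value
```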
{"title":"SAFIRE: Scalable and Accurate Fault Injection for Parallel Multithreaded Applications","authors":"G. Georgakoudis, I. Laguna, H. Vandierendonck, Dimitrios S. Nikolopoulos, M. Schulz","doi":"10.1109/IPDPS.2019.00097","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00097","url":null,"abstract":"Soft errors threaten to disrupt supercomputing scaling. Fault injection is a key technique to understand the impact of faults on scientific applications. However, injecting faults in parallel applications has been prohibitively slow, inaccurate and hard to implement. In this paper, we present, the first fast and accurate fault injection framework for parallel, multi-threaded applications. uses novel compiler instrumentation and code generation techniques to achieve high accuracy and high speed. Using, we show that fault manifestations can be significantly different depending on whether they happen in the application itself or in the parallel runtime system. In our experimental evaluation on 15 HPC parallel programs, we show that is multiple factors faster and equally accurate in comparison with state-of-the-art dynamic binary instrumentation tools for fault injection.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114234216","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00012
A. Azad, A. Buluç
Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various models of parallel computation. This paper presents a parallel connected-components algorithm that runs on distributed-memory computers. Our algorithm uses linear-algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach. We show that the resulting algorithm, named LACC for Linear Algebraic Connected Components, outperforms competitors by a factor of up to 12x on small- to medium-scale graphs. For large graphs with more than 50B edges, LACC scales to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperforms previous algorithms by a significant margin. This performance is achieved by (1) exploiting sparsity that was not present in the original PRAM formulation, (2) using high-performance primitives of the Combinatorial BLAS, and (3) identifying hot spots and optimizing them away using algorithmic insights.
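As a rough illustration of the linear-algebraic formulation, the serial Python sketch below finds components by min-label propagation with shortcutting, the hooking/pointer-jumping pattern behind Awerbuch-Shiloach; LACC's distributed CombBLAS implementation is, of course, far more involved:

```python
import numpy as np
import scipy.sparse as sp

def connected_components(A):
    """Min-label propagation with shortcutting over a sparse adjacency
    matrix: a serial sketch of the linear-algebraic idea, not LACC's
    distributed implementation."""
    coo = sp.coo_matrix(A)
    labels = np.arange(A.shape[0])
    while True:
        # Hooking: every vertex adopts the smallest label on any edge.
        cand = labels.copy()
        np.minimum.at(cand, coo.row, labels[coo.col])
        # Shortcutting (pointer jumping): compress label chains.
        cand = np.minimum(cand, cand[cand])
        if np.array_equal(cand, labels):
            return labels
        labels = cand

# Example: two triangles -> two components.
row, col = [0, 1, 2, 3, 4, 5], [1, 2, 0, 4, 5, 3]
A = sp.coo_matrix((np.ones(6), (row, col)), shape=(6, 6))
A = A + A.T  # undirected graph
print(connected_components(A))  # [0 0 0 3 3 3]
```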
{"title":"LACC: A Linear-Algebraic Algorithm for Finding Connected Components in Distributed Memory","authors":"A. Azad, A. Buluç","doi":"10.1109/IPDPS.2019.00012","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00012","url":null,"abstract":"Finding connected components is one of the most widely used operations on a graph. Optimal serial algorithms for the problem have been known for half a century, and many competing parallel algorithms have been proposed over the last several decades under various different models of parallel computation. This paper presents a parallel connected-components algorithm that can run on distributed-memory computers. Our algorithm uses linear algebraic primitives and is based on a PRAM algorithm by Awerbuch and Shiloach. We show that the resulting algorithm, named LACC for Linear Algebraic Connected Components, outperforms competitors by a factor of up to 12x for small to medium scale graphs. For large graphs with more than 50B edges, LACC scales to 4K nodes (262K cores) of a Cray XC40 supercomputer and outperforms previous algorithms by a significant margin. This remarkable performance is accomplished by (1) exploiting sparsity that was not present in the original PRAM algorithm formulation, (2) using high-performance primitives of Combinatorial BLAS, and (3) identifying hot spots and optimizing them away by exploiting algorithmic insights.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"9 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124531620","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
mmWave Wireless Backhaul Scheduling of Stochastic Packet Arrivals
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00079
P. Garncarek, T. Jurdzinski, D. Kowalski, Miguel A. Mosteiro
Millimeter wave communication (mmWave) allows high-speed access to the radio channel. Given the highly directional nature of mmWave, dense deployments can be implemented with a macro base station serving many micro base stations, rather than connecting micro base stations directly to the core network as in legacy cellular systems. Moreover, micro base stations may cooperate in relaying packets to other micro base stations. Relays and spatial reuse speed up communication, but increase the complexity of scheduling. In this work, we study the mmWave wireless backhaul scheduling problem in the described architecture, assuming stochastic arrival of packets at the macro base station to be delivered to micro base stations. We present various results concerning system stability, defined as bounded expected queue sizes at the macro and micro base stations, under different patterns of random traffic. In particular, we show that almost all admissible arrival patterns can be handled by some universally stable algorithm, while non-admissible arrival patterns do not allow stability for any algorithm.
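To make the stability notion concrete, here is a toy single-queue simulation under an assumed Bernoulli arrival model (the paper's relay and spatial-reuse model, and its admissibility conditions, are far richer): the queue stays bounded exactly when the total arrival rate is below the service rate.

```python
import random

def peak_queue(rounds=200_000, n_flows=4, p=0.2):
    """The macro base station relays one packet per round; each of
    n_flows micro stations receives a packet with probability p per
    round, so the total arrival rate is n_flows * p. Illustrative only."""
    queue = peak = 0
    for _ in range(rounds):
        queue += sum(random.random() < p for _ in range(n_flows))
        if queue:
            queue -= 1  # serve one packet per round
        peak = max(peak, queue)
    return peak

print(peak_queue(p=0.2))  # rate 0.8 < 1: queue stays small (stable)
print(peak_queue(p=0.3))  # rate 1.2 > 1: queue grows with time
```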
{"title":"mmWave Wireless Backhaul Scheduling of Stochastic Packet Arrivals","authors":"P. Garncarek, T. Jurdzinski, D. Kowalski, Miguel A. Mosteiro","doi":"10.1109/IPDPS.2019.00079","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00079","url":null,"abstract":"Millimeter wave communication (mmWave) allows high-speed access to the radio channel. Given the highly-directional nature of mmWave, dense deployments can be implemented with a macro base station serving many micro base stations, rather than connecting micro base stations directly to the core network as in legacy cellular systems. Moreover, micro base stations may cooperate in relaying packets to other micro base stations. Relays and spatial reuse speed up communication, but increase the complexity of scheduling. In this work, we study the mmWave wireless backhaul scheduling problem in the described architecture, assuming stochastic arrival of packets at the macro base station to be delivered to micro base stations. We present various results concerning system stability, defined as a bounded expected queue sizes of macro base station and micro base stations, under different patterns of random traffic. In particular, that almost all admissible arrival patterns could be handled by some universally stable algorithms, while non-admissible arrival patterns do not allow stability for any algorithm.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1116 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126421067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Two Elementary Instructions Make Compare-and-Swap
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00046
P. Khanchandani, Roger Wattenhofer
The consensus number of an object is the maximum number of processes among which binary consensus can be solved using any number of instances of the object and read-write registers. Herlihy [1] showed in his seminal work that if an object has a consensus number of n, then its instances can be used to implement any non-trivial object or data structure shared among n processes, such that the implementation is wait-free and linearizable. Thus, an object such as compare-and-set, with an infinite consensus number, is "advanced" because its instances can be used to implement any non-trivial concurrent object shared among any number of processes. On the other hand, objects such as fetch-and-add or fetch-and-multiply have a consensus number of two and are "elementary". An important consequence of Herlihy's result was that any number of reasonable elementary objects is provably insufficient to implement an advanced object like compare-and-set. However, Ellen et al. [2] recently observed that real multiprocessors do not compute using objects but using instructions applied to memory locations. Using this observation, they showed that a couple of elementary instructions on the same memory location can implement an advanced one, and consequently any non-trivial object or data structure. However, this result only establishes possibility: it uses a generic universal construction as a black box, which is not how objects are implemented in practice, as the generic construction is quite inefficient in the number of steps taken by a process and the number of shared objects used in the worst case. Instead, efficient implementations are built upon the widely supported compare-and-set instruction, and one cannot conclude from the previous result whether elementary instructions can yield implementations as efficient as those based on compare-and-set, or whether they are fundamentally limited in this respect. In this paper, we answer this question by giving a wait-free and linearizable implementation of compare-and-set using just two elementary instructions, half-max and max-write. The implementation takes O(1) steps per process and uses O(1) shared objects per process. Thus, any known or unknown compare-and-set-based implementation can also be realized using only these two elementary instructions without any loss in efficiency. An interesting aspect of these elementary instructions is that, depending on the underlying system, their throughput in a highly concurrent setting is larger than that of the compare-and-set instruction by a factor proportional to n.
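For reference, the sequential specification of the compare-and-set object being implemented can be pinned down as below. This lock-based Python version only fixes the semantics: it is blocking, not the paper's wait-free construction, and the abstract does not specify the semantics of half-max and max-write, so no attempt is made to reproduce them here.

```python
import threading

class CompareAndSet:
    """Sequential specification of compare-and-set: atomically install
    `new` iff the current value equals `expected`. A lock stands in for
    atomicity; the paper builds this wait-free from two elementary
    instructions instead."""

    def __init__(self, value=0):
        self._value = value
        self._lock = threading.Lock()

    def compare_and_set(self, expected, new):
        with self._lock:
            if self._value == expected:
                self._value = new
                return True
            return False
```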
{"title":"Two Elementary Instructions Make Compare-and-Swap","authors":"P. Khanchandani, Roger Wattenhofer","doi":"10.1109/IPDPS.2019.00046","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00046","url":null,"abstract":"The consensus number of an object is the maximum number of processes among which binary consensus can be solved using any number of instances of the object and read-write registers. Herlihy [1] showed in his seminal work that if an object has a consensus number of n, then its instances can be used to implement any non-trivial object or data structure that is shared among n processes, so that the implementation is wait-free and linearizable. Thus, an object such as compare-and-set with an infinite consensus number is \"advanced\" because its instances can be used to implement any non-trivial concurrent object shared among any number of processes. On the other hand, objects such as fetch-and-add or fetch-and-multiply have a consensus number of two and are \"elementary\". An important consequence of Herlihy's result was that any number of reasonable elementary objects are provably insufficient to implement an advanced object like compare-and-set. However, Ellen et al. [2] observed recently that real multiprocessors do not compute using objects but using instructions that are applied on memory locations. Using this observation, they show that it is possible to use a couple of elementary instructions on the same memory location to implement an advanced one, and consequently any non-trivial object or data structure. However, the above result is only a possibility and uses a generic universal construction as a black-box, which is not how we implement objects in practice, as the generic construction is quite inefficient with respect to the number of steps taken by a process and the number of shared objects used in the worst case. Instead, the efficient implementations are built upon the widely supported compare-and-set instruction and one cannot conclude from the previous result whether the elementary instructions can also produce equally efficient implementations like compare-and-set does or they are fundamentally limited in this respect. In this paper, we answer this question by giving a wait-free and linearizable implementation of compare-and-set using just two elementary instructions, half-max and max-write. The implementation takes O(1) steps per process and uses O(1) shared objects per process. Thus, any known or unknown compare-and-set based implementation can also be done using only two elementary instructions without any loss in efficiency. An interesting aspect of these elementary instructions is that depending on the underlying system, their throughput in a highly concurrent setting is larger than that of the compare-and-set instructions by a factor proportional to n.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"125 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116474686","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Drowsy-DC: Data Center Power Management System
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00091
Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont, Baptiste Lepers, W. Zwaenepoel
In a modern data center (DC), the large majority of costs arises from energy consumption. The most popular technique for mitigating this issue in a virtualized DC is virtual machine (VM) consolidation. Although consolidation may increase server utilization by about 5-10%, server loads greater than 50% are rarely observed in practice. By analyzing traces from our cloud provider partner, and as confirmed by previous research, we have identified that some VMs have sporadic periods of data computation followed by long intervals of idleness. These VMs often prevent the consolidation system from further increasing the energy efficiency of the DC. In this paper we propose a novel DC power management system called Drowsy-DC, which identifies VMs that have similar periods of idleness. These VMs are then colocated on the same server so that their shared idle periods can be exploited to put the server into a low-power mode (suspend to RAM) until data computation is again required. With negligible overhead, our system can improve any VM consolidation system (by up to 81% for OpenStack Neat).
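One way to make "similar periods of idleness" concrete is an overlap score over observed idle intervals; the sketch below is purely illustrative and not the paper's actual detection or placement algorithm:

```python
def idle_overlap(idle_a, idle_b, horizon):
    """Fraction of the horizon during which both VMs are idle -- a
    simple affinity score a placer could use to colocate VMs whose
    idle periods align, so the shared idleness lets the host suspend
    to RAM. Idle periods are (start, end) tuples; illustrative only."""
    both = 0
    for t in range(horizon):
        in_a = any(s <= t < e for s, e in idle_a)
        in_b = any(s <= t < e for s, e in idle_b)
        both += in_a and in_b
    return both / horizon

# Two VMs idle overnight (minutes of a day): high overlap makes them
# good candidates for the same host.
vm1 = [(0, 360), (1200, 1440)]
vm2 = [(0, 300), (1260, 1440)]
print(idle_overlap(vm1, vm2, 1440))  # 480/1440 = 0.333...
```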
{"title":"Drowsy-DC: Data Center Power Management System","authors":"Mathieu Bacou, Grégoire Todeschi, A. Tchana, D. Hagimont, Baptiste Lepers, W. Zwaenepoel","doi":"10.1109/IPDPS.2019.00091","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00091","url":null,"abstract":"In a modern data center (DC), the large majority of costs arise from the energy consumption. The most popular technique used to mitigate this issue in a virtualized DC is the virtual machine (VM) consolidation. Although the latter may increase server utilization by about 5-10%, it is difficult to actually notice server loads greater than 50%. By analyzing the traces from our cloud provider partner, confirmed by previous research work, we have identified that some VMs have sporadic periods of data computation followed by large intervals of idleness. These VMs often hinder the consolidation system to further increase the energy efficiency of the DC. In this paper we propose a novel DC power management system called Drowsy-DC, which is able to identify the aforementioned VMs that have similar periods of idleness. Further, these VMs are colocated on the same server so that their idle periods are exploited to put the server to a low power mode (suspend to RAM) until some data computation is required. By introducing a negligible overhead, our system is able to improve any VM consolidation system (up to 81% for OpenStack Neat).","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"132 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121624240","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Power and Performance Tradeoffs for Visualization Algorithms
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00042
Stephanie Labasan, Matthew Larsen, H. Childs, B. Rountree
One of the biggest challenges for leading-edge supercomputers is power usage. Looking forward, power is expected to become an increasingly limited resource, so it is critical to understand the runtime behaviors of applications in this constrained environment in order to use power wisely. Within this context, we explore the tradeoffs between power and performance specifically for visualization algorithms. With respect to execution behavior under a power limit, visualization algorithms differ from traditional HPC applications, like scientific simulations, because visualization is more data intensive. This data intensive characteristic lends itself to alternative strategies regarding power usage. In this study, we focus on a representative set of visualization algorithms, and explore their power and performance characteristics as a power bound is applied. The result is a study that identifies how future research efforts can exploit the execution characteristics of visualization applications in order to optimize performance under a power bound.
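As a concrete example of the kind of measurement such a study rests on, the sketch below reads the Linux powercap/RAPL energy counter around a run. Paths and permissions vary by machine, and this is an assumed setup rather than the authors' harness:

```python
import time
from pathlib import Path

# powercap/RAPL sysfs node for CPU package 0; the exact path and its
# availability vary by machine, and reading may require permissions.
RAPL = Path("/sys/class/powercap/intel-rapl:0")

def read_energy_uj():
    # Cumulative package energy in microjoules.
    return int((RAPL / "energy_uj").read_text())

def average_watts(fn):
    """Average package power while running fn -- the kind of measurement
    behind power/performance curves. Sketch only: it ignores counter
    wraparound, and *enforcing* a power bound would instead write
    constraint_0_power_limit_uw, which requires root."""
    e0, t0 = read_energy_uj(), time.monotonic()
    fn()
    e1, t1 = read_energy_uj(), time.monotonic()
    return (e1 - e0) / 1e6 / (t1 - t0)

# Example: power draw of a CPU-heavy stand-in for a visualization kernel.
print(average_watts(lambda: sum(i * i for i in range(10**7))))
```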
{"title":"Power and Performance Tradeoffs for Visualization Algorithms","authors":"Stephanie Labasan, Matthew Larsen, H. Childs, B. Rountree","doi":"10.1109/IPDPS.2019.00042","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00042","url":null,"abstract":"One of the biggest challenges for leading-edge supercomputers is power usage. Looking forward, power is expected to become an increasingly limited resource, so it is critical to understand the runtime behaviors of applications in this constrained environment in order to use power wisely. Within this context, we explore the tradeoffs between power and performance specifically for visualization algorithms. With respect to execution behavior under a power limit, visualization algorithms differ from traditional HPC applications, like scientific simulations, because visualization is more data intensive. This data intensive characteristic lends itself to alternative strategies regarding power usage. In this study, we focus on a representative set of visualization algorithms, and explore their power and performance characteristics as a power bound is applied. The result is a study that identifies how future research efforts can exploit the execution characteristics of visualization applications in order to optimize performance under a power bound.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"2015 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127666015","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
IPDPS 2019 Organization
Pub Date: 2019-05-01 | DOI: 10.1109/ipdps.2019.00009
Vinod E. F. Rebello, A. Melo
WORKSHOPS COMMITTEE: Olivier Beaumont (Inria Bordeaux Sud-Ouest, France); Sunita Chandrasekaran (University of Delaware, USA); Ananth Kalyanaraman (Washington State University, USA); Cynthia A. Philips (Sandia National Laboratories, USA); Sivasankaran Rajamanickam (Sandia National Laboratories, USA); Min Si (Argonne National Laboratory, USA); Alan Sussman (University of Maryland, USA); Bora Ucar (CNRS, France)
{"title":"IPDPS 2019 Organization","authors":"Vinod E. F. Rebello, A. Melo","doi":"10.1109/ipdps.2019.00009","DOIUrl":"https://doi.org/10.1109/ipdps.2019.00009","url":null,"abstract":"WORKSHOPS COMMITTEE Olivier Beaumont (Inria Bordeaux Sud-Ouest, France) Sunita Chandrasekaran (University of Delaware, USA) Ananth Kalyanaraman (Washington State University, USA) Cynthia A. Philips (Sandia National Laboratories, USA) Sivasankaran Rajamanickam (Sandia National Laboratories, USA) Min Si (Argonne National Laboratory, USA) Alan Sussman (University of Maryland, USA) Bora Ucar (CNRS, France)","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"35 1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131151585","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems
Pub Date: 2019-05-01 | DOI: 10.1109/IPDPS.2019.00061
Woong Shin, Christopher Brumgard, Bing Xie, Sudharshan S. Vazhkudai, D. Ghoshal, S. Oral, L. Ramakrishnan
We present the design and implementation of Data Jockey, a data management system for HPC multi-tiered storage systems. As a centralized data management control plane, Data Jockey automates bulk data movement and placement for scientific workflows and integrates into existing HPC storage infrastructures. Data Jockey simplifies data management by eliminating the human effort of programming complex data movements and of placing datasets across multiple storage tiers to support complex workflows, which in turn increases the usability of the multi-tiered storage systems emerging in modern HPC data centers. Specifically, Data Jockey introduces a new data management scheme called "goal driven data management," which automatically infers low-level bulk data movement plans from declarative high-level goal statements produced over the lifetime of iterative runs of scientific workflows. In doing so, Data Jockey aims to minimize data wait times by taking responsibility for datasets that are unused or about to be used, and by aggressively utilizing the capacity of the upper, higher-performance storage tiers. We evaluated a prototype implementation of Data Jockey under a synthetic workload based on a year's worth of operational logs from the Oak Ridge Leadership Computing Facility (OLCF). Our evaluations suggest that Data Jockey leads to higher utilization of the upper storage tiers while requiring less data-movement programming effort than human-driven, per-domain ad hoc data management scripts.
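The abstract's "goal driven data management" suggests declarative statements compiled into movement plans. The toy sketch below invents a goal format, tier names, and a planning step purely for illustration; none of it is Data Jockey's actual syntax:

```python
# Hypothetical declarative goal: where a dataset must live before and
# after a workflow step. All field names and tiers are made up.
goal = {
    "dataset": "/proj/sim/run-042/checkpoints",
    "needed_by": "workflow step 3",
    "tier": "burst-buffer",           # stage to the fast upper tier
    "after_use": "campaign-storage",  # demote when the step finishes
}

def plan(goal, current_tier):
    """Infer bulk movements from the declarative goal: stage the
    dataset up before use, demote it afterwards."""
    moves = []
    if current_tier != goal["tier"]:
        moves.append(("stage", goal["dataset"], current_tier, goal["tier"]))
    moves.append(("demote", goal["dataset"], goal["tier"], goal["after_use"]))
    return moves

print(plan(goal, current_tier="campaign-storage"))
```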
{"title":"Data Jockey: Automatic Data Management for HPC Multi-tiered Storage Systems","authors":"Woong Shin, Christopher Brumgard, Bing Xie, Sudharshan S. Vazhkudai, D. Ghoshal, S. Oral, L. Ramakrishnan","doi":"10.1109/IPDPS.2019.00061","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00061","url":null,"abstract":"We present the design and implementation of Data Jockey, a data management system for HPC multi-tiered storage systems. As a centralized data management control plane, Data Jockey automates bulk data movement and placement for scientific workflows and integrates into existing HPC storage infrastructures. Data Jockey simplifies data management by eliminating human effort in programming complex data movements, laying datasets across multiple storage tiers when supporting complex workflows, which in turn increases the usability of multi-tiered storage systems emerging in modern HPC data centers. Specifically, Data Jockey presents a new data management scheme called \"goal driven data management\" that can automatically infer low-level bulk data movement plans from declarative high-level goal statements that come from the lifetime of iterative runs of scientific workflows. While doing so, Data Jockey aims to minimize data wait times by taking responsibility for datasets that are unused or to be used, and aggressively utilizing the capacity of the upper, higher performant storage tiers. We evaluated a prototype implementation of Data Jockey under a synthetic workload based on a year's worth of Oak Ridge Leadership Computing Facility's (OLCF) operational logs. Our evaluations suggest that Data Jockey leads to higher utilization of the upper storage tiers while minimizing the programming effort of data movement compared to human involved, per-domain ad-hoc data management scripts.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123435781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}