Amelie Chi Zhou, Yao Xiao, Bingsheng He, Shadi Ibrahim, Reynold Cheng
Workflows are an important model for big data processing, and resource provisioning is crucial to their performance. Recently, system variations in the cloud and in large-scale clusters, such as variations in I/O and network performance, have been observed to greatly affect workflow performance. Traditional resource provisioning methods, which overlook these variations, can lead to suboptimal provisioning results. In this paper, we provide a general solution for workflow performance optimization that takes system variations into account. Specifically, we model system variations as time-dependent random variables and take their probability distributions as optimization input. Despite its effectiveness, this solution incurs heavy computational overhead. We therefore propose three pruning techniques to simplify workflow structure and reduce the probability evaluation overhead. We implement our techniques in a runtime library, which allows users to incorporate efficient probabilistic optimization into existing resource provisioning methods. Experiments show that probabilistic solutions can improve performance by 51% compared to state-of-the-art static solutions while satisfying the budget constraint, and that our pruning techniques greatly reduce the overhead of probabilistic optimization.
{"title":"Incorporating Probabilistic Optimizations for Resource Provisioning of Data Processing Workflows","authors":"Amelie Chi Zhou, Yao Xiao, Bingsheng He, Shadi Ibrahim, Reynold Cheng","doi":"10.1145/3337821.3337847","DOIUrl":"https://doi.org/10.1145/3337821.3337847","url":null,"abstract":"Workflow is an important model for big data processing and resource provisioning is crucial to the performance of workflows. Recently, system variations in the cloud and large-scale clusters, such as those in I/O and network performances, have been observed to greatly affect the performance of workflows. Traditional resource provisioning methods, which overlook these variations, can lead to suboptimal resource provisioning results. In this paper, we provide a general solution for workflow performance optimizations considering system variations. Specifically, we model system variations as time-dependent random variables and take their probability distributions as optimization input. Despite its effectiveness, this solution involves heavy computation overhead. Thus, we propose three pruning techniques to simplify workflow structure and reduce the probability evaluation overhead. We implement our techniques in a runtime library, which allows users to incorporate efficient probabilistic optimization into existing resource provisioning methods. Experiments show that probabilistic solutions can improve the performance by 51% compared to state-of-the-art static solutions while guaranteeing budget constraint, and our pruning techniques can greatly reduce the overhead of probabilistic optimization.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125888780","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hardware cache prefetching is deployed in modern multicore processors to reduce memory latencies and address the memory wall problem. However, it tends to increase Last-Level Cache (LLC) contention among applications in multiprogrammed workloads, degrading overall system performance. To study the interaction between hardware prefetching and LLC management, we first analyze how application performance varies with the effective LLC space in the presence and absence of hardware prefetching. We observe that hardware prefetching can compensate for the performance loss caused by reduced effective cache space. Motivated by this observation, we classify applications into two categories, prefetching-sensitive (PS) and non-prefetching-sensitive (NPS), by the degree of performance benefit they gain from hardware prefetchers. To address cache contention and mitigate potential prefetch-related cache interference, we propose CPpf, a cache partitioning approach for improving shared cache management in the presence of hardware prefetching. CPpf consists of a method that uses Precise Event-Based Sampling to classify PS and NPS applications online, and a cache partitioning scheme that uses Cache Allocation Technology to distribute the cache space between PS and NPS applications. We implemented CPpf as a user-level runtime system on Linux. Compared with a non-partitioning approach, CPpf achieves speedups of up to 1.20, 1.08, and 1.06 for workloads with 2, 4, and 8 single-threaded applications, respectively. Moreover, it achieves speedups of up to 1.22 and 1.11 for workloads composed of two applications with 4 and 8 threads, respectively.
{"title":"CPpf: a prefetch aware LLC partitioning approach","authors":"Jun Xiao, A. Pimentel, Xu Liu","doi":"10.1145/3337821.3337895","DOIUrl":"https://doi.org/10.1145/3337821.3337895","url":null,"abstract":"Hardware cache prefetching is deployed in modern multicore processors to reduce memory latencies, addressing the memory wall problem. However, it tends to increase the Last Level Cache (LLC) contention among applications in multiprogrammed workloads, leading to a performance degradation for the overall system. To study the interaction between hardware prefetching and LLC cache management, we first analyze the variation of application performance when varying the effective LLC space in the presence and absence of hardware prefetching. We observe that hardware prefetching can compensate the application performance loss due to the reduced effective cache space. Motivated by this observation, we classify applications into two categories, prefetching sensitive (PS) and non prefetching sensitive (NPS) applications, by the degree of performance benefit they experience from hardware prefetchers. To address the cache contention and also to mitigate the potential prefetch-related cache interference, we propose CPpf, a cache partitioning approach for improving the shared cache management in the presence of hardware prefetching. CPpf consists of a method using Precise Event-Based Sampling techniques for the online classification of PS and NPS applications and a cache partitioning scheme using Cache Allocation technology to distribute the cache space among PS and NPS applications. We implemented CPpf as a user-level runtime system on Linux. Compared with a non-partitioning approach, CPpf achieves speedups of up to 1.20, 1.08 and 1.06 for workloads with 2, 4 and 8 single-threaded applications, respectively. Moreover, it achieves speedups of up to 1.22 and 1.11 for workloads composed of two applications with 4 threads and 8 threads, respectively.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125576875","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic voltage and frequency scaling (DVFS) is an important mechanism for balancing performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies. The ability to manually set these frequencies is a great opportunity for application tuning, which can focus on the best application-dependent setting. However, this task is not straightforward, because of the large set of possible configurations and because of the multi-objective nature of the problem, which seeks to minimize energy consumption while maximizing performance. This paper proposes a method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. Our modeling approach, based on machine learning, first predicts speedup and normalized energy over the default frequency configuration. It then combines the two models into a multi-objective one that predicts a Pareto set of frequency configurations. The approach uses static code features, is built on a set of carefully designed micro-benchmarks, and can predict the best frequency settings of a new kernel without executing it. Test results show that our modeling approach is highly accurate in predicting extrema points and the Pareto set for ten out of twelve test benchmarks, and that it discovers frequency configurations that dominate the default configuration in either energy or performance.
{"title":"Predictable GPUs Frequency Scaling for Energy and Performance","authors":"Kaijie Fan, Biagio Cosenza, B. Juurlink","doi":"10.1145/3337821.3337833","DOIUrl":"https://doi.org/10.1145/3337821.3337833","url":null,"abstract":"Dynamic voltage and frequency scaling (DVFS) is an important solution to balance performance and energy consumption, and hardware vendors provide management libraries that allow the programmer to change both memory and core frequencies. The possibility to manually set these frequencies is a great opportunity for application tuning, which can focus on the best application-dependent setting. However, this task is not straightforward because of the large set of possible configurations and because of the multi-objective nature of the problem, which minimizes energy consumption and maximizes performance. This paper proposes a method to predict the best core and memory frequency configurations on GPUs for an input OpenCL kernel. Our modeling approach, based on machine learning, first predicts speedup and normalized energy over the default frequency configuration. Then, it combines the two models into a multi-objective one that predicts a Pareto-set of frequency configurations. The approach uses static code features, is built on a set of carefully designed micro-benchmarks, and can predict the best frequency settings of a new kernel without executing it. Test results show that our modeling approach is very accurate on predicting extrema points and Pareto set for ten out of twelve test benchmarks, and discover frequency configurations that dominate the default configuration in either energy or performance.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1524 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127445523","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Adrian Garcia-Garcia, J. C. Saez, Fernando Castro, Manuel Prieto-Matias
Multicore processors constitute the main architecture choice for modern computing systems in different market segments. Despite their benefits, the contention that naturally appears when multiple applications compete for shared resources among cores, such as the last-level cache (LLC), may lead to substantial performance degradation. This may have a negative impact on key system aspects such as throughput and fairness. Assigning the various applications in the workload to separate LLC partitions, possibly of different sizes, has been proven effective at mitigating shared-resource contention effects. In this article we propose LFOC, a clustering-based cache partitioning scheme that strives to deliver fairness while providing acceptable system throughput. LFOC leverages Intel Cache Allocation Technology (CAT), which enables the system software to divide the LLC into different partitions. To accomplish its goals, LFOC tries to mimic the behavior of the optimal cache-clustering solution, which we approximate by means of a simulator in different scenarios. To this end, LFOC effectively identifies streaming aggressor programs and cache-sensitive applications, which are then assigned to separate cache partitions. We implemented LFOC in the Linux kernel and evaluated it on a real system featuring an Intel Skylake processor, where we compare its effectiveness to that of two state-of-the-art policies that optimize fairness and throughput, respectively. Our experimental analysis reveals that LFOC delivers a greater reduction in unfairness while relying on a lightweight algorithm suitable for adoption in a real OS.
{"title":"LFOC","authors":"Adrian Garcia-Garcia, J. C. Saez, Fernando Castro, Manuel Prieto-Matias","doi":"10.1145/3337821.3337925","DOIUrl":"https://doi.org/10.1145/3337821.3337925","url":null,"abstract":"Multicore processors constitute the main architecture choice for modern computing systems in different market segments. Despite their benefits, the contention that naturally appears when multiple applications compete for the use of shared resources among cores, such as the last-level cache (LLC), may lead to substantial performance degradation. This may have a negative impact on key system aspects such as throughput and fairness. Assigning the various applications in the workload to separate LLC partitions with possibly different sizes, has been proven effective to mitigate shared-resource contention effects. In this article we propose LFOC, a clustering-based cache partitioning scheme that strives to deliver fairness while providing acceptable system throughput. LFOC leverages the Intel Cache Allocation Technology (CAT), which enables the system software to divide the LLC into different partitions. To accomplish its goals, LFOC tries to mimic the behavior of the optimal cache-clustering solution, which we could approximate by means of a simulator in different scenarios. To this end, LFOC effectively identifies streaming aggressor programs and cache sensitive applications, which are then assigned to separate cache partitions. We implemented LFOC in the Linux kernel and evaluated it on a real system featuring an Intel Skylake processor, where we compare its effectiveness to that of two state-of-the-art policies that optimize fairness and throughput, respectively. Our experimental analysis reveals that LFOC is able to bring a higher reduction in unfairness by leveraging a lightweight algorithm suitable for adoption in a real OS.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129521253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reducing the energy required to carry out computational tasks is key to almost any computing application. In this paper we focus on iterative applications that have explicit computational deadlines per iteration. Our objective is to meet the computational deadlines while minimizing energy. We leverage the vast configuration space offered by heterogeneous multicore platforms, which typically expose three dimensions of energy-saving configurability: voltage/frequency levels, thread count, and core type (e.g., ARM big.LITTLE). We note that when choosing the most energy-efficient configuration that meets the computational deadline, an iteration will typically finish before the deadline, and execution-time slack builds up across iterations. Our proposed slack management policy, SaC (Slack as a Currency), proactively explores the configuration space to select configurations that can save substantial amounts of energy. To avoid the overhead of an exhaustive search of the configuration space, our proposal also comprises a low-overhead, online method that assesses each point in the configuration space by linearly interpolating between the endpoints in each configuration-space dimension. Overall, we show that our proposed slack management policy and linear-interpolation configuration assessment method can yield 62% energy savings on top of race-to-idle without missing any deadlines.
{"title":"SaC","authors":"M. Azhar, M. Pericàs, P. Stenström","doi":"10.1145/3337821.3337865","DOIUrl":"https://doi.org/10.1145/3337821.3337865","url":null,"abstract":"Reducing the energy to carry out computational tasks is key to almost any computing application. We focus in this paper on iterative applications that have explicit computational deadlines per iteration. Our objective is to meet the computational deadlines while minimizing energy. We leverage the vast configuration space offered by heterogeneous multicore platforms which typically expose three dimensions for energy saving configurability: Voltage/frequency levels, thread count and core type (e.g. ARM big/LITTLE). We note that when choosing the most energy-efficient configuration that meets the computational deadline, an iteration will typically finish before the deadline and execution-time slack will build up across iterations. Our proposed slack management policy - SaC (Slack as a Currency) - proactively explores the configuration space to select configurations that can save substantial amounts of energy. To avoid the overheads of an exhaustive search of the configuration space, our proposal also comprises a low-overhead, on-line method by which one can assess each point in the configuration space by linearly interpolating between the endpoints in each configuration-space dimension. Overall, we show that our proposed slack management policy and linear-interpolation configuration assessment method can yield 62% energy savings on top of race-to-idle without missing any deadlines.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129631771","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A. Veith, Felipe Rodrigo de Souza, M. Assunção, L. Lefèvre, J. Anjos
There is increasing demand for handling massive amounts of data in a timely manner via Distributed Stream Processing (DSP). A DSP application is often structured as a directed graph whose vertices are operators that perform transformations over the incoming data and whose edges represent the data streams between operators. DSP applications are traditionally deployed on the Cloud in order to exploit a virtually unlimited number of resources. Edge computing has emerged as a suitable paradigm for executing parts of DSP applications by offloading certain operators from the Cloud and placing them close to where the data is generated, hence minimising the overall time required to process data events (i.e., the end-to-end latency). Operator reconfiguration consists of changing the initial placement by reassigning operators to different devices given target performance metrics. In this work, we model operator reconfiguration as a Reinforcement Learning (RL) problem and define a multi-objective reward that considers metrics regarding operator reconfiguration as well as infrastructure and application improvement. Experimental results show that reconfiguration algorithms that minimise only end-to-end processing latency can have a substantial impact on WAN traffic and communication cost. The results also demonstrate that, when reconfiguring operators, RL algorithms improve the performance of the initial placement provided by state-of-the-art approaches by over 50%.
{"title":"Multi-Objective Reinforcement Learning for Reconfiguring Data Stream Analytics on Edge Computing","authors":"A. Veith, Felipe Rodrigo de Souza, M. Assunção, L. Lefèvre, J. Anjos","doi":"10.1145/3337821.3337894","DOIUrl":"https://doi.org/10.1145/3337821.3337894","url":null,"abstract":"There is increasing demand for handling massive amounts of data in a timely manner via Distributed Stream Processing (DSP). A DSP application is often structured as a directed graph whose vertices are operators that perform transformations over the incoming data and edges representing the data streams between operators. DSP applications are traditionally deployed on the Cloud in order to explore the virtually unlimited number of resources. Edge computing has emerged as a suitable paradigm for executing parts of DSP applications by offloading certain operators from the Cloud and placing them close to where the data is generated, hence minimising the overall time required to process data events (i.e., the end-to-end latency). The operator reconfiguration consists of changing the initial placement by reassigning operators to different devices given target performance metrics. In this work, we model the operator reconfiguration as a Reinforcement Learning (RL) problem and define a multi-objective reward considering metrics regarding operator reconfiguration, and infrastructure and application improvement. Experimental results show that reconfiguration algorithms that minimise only end-to-end processing latency can have a substantial impact on WAN traffic and communication cost. The results also demonstrate that when reconfiguring operators, RL algorithms improve by over 50% the performance of the initial placement provided by state-of-the-art approaches.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124334942","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sidharth Kumar, Steve Petruzza, W. Usher, Valerio Pascucci
Particle data are used across a diverse set of large-scale simulations, for example in cosmology, molecular dynamics, and combustion. At scale, these applications generate tremendous amounts of data, which is often saved in an unstructured format that does not preserve spatial locality, resulting in poor read performance for post-processing analysis and visualization tasks, which typically issue spatial queries. In this work, we explore some of the challenges of large-scale particle data management and introduce new techniques to perform scalable, spatially-aware write and read operations. We propose an adaptive aggregation technique to improve the performance of data aggregation for both uniform and non-uniform particle distributions. Furthermore, we enable efficient read operations by employing a level-of-detail re-ordering and a multi-resolution layout. Finally, we demonstrate the scalability of our techniques with experiments on large-scale simulation workloads up to 256K cores on two leadership supercomputers, Mira and Theta.
{"title":"Spatially-aware Parallel I/O for Particle Data","authors":"Sidharth Kumar, Steve Petruzza, W. Usher, Valerio Pascucci","doi":"10.1145/3337821.3337875","DOIUrl":"https://doi.org/10.1145/3337821.3337875","url":null,"abstract":"Particle data are used across a diverse set of large scale simulations, for example, in cosmology, molecular dynamics and combustion. At scale these applications generate tremendous amounts of data, which is often saved in an unstructured format that does not preserve spatial locality; resulting in poor read performance for post-processing analysis and visualization tasks, which typically make spatial queries. In this work, we explore some of the challenges of large scale particle data management, and introduce new techniques to perform scalable, spatially-aware write and read operations. We propose an adaptive aggregation technique to improve the performance of data aggregation, for both uniform and non-uniform particle distributions. Furthermore, we enable efficient read operations by employing a level of detail re-ordering and a multi-resolution layout. Finally, we demonstrate the scalability of our techniques with experiments on large scale simulation workloads up to 256K cores on two different leadership supercomputers, Mira and Theta.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128270066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
High Performance Computing (HPC) demand is on the rise, particularly for large distributed computing. HPC systems have, by design, very heterogeneous architectures, both in computation and in communication bandwidth, resulting in wide variations in the cost of communication between compute units. If large distributed applications are to take full advantage of HPC, the physical communication capabilities must be taken into consideration when allocating workload. Hypergraphs are good at modelling the total volume of communication in parallel and distributed applications. To the best of our knowledge, there are no hypergraph partitioning algorithms to date that are architecture-aware. We propose a novel restreaming hypergraph partitioning algorithm (HyperPRAW) that takes advantage of peer-to-peer physical bandwidth profiling data to improve the performance of distributed applications on HPC systems. Our results show that not only is the quality of the partitions achieved by our algorithm comparable with state-of-the-art multilevel partitioning, but the runtime of a synthetic benchmark is also significantly reduced across the 10 hypergraph models tested, with speedup factors of up to 14x.
{"title":"HyperPRAW","authors":"Carlos Fernandez Musoles, D. Coca, P. Richmond","doi":"10.1145/3337821.3337876","DOIUrl":"https://doi.org/10.1145/3337821.3337876","url":null,"abstract":"High Performance Computing (HPC) demand is on the rise, particularly for large distributed computing. HPC systems have, by design, very heterogeneous architectures, both in computation and in communication bandwidth, resulting in wide variations in the cost of communications between compute units. If large distributed applications are to take full advantage of HPC, the physical communication capabilities must be taken into consideration when allocating workload. Hypergraphs are good at modelling total volume of communication in parallel and distributed applications. To the best of our knowledge, there are no hypergraph partitioning algorithms to date that are architecture-aware. We propose a novel restreaming hypergraph partitioning algorithm (HyperPRAW) that takes advantage of peer to peer physical bandwidth profiling data to improve distributed applications performance in HPC systems. Our results show that not only the quality of the partitions achieved by our algorithm is comparable with state-of-the-art multilevel partitioning, but that the runtime performance in a synthetic benchmark is significantly reduced in 10 hypergraph models tested, with speedup factors of up to 14x.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"86 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124029503","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Haoyue Zheng, Fei Xu, Li Chen, Zhi Zhou, Fangming Liu
It has become an increasingly popular trend to train deep neural networks on large-scale datasets in a distributed manner in the cloud. However, distributed deep neural network (DDNN) training, widely known to be resource-intensive and time-consuming, suffers from unpredictable performance in the cloud due to the intricate factors of resource bottlenecks, heterogeneity, and the imbalance between computation and communication, which eventually cause severe resource under-utilization. In this paper, we propose Cynthia, a cost-efficient cloud resource provisioning framework that provides predictable DDNN training performance and reduces the training budget. To explicitly capture resource bottlenecks and heterogeneity, Cynthia predicts the DDNN training time by leveraging a lightweight analytical performance model based on the resource consumption of workers and parameter servers. With an accurate performance prediction, Cynthia is able to optimally provision cost-efficient cloud instances that jointly guarantee the training performance and minimize the training budget. We implement Cynthia on top of Kubernetes by launching a cluster of 56 Docker containers to train four representative DNN models. Extensive prototype experiments on Amazon EC2 demonstrate that Cynthia can provide predictable training performance while reducing the monetary cost of DDNN workloads by up to 50.6% compared with state-of-the-art resource provisioning strategies, with acceptable runtime overhead.
{"title":"Cynthia","authors":"Haoyue Zheng, Fei Xu, Li Chen, Zhi Zhou, Fangming Liu","doi":"10.1145/3337821.3337873","DOIUrl":"https://doi.org/10.1145/3337821.3337873","url":null,"abstract":"It becomes an increasingly popular trend for deep neural networks with large-scale datasets to be trained in a distributed manner in the cloud. However, widely known as resource-intensive and time-consuming, distributed deep neural network (DDNN) training suffers from unpredictable performance in the cloud, due to the intricate factors of resource bottleneck, heterogeneity and the imbalance of computation and communication which eventually cause severe resource under-utilization. In this paper, we propose Cynthia, a cost-efficient cloud resource provisioning framework to provide predictable DDNN training performance and reduce the training budget. To explicitly explore the resource bottleneck and heterogeneity, Cynthia predicts the DDNN training time by leveraging a lightweight analytical performance model based on the resource consumption of workers and parameter servers. With an accurate performance prediction, Cynthia is able to optimally provision the cost-efficient cloud instances to jointly guarantee the training performance and minimize the training budget. We implement Cynthia on top of Kubernetes by launching a 56-docker cluster to train four representative DNN models. Extensive prototype experiments on Amazon EC2 demonstrate that Cynthia can provide predictable training performance while reducing the monetary cost for DDNN workloads by up to 50.6%, in comparison to state-of-the-art resource provisioning strategies, yet with acceptable runtime overhead.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130797899","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xingfu Wu, V. Taylor, J. Wozniak, R. Stevens, T. Brettin, Fangfang Xia
Training scientific deep learning models requires the significant compute power of high-performance computing systems. In this paper, we analyze the performance characteristics of the benchmarks from the exploratory research project CANDLE (Cancer Distributed Learning Environment), with a focus on the hyperparameters: number of epochs, batch size, and learning rate. We present a parallel methodology that uses the distributed deep learning framework Horovod to parallelize the CANDLE benchmarks. We then use scaling strategies for both epochs and batch size, with linear learning-rate scaling, to investigate how they impact execution time and accuracy as well as the power, energy, and scalability of the parallel CANDLE benchmarks under strong and weak scaling on the IBM Power9-based heterogeneous system Summit at Oak Ridge National Laboratory and the Cray XC40 Theta at Argonne National Laboratory. This study provides insights into how to set the proper number of epochs, batch size, and compute resources for these benchmarks to preserve high accuracy and reduce execution time. We identify a data-loading performance bottleneck and then improve performance and energy for better scalability. Results with the modified benchmarks on Summit indicate up to 78.25% performance improvement and up to 78% energy savings under strong scaling on up to 384 GPUs, and up to 79.5% performance improvement and up to 77.11% energy savings under weak scaling on up to 3,072 GPUs. On Theta, we achieve up to 45.22% performance improvement and up to 41.78% energy savings under strong scaling on up to 384 nodes. Moreover, the modification dramatically reduces the broadcast overhead.
{"title":"Performance, Energy, and Scalability Analysis and Improvement of Parallel Cancer Deep Learning CANDLE Benchmarks","authors":"Xingfu Wu, V. Taylor, J. Wozniak, R. Stevens, T. Brettin, Fangfang Xia","doi":"10.1145/3337821.3337905","DOIUrl":"https://doi.org/10.1145/3337821.3337905","url":null,"abstract":"Training scientific deep learning models requires the significant compute power of high-performance computing systems. In this paper, we analyze the performance characteristics of the benchmarks from the exploratory research project CANDLE (Cancer Distributed Learning Environment) with a focus on the hyperparameters epochs, batch sizes, and learning rates. We present the parallel methodology that uses the distributed deep learning framework Horovod to parallelize the CANDLE benchmarks. We then use scaling strategies for both epochs and batch size with linear learning rate scaling to investigate how they impact the execution time and accuracy as well as the power, energy, and scalability of the parallel CANDLE benchmarks under conditions of strong scaling and weak scaling on the IBM Power9 heterogeneous system Summit at Oak Ridge National Laboratory and the Cray XC40 Theta at Argonne National Laboratory. This study provides insights into how to set the proper numbers of epochs, batch sizes, and compute resources for these benchmarks to preserve the high accuracy and to reduce the execution time of the benchmarks. We identify the data-loading performance bottleneck and then improve the performance and energy for better scalability. Results with the modified benchmarks on Summit indicate up to 78.25% in performance improvement and up to 78% in energy saving under strong scaling on up to 384 GPUs, and up to 79.5% in performance improvement and up to 77.11% in energy saving under weak scaling on up to 3,072 GPUs. On Theta, we achieve up to 45.22% performance improvement and up to 41.78% in energy saving under strong scaling on up to 384 nodes. Moreover, the modification dramatically reduces the broadcast overhead.","PeriodicalId":405273,"journal":{"name":"Proceedings of the 48th International Conference on Parallel Processing","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-08-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134042405","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}