Proceedings of the 26th International Symposium on High-Performance Parallel and Distributed Computing — Latest Publications

CuMF_SGD: Parallelized Stochastic Gradient Descent for Matrix Factorization on GPUs
Xiaolong Xie, Wei Tan, L. Fong, Yun Liang
Stochastic gradient descent (SGD) is widely used by many machine learning algorithms. It is efficient for big data applications due to its low algorithmic complexity. SGD is inherently serial and its parallelization is not trivial. How to parallelize SGD on many-core architectures (e.g. GPUs) for high efficiency is a big challenge. In this paper, we present cuMF_SGD, a parallelized SGD solution for matrix factorization on GPUs. We first design high-performance GPU computation kernels that accelerate individual SGD updates by exploiting model parallelism. We then design efficient schemes that parallelize SGD updates by exploiting data parallelism. Finally, we scale cuMF_SGD to large data sets that cannot fit into one GPU's memory. Evaluations on three public data sets show that cuMF_SGD outperforms existing solutions, including a 64-node CPU system, by a large margin using only one GPU card.
DOI: 10.1145/3078597.3078602
Citations: 37
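The update rule cuMF_SGD parallelizes is the classic SGD step for matrix factorization. Below is a minimal serial sketch of that step in Python/NumPy, not the paper's GPU kernels; the learning rate, regularization, and toy data are illustrative assumptions.

```python
import numpy as np

def sgd_mf(ratings, n_users, n_items, k=4, lr=0.05, reg=0.02, epochs=50):
    """Plain SGD for matrix factorization: R ~ P @ Q^T.
    ratings: list of (user, item, value) triples.
    This is the serial update that systems like cuMF_SGD accelerate;
    all parameter values here are illustrative, not from the paper."""
    rng = np.random.default_rng(0)
    P = rng.normal(scale=0.1, size=(n_users, k))
    Q = rng.normal(scale=0.1, size=(n_items, k))
    for _ in range(epochs):
        for u, i, r in ratings:
            err = r - P[u] @ Q[i]
            p_old = P[u].copy()
            # Each update touches only one row of P and one row of Q,
            # which is what makes parallelizing across samples attractive.
            P[u] += lr * (err * Q[i] - reg * P[u])
            Q[i] += lr * (err * p_old - reg * Q[i])
    return P, Q

# Toy usage: a 3x3 rating matrix with 4 observed entries.
ratings = [(0, 0, 5.0), (0, 1, 3.0), (1, 1, 4.0), (2, 2, 1.0)]
P, Q = sgd_mf(ratings, n_users=3, n_items=3)
print(P[0] @ Q[0])  # moves toward the observed value 5.0
```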
ArrayUDF: User-Defined Scientific Data Analysis on Arrays
Bin Dong, Kesheng Wu, S. Byna, Jialin Liu, Weijie Zhao, Florin Rusu
User-Defined Functions (UDF) allow application programmers to specify analysis operations on data, while leaving the data management tasks to the system. This general approach enables numerous custom analysis functions and is at the heart of modern Big Data systems. Even though the UDF mechanism can theoretically support arbitrary operations, a wide variety of common operations -- such as computing the moving average of a time series, the vorticity of a fluid flow, etc. -- are hard to express and slow to execute. Since these operations are traditionally performed on multi-dimensional arrays, we propose to extend the expressiveness of structural locality for supporting UDF operations on arrays. We further propose an in situ UDF mechanism, called ArrayUDF, to implement the structural locality. ArrayUDF allows users to define computations on adjacent array cells without the use of join operations and executes the UDF directly on arrays stored in data files without requiring their content to be loaded into a data management system. Additionally, we present a thorough theoretical analysis of the data access cost to exploit the structural locality, which enables ArrayUDF to automatically select the best array partitioning strategy for a given UDF operation. In a series of performance evaluations on large scientific datasets, we have observed that -- using the generic UDF interface -- ArrayUDF consistently outperforms Spark, SciDB, and RasDaMan.
DOI: 10.1145/3078597.3078599
Citations: 27
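The structural-locality idea — a UDF defined over a cell and its neighborhood, applied directly to an array — can be illustrated with a moving-average stencil. This is a hypothetical in-memory sketch; ArrayUDF's actual API, in situ file access, and partitioning logic are not reproduced here.

```python
import numpy as np

def apply_stencil_udf(arr, udf, halo=1):
    """Apply a user-defined function over each interior cell and its
    neighborhood, in the spirit of ArrayUDF's structural locality.
    `udf` receives a window of width 2*halo+1 centered on the cell.
    The real system runs in situ over file-backed arrays and picks
    a partitioning strategy automatically; this sketch does neither."""
    out = arr.copy()
    for i in range(halo, len(arr) - halo):
        out[i] = udf(arr[i - halo:i + halo + 1])
    return out

# Example UDF from the abstract: moving average of a time series.
series = np.array([1.0, 2.0, 6.0, 4.0, 5.0])
print(apply_stencil_udf(series, lambda w: w.mean()))  # [1. 3. 4. 5. 5.]
```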
Using Scientific Computing to Advance Wildland Fire Monitoring and Prediction
J. Coen
New technologies have transformed our understanding of wildland fire behavior, providing a better ability to observe fires from a variety of platforms, simulate their growth with computational models, and interpret their frequency and controls in a global context. These tools have shown how wildland fires are among the extremes of weather events and can produce behaviors such as fire whirls, blow-ups, bursts of flame along the surface, and winds ten times stronger than ambient conditions, all of which result from the interactions between a fire and its atmospheric environment. I will highlight current research in integrated weather--wildland fire computational modeling, fire detection, and observation, and their application to understanding and prediction. Coupled weather-wildland fire models tie numerical weather prediction models to wildland fire behavior modules to simulate the impact of a fire on the atmosphere and the subsequent feedback of these fire-induced winds on fire behavior, i.e. how a fire "creates its own weather". NCAR's CAWFE® modeling system has been used to explain fundamental fire phenomena and reproduce the unfolding of past fire events. Recent work, in which CAWFE has been integrated with satellite-based active fire detection data, addresses the challenges of applying it as an operational forecast tool. This newer generation of tools has brought many goals within sight -- rapid fire detection, nearly ubiquitous monitoring, and recognition that many of the distinctive characteristics of fire events are reproducible and perhaps predictable in real time. Concurrently, these more complex tools raise new challenges. I conclude with innovative model-data fusion approaches to overcome some of these remaining puzzles.
DOI: 10.1145/3078597.3078619
Citations: 0
To Push or To Pull: On Reducing Communication and Synchronization in Graph Computations
Maciej Besta, Michal Podstawski, Linus Groner, Edgar Solomonik, T. Hoefler
We reduce the cost of communication and synchronization in graph processing by analyzing the fastest way to process graphs: pushing the updates to a shared state or pulling the updates to a private state. We investigate the applicability of this push-pull dichotomy to various algorithms and its impact on complexity, performance, and the number of locks, atomics, and reads/writes used. We consider 11 graph algorithms, 3 programming models, 2 graph abstractions, and various families of graphs. The conducted analysis illustrates surprising differences between push and pull variants of different algorithms in performance, speed of convergence, and code complexity; the insights are backed up by performance data from hardware counters. We use these findings to illustrate which variant is faster for each algorithm and to develop generic strategies that enable even higher speedups. Our insights can be used to accelerate graph processing engines or libraries on both massively-parallel shared-memory machines as well as distributed-memory systems.
DOI: 10.1145/3078597.3078616
Citations: 126
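The push-pull dichotomy is easiest to see on PageRank: push scatters a vertex's contribution to its out-neighbors (shared-state writes that need synchronization in parallel settings), while pull gathers from in-neighbors (each vertex writes only its own slot). A minimal serial sketch of the two variants, with an illustrative toy graph and standard damping factor; this is generic PageRank, not the paper's implementations.

```python
def pagerank_push(out_adj, iters=20, d=0.85):
    """Push variant: scatter rank along out-edges. In parallel code the
    writes to `nxt` would be the contended, atomic-needing operations."""
    n = len(out_adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        nxt = [(1.0 - d) / n] * n
        for v, outs in enumerate(out_adj):
            share = d * rank[v] / max(len(outs), 1)
            for u in outs:
                nxt[u] += share  # shared-state write (push)
        rank = nxt
    return rank

def pagerank_pull(in_adj, out_deg, iters=20, d=0.85):
    """Pull variant: gather from in-edges. Writes stay private to each
    vertex, so no atomics are needed, at the cost of more reads."""
    n = len(in_adj)
    rank = [1.0 / n] * n
    for _ in range(iters):
        rank = [(1.0 - d) / n + d * sum(rank[u] / out_deg[u] for u in ins)
                for ins in in_adj]
    return rank

# Tiny 3-vertex graph: 0->1, 0->2, 1->2, 2->0.
out_adj = [[1, 2], [2], [0]]
in_adj = [[2], [0], [0, 1]]
out_deg = [2, 1, 1]
print(pagerank_push(out_adj))   # both variants converge to the
print(pagerank_pull(in_adj, out_deg))  # same ranking
```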
Towards a More Complete Understanding of SDC Propagation
Jon C. Calhoun, M. Snir, Luke N. Olson, W. Gropp
With the rate of errors that can silently affect an application's state/output expected to increase on future HPC machines, numerous application-level detection and recovery schemes have been proposed. Recovery is more efficient when errors are contained and affect only part of the computation's state. Containment is usually achieved by verifying all information leaking out of a statically defined containment domain, which is an expensive procedure. Alternatively, error propagation can be analyzed to bound the domain that is affected by a detected error. This paper investigates how silent data corruption (SDC) due to soft errors propagates through three HPC applications: HPCCG, Jacobi, and CoMD. To allow for a more detailed view of error propagation, the paper tracks propagation at the instruction and application variable level. The impact of detection latency on error propagation is shown along with an application's ability to recover. Finally, the impact of compiler optimizations is explored along with the impact of local problem size on error propagation.
DOI: 10.1145/3078597.3078617
Citations: 25
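One way to study how a silent bit flip propagates, in the spirit of this kind of fault-injection methodology, is to flip one bit of a floating-point value mid-run and track how the divergence spreads. A hypothetical sketch on a 1-D Jacobi iteration (one of the paper's benchmarks); the injection site, iteration, and bit position are arbitrary choices for illustration.

```python
import struct

def flip_bit(x, bit):
    """Flip one bit of a float64 to emulate a soft error."""
    (as_int,) = struct.unpack("<Q", struct.pack("<d", x))
    (flipped,) = struct.unpack("<d", struct.pack("<Q", as_int ^ (1 << bit)))
    return flipped

def jacobi_1d(u, iters, inject_at=None, bit=52):
    """1-D Jacobi smoothing with fixed boundary values. Optionally
    inject a bit flip at (iteration, index) to watch SDC spread."""
    u = list(u)
    for t in range(iters):
        nxt = u[:]
        for i in range(1, len(u) - 1):
            nxt[i] = 0.5 * (u[i - 1] + u[i + 1])
        if inject_at is not None and t == inject_at[0]:
            nxt[inject_at[1]] = flip_bit(nxt[inject_at[1]], bit)
        u = nxt
    return u

grid = [0.0] + [1.0] * 8 + [0.0]
clean = jacobi_1d(grid, iters=30)
faulty = jacobi_1d(grid, iters=30, inject_at=(5, 4))
# In a nearest-neighbor stencil, corruption spreads at most one cell
# per iteration outward from the injection index.
print([round(abs(a - b), 4) for a, b in zip(clean, faulty)])
```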
Enabling Workflow-Aware Scheduling on HPC Systems
G. P. R. Álvarez, E. Elmroth, Per-Olov Östberg, L. Ramakrishnan
Scientific workflows are increasingly common in the workloads of current High Performance Computing (HPC) systems. However, HPC schedulers do not incorporate workflow-specific mechanisms beyond the capacity to declare dependencies between their jobs. Thus, workflows are run as sets of batch jobs with dependencies, which induces long intermediate wait times and, consequently, long workflow turnaround times. Alternatively, to reduce their turnaround time, workflows may be submitted as single pilot jobs that are allocated their maximum required resources for their entire runtime. Pilot jobs achieve shorter turnaround times but reduce the HPC system's utilization because resources may idle during the workflow's execution. We present a workflow-aware scheduling (WoAS) system that enables existing scheduling algorithms to exploit fine-grained information on a workflow's resource requirements and structure without modification. The current implementation of WoAS is integrated into Slurm, a widely used HPC batch scheduler. We evaluate the system in a simulator with real and synthetic workflows and a synthetic baseline workload that captures job patterns observed over three years of workload data from Edison, a large supercomputer hosted at the National Energy Research Scientific Computing Center. Our results show that WoAS reduces workflow turnaround times and improves system utilization without significantly slowing down conventional jobs.
DOI: 10.1145/3078597.3078604
Citations: 28
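The trade-off WoAS targets can be made concrete with a back-of-the-envelope model: a three-stage workflow submitted as chained batch jobs pays a queue wait per stage but uses exactly the resources each stage needs, while a pilot job pays one wait but holds its peak allocation for the whole run. All numbers below are illustrative assumptions, not measurements from the paper.

```python
# Toy model of the batch-chain vs. pilot-job trade-off.
stages = [  # (runtime_hours, nodes_needed) -- made-up workflow shape
    (2.0, 10),
    (4.0, 100),
    (1.0, 10),
]
queue_wait = 3.0  # assumed average wait per submission, in hours

# Chained batch jobs: one queue wait per stage, exact resources per stage.
chained_turnaround = sum(queue_wait + rt for rt, _ in stages)
chained_node_hours = sum(rt * nodes for rt, nodes in stages)

# Pilot job: one queue wait, but peak resources held for the whole run.
peak = max(nodes for _, nodes in stages)
total_rt = sum(rt for rt, _ in stages)
pilot_turnaround = queue_wait + total_rt
pilot_node_hours = total_rt * peak  # nodes idle during the narrow stages

print(f"chained: {chained_turnaround} h turnaround, "
      f"{chained_node_hours} node-hours")   # 16.0 h, 430 node-hours
print(f"pilot:   {pilot_turnaround} h turnaround, "
      f"{pilot_node_hours} node-hours")     # 10.0 h, 700 node-hours
```

The pilot job finishes sooner but wastes allocated node-hours; a workflow-aware scheduler aims to get the short turnaround without the idle allocation.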
Predicting Output Performance of a Petascale Supercomputer
Bing Xie, Yezhou Huang, J. Chase, J. Choi, S. Klasky, J. Lofstead, S. Oral
In this paper, we develop a model for predicting the output performance of supercomputer file systems under production load. Our target environment is Titan---the 3rd fastest supercomputer in the world---and its Lustre-based multi-stage write path. We observe from Titan that although output performance is highly variable at small time scales, the mean performance is stable and consistent over typical application run times. Moreover, we find that output performance is non-linearly related to its correlated parameters due to interference and saturation on individual stages on the path. These observations enable us to build a predictive model of expected write times of output patterns and I/O configurations, using feature transformations to capture non-linear relationships. We identify the candidate features based on the structure of the Lustre/Titan write path, and use feature transformation functions to produce a model space with 135,000 candidate models. By searching for the minimal mean square error in this space we identify a good model and show that it is effective.
DOI: 10.1145/3078597.3078614
Citations: 50
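The modeling approach — apply nonlinear transforms to candidate features, fit each combination by least squares, and keep the one minimizing mean squared error — can be sketched directly. The features, transforms, and data below are synthetic stand-ins, not the Lustre/Titan features or the paper's 135,000-model space.

```python
import itertools
import numpy as np

rng = np.random.default_rng(1)
# Synthetic stand-ins for two I/O features (e.g., write size, stripe count).
X = rng.uniform(1.0, 100.0, size=(200, 2))
y = 0.5 * np.log(X[:, 0]) + 3.0 / X[:, 1] + rng.normal(0, 0.05, 200)

transforms = {
    "id": lambda v: v,
    "log": np.log,
    "inv": lambda v: 1.0 / v,
    "sqrt": np.sqrt,
}

best = None
# Enumerate one transform per feature, fit least squares, keep min MSE.
for combo in itertools.product(transforms, repeat=X.shape[1]):
    F = np.column_stack(
        [transforms[name](X[:, j]) for j, name in enumerate(combo)]
        + [np.ones(len(X))]  # intercept term
    )
    coef, *_ = np.linalg.lstsq(F, y, rcond=None)
    mse = np.mean((F @ coef - y) ** 2)
    if best is None or mse < best[0]:
        best = (mse, combo)

print(f"best transforms {best[1]}, MSE {best[0]:.4f}")
# Expect ('log', 'inv') to win, matching the generating function.
```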
Diagnosing Machine Learning Pipelines with Fine-grained Lineage
Zhao Zhang, Evan R. Sparks, M. Franklin
We present the Hippo system to enable the diagnosis of distributed machine learning (ML) pipelines by leveraging fine-grained data lineage. Hippo exposes a concise yet powerful API, derived from primitive lineage types, to capture fine-grained data lineage for each data transformation. It records the input datasets, the output datasets and the cell-level mapping between them. It also collects sufficient information that is needed to reproduce the computation. Hippo efficiently enables common ML diagnosis operations such as code debugging, result analysis, data anomaly removal, and computation replay. By exploiting the metadata separation and high-order function encoding strategies, we observe an O(10^3)x total improvement in lineage storage efficiency vs. the baseline of cell-wise mapping recording while maintaining the lineage integrity. Hippo can answer the real use case lineage queries within a few seconds, which is low enough to enable interactive diagnosis of ML pipelines.
DOI: 10.1145/3078597.3078603
Citations: 19
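Cell-level lineage amounts to recording, alongside each transformation, which input cells produced each output cell, so that an anomalous result can be traced back and the computation replayed. A hypothetical sketch of that bookkeeping for a simple map operation; Hippo's actual API, primitive lineage types, metadata separation, and encoding optimizations are not shown.

```python
def traced_map(fn, dataset, lineage):
    """Apply `fn` element-wise while recording cell-level lineage:
    output cell i came from input cell i. A 1-to-1 map is the simplest
    lineage shape; joins or aggregations would record many-to-one
    mappings instead."""
    out = [fn(x) for x in dataset]
    lineage.append({
        "op": fn.__name__,
        "mapping": [(i, i) for i in range(len(dataset))],  # (in, out)
    })
    return out

def backtrace(lineage, out_index):
    """Walk the lineage log backwards to find the input cells that
    produced a given (possibly anomalous) output cell."""
    idx = {out_index}
    for record in reversed(lineage):
        idx = {i for i, o in record["mapping"] if o in idx}
    return idx

def square(x):
    return x * x

log = []
result = traced_map(square, [1, 2, 3, 4], log)
print(backtrace(log, out_index=2))  # -> {2}: the cell to inspect/replay
```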
TCP Throughput Profiles Using Measurements over Dedicated Connections
N. Rao, Qiang Liu, S. Sen, D. Towsley, Gayane Vardoyan, R. Kettimuthu, Ian T Foster
Wide-area data transfers in high-performance computing infrastructures are increasingly being carried over dynamically provisioned dedicated network connections that provide high capacities with no competing traffic. We present extensive TCP throughput measurements and time traces over a suite of physical and emulated 10 Gbps connections with 0-366 ms round-trip times (RTTs). Contrary to the general expectation, they show significant statistical and temporal variations, in addition to the overall dependencies on the congestion control mechanism, buffer size, and the number of parallel streams. We analyze several throughput profiles that have highly desirable concave regions wherein the throughput decreases slowly with RTTs, in stark contrast to the convex profiles predicted by various TCP analytical models. We present a generic throughput model that abstracts the ramp-up and sustainment phases of TCP flows, which provides insights into qualitative trends observed in measurements across TCP variants: (i) slow-start followed by well-sustained throughput leads to concave regions; (ii) large buffers and multiple parallel streams expand the concave regions in addition to improving the throughput; and (iii) stable throughput dynamics, indicated by a smoother Poincaré map and smaller Lyapunov exponents, lead to wider concave regions. These measurements and analytical results together enable us to select a TCP variant and its parameters for a given connection to achieve high throughput with statistical guarantees.
DOI: 10.1145/3078597.3078615
Citations: 18
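The ramp-up/sustainment abstraction can be written down directly: slow start roughly doubles the sending rate each RTT until the peak rate is reached, after which throughput is sustained, so a transfer dominated by the sustained phase decays only slowly with RTT. A minimal sketch under those assumptions; the constants and the half-peak ramp approximation are illustrative choices, not the paper's fitted model.

```python
import math

def mean_throughput(rtt_s, peak_gbps, transfer_s, start_mss_per_rtt=1.0):
    """Two-phase abstraction of a TCP flow: exponential slow-start ramp
    (rate doubles each RTT) followed by sustained peak rate. Returns
    mean throughput in Gbps over the transfer duration."""
    mss_bits = 1500 * 8
    start_gbps = start_mss_per_rtt * mss_bits / rtt_s / 1e9
    rounds = max(math.log2(peak_gbps / start_gbps), 0.0)
    ramp_s = rounds * rtt_s  # time for slow start to reach peak
    if transfer_s <= ramp_s:
        return peak_gbps * transfer_s / ramp_s / 2  # still ramping
    # Approximate data sent during the ramp as half of peak * ramp time.
    data = peak_gbps * ramp_s / 2 + peak_gbps * (transfer_s - ramp_s)
    return data / transfer_s

for rtt_ms in (10, 50, 100, 200, 366):
    t = mean_throughput(rtt_ms / 1000.0, peak_gbps=9.6, transfer_s=60.0)
    print(f"RTT {rtt_ms:3d} ms -> mean {t:.2f} Gbps")
# A long transfer dominated by the sustained phase yields the concave,
# slowly decaying throughput-vs-RTT profile described above.
```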
NICE: Network-Integrated Cluster-Efficient Storage
S. Al-Kiswany, Suli Yang, A. Arpaci-Dusseau, Remzi H. Arpaci-Dusseau
We present NICE, a key-value storage system design that leverages new software-defined network capabilities to build a cluster-based, network-efficient storage system. NICE presents novel techniques to co-design network routing and multicast with storage replication, consistency, and load balancing to achieve higher efficiency, performance, and scalability. We implement the NICEKV prototype. NICEKV follows the NICE approach in designing four essential network-centric storage mechanisms: request routing, replication, consistency, and load balancing. Our evaluation shows that the proposed approach brings significant performance gains compared to the current key-value systems design: up to 7× put/get performance improvement, up to 2× reduction in network load, 3× to 9× load reduction on the storage nodes, and the elimination of scalability bottlenecks present in current designs.
DOI: 10.1145/3078597.3078612
Citations: 10
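A conventional cluster key-value store already routes requests and fans out writes by key on the host side; what NICE changes is pushing that logic into the network fabric itself. Below is a baseline sketch of the host-side logic being offloaded — a deterministic key-to-replica mapping with write fan-out — using hypothetical node names; the SDN rule installation and NICE's consistency machinery are not shown.

```python
import hashlib

NODES = ["node-a", "node-b", "node-c", "node-d"]  # hypothetical cluster
REPLICATION = 3

def replicas_for(key):
    """Deterministically map a key to an ordered replica set. A design
    like NICE moves this mapping into switch forwarding rules, so one
    client packet can be multicast to all replicas in-network."""
    h = int(hashlib.sha256(key.encode()).hexdigest(), 16)
    first = h % len(NODES)
    return [NODES[(first + i) % len(NODES)] for i in range(REPLICATION)]

def put(store, key, value):
    # Host-side emulation of the fan-out the network would perform.
    for node in replicas_for(key):
        store.setdefault(node, {})[key] = value

def get(store, key):
    # Read from the first replica; in-network steering could instead
    # balance reads across the replica set.
    return store[replicas_for(key)[0]][key]

store = {}
put(store, "user:42", "alice")
print(replicas_for("user:42"), get(store, "user:42"))
```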