Northup: Divide-and-Conquer Programming in Systems with Heterogeneous Memories and Processors
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00043
Shuai Che, Jieming Yin
In recent years we have seen rapid development on two frontiers: emerging memory technologies and accelerator architectures. Future memory systems are becoming deeper and more heterogeneous. Adopting NVM and die-stacked DRAM on each HPC node is an emerging trend, while GPUs and many-core processors are already widely deployed in today's supercomputers. However, software for programming and managing a system that combines heterogeneous memories and processors is still at an early stage of development. How to exploit such a deep memory hierarchy and heterogeneous processors with minimal programming effort is an important open problem. In this paper, we propose Northup, a programming and runtime framework that uses a divide-and-conquer approach to map an application efficiently onto heterogeneous systems. The proposed solution provides a portable layer that abstracts the system architecture, offering the flexibility to easily integrate new memories and processor nodes. We show that Northup's out-of-core execution with SSD is, on average, only 17% slower than in-memory processing for the evaluated applications.
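To make the divide-and-conquer idea concrete, the sketch below (our own illustration, not Northup's API; all function and variable names are hypothetical) processes a dataset that is too large for fast memory by staging it from slower storage in chunks and combining per-chunk partial results, which is the basic pattern behind out-of-core execution with an SSD.
```python
import numpy as np
import tempfile, os

# Minimal out-of-core divide-and-conquer sketch (not Northup's actual API):
# a large array staged on SSD is processed in chunks that fit in fast memory,
# and per-chunk partial results are combined afterwards.

def process_chunk(chunk):
    # Stand-in for the compute kernel that would be offloaded to a GPU or many-core device.
    return float(np.sum(chunk * chunk))

def out_of_core_reduce(path, n_elems, chunk_elems):
    """Divide the file into chunks, conquer each chunk, combine the results."""
    partials = []
    with open(path, "rb") as f:
        for start in range(0, n_elems, chunk_elems):
            count = min(chunk_elems, n_elems - start)
            chunk = np.fromfile(f, dtype=np.float64, count=count)
            partials.append(process_chunk(chunk))
    return sum(partials)

if __name__ == "__main__":
    n = 1_000_000
    data = np.random.rand(n)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        data.tofile(tmp.name)
    try:
        print(out_of_core_reduce(tmp.name, n, chunk_elems=100_000))
    finally:
        os.unlink(tmp.name)
```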
{"title":"Northup: Divide-and-Conquer Programming in Systems with Heterogeneous Memories and Processors","authors":"Shuai Che, Jieming Yin","doi":"10.1109/IPDPS.2019.00043","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00043","url":null,"abstract":"In recent years we have seen rapid development in both frontiers of emerging memory technologies and accelerator architectures. Future memory systems are becoming deeper and more heterogeneous. Adopting NVM and die-stacked DRAM on each HPC node is a new trend of development. On the other hand, GPUs and many-core processors have been widely deployed in today's supercomputers. However, software for programming and managing a system that consists of heterogeneous memories and processors is still in its very early stage of development. How to exploit such deep memory hierarchy and heterogeneous processors with minimal programming effort is an important issue to address. In this paper, we propose Northup, a programming and runtime framework, using a divide-and-conquer approach to map an application efficiently to heterogeneous systems. The proposed solution presents a portable layer that abstracts the system architecture, providing flexibility to support easy integration of new memories and processor nodes. We show that Northup out-of-core execution with SSD is only an average of 17% slower than in-memory processing for the evaluated applications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126116018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous?
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00113
Yujing Ma, Florin Rusu, Martin Torres
There is increased interest, in both industry and academia, in building data analytics frameworks with advanced algebraic capabilities. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors: as multi-threaded routines for multi-core CPUs and as highly parallel kernels executed on GPUs. Stochastic gradient descent (SGD), the most popular optimization method for model training, is implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well known, there is an intense debate on which of the many SGD variants is better in practice. In this paper, we perform a comprehensive experimental study of parallel SGD for training machine learning models. We consider the impact of three factors (computing architecture: multi-core CPU or GPU; synchronous or asynchronous model updates; and data sparsity) on three measures: hardware efficiency, statistical efficiency, and time to convergence. We draw several interesting findings from our experiments with logistic regression (LR), support vector machines (SVM), and deep neural networks (MLP) on five real datasets. As expected, the GPU always outperforms the parallel CPU for synchronous SGD. The gap is, however, only 2-5X for simple models, and below 7X even for fully connected deep networks. For asynchronous SGD, the CPU is undoubtedly the optimal solution, outperforming the GPU in time to convergence even when the GPU has a speedup of 10X or more. The choice between synchronous GPU and asynchronous CPU is not straightforward and depends on the task and the characteristics of the data. Thus, the CPU should not be easily discarded for machine learning workloads. We hope that our insights provide a useful guide for applying parallel SGD in practice and, more importantly, for choosing the appropriate computing architecture.
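As a reference point for the two update strategies compared in the paper, the toy sketch below contrasts synchronous mini-batch SGD with asynchronous, Hogwild-style SGD for logistic regression. It is a minimal NumPy illustration under our own simplifying assumptions (a shared weight vector updated without locks by Python threads), not the authors' implementation.
```python
import numpy as np
from threading import Thread

def grad(w, X, y):
    # Logistic-regression gradient: X^T (sigmoid(Xw) - y) / n.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def sync_sgd(X, y, lr=0.1, epochs=50, batch=64):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for s in range(0, len(y), batch):
            b = idx[s:s + batch]
            w -= lr * grad(w, X[b], y[b])        # one global update per mini-batch
    return w

def async_sgd(X, y, lr=0.1, epochs=50, workers=4):
    w = np.zeros(X.shape[1])                     # shared state, updated without locks
    def worker():
        for _ in range(epochs * len(y) // workers):
            i = np.random.randint(len(y))
            w[:] -= lr * grad(w, X[i:i + 1], y[i:i + 1])   # in-place, Hogwild-style
    threads = [Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = (X @ rng.normal(size=10) > 0).astype(float)
    for name, fn in [("sync", sync_sgd), ("async", async_sgd)]:
        w = fn(X, y)
        print(name, "accuracy:", round(float(np.mean(((X @ w) > 0) == y)), 3))
```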
{"title":"Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous?","authors":"Yujing Ma, Florin Rusu, Martin Torres","doi":"10.1109/IPDPS.2019.00113","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00113","url":null,"abstract":"There is an increased interest in building data analytics frameworks with advanced algebraic capabilities both in industry and academia. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors—as multi-thread routines for multi-core CPUs and as highly-parallel kernels executed on GPU. Stochastic gradient descent (SGD) is the most popular optimization method for model training implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well-known, there is an intense debate on which of the many SGD variants is better in practice. In this paper, we perform a comprehensive experimental study of parallel SGD for training machine learning models. We consider the impact of three factors – computing architecture (multi-core CPU or GPU), synchronous or asynchronous model updates, and data sparsity – on three measures—hardware efficiency, statistical efficiency, and time to convergence. We draw several interesting findings from our experiments with logistic regression (LR), support vector machines (SVM), and deep neural nets (MLP) on five real datasets. As expected, GPU always outperforms parallel CPU for synchronous SGD. The gap is, however, only 2-5X for simple models, and below 7X even for fully-connected deep nets. For asynchronous SGD, CPU is undoubtedly the optimal solution, outperforming GPU in time to convergence even when the GPU has a speedup of 10X or more. The choice between synchronous GPU and asynchronous CPU is not straightforward and depends on the task and the characteristics of the data. Thus, CPU should not be easily discarded for machine learning workloads. We hope that our insights provide a useful guide for applying parallel SGD in practice and – more importantly – choosing the appropriate computing architecture.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121874871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SunwayLB: Enabling Extreme-Scale Lattice Boltzmann Method Based Computing Fluid Dynamics Simulations on Sunway TaihuLight
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00065
Zhao Liu, Xuesen Chu, Xiaojing Lv, Hongsong Meng, Shupeng Shi, Wenji Han, Jingheng Xu, H. Fu, Guangwen Yang
The Lattice Boltzmann Method (LBM) is a relatively new class of Computational Fluid Dynamics methods. In this paper, we report our work on SunwayLB, which enables LBM-based solutions aimed at industrial applications. We propose several techniques to boost the simulation speed and improve the scalability of SunwayLB, including a customized multi-level domain decomposition and data-sharing scheme, a carefully orchestrated strategy to fuse kernels with different performance constraints for a more balanced workload, and optimization strategies for assembly code, which together bring up to a 137x speedup. Based on these optimization schemes, we manage to perform the largest direct numerical simulation, involving up to 5.6 trillion lattice cells, achieving 11,245 billion cell updates per second (GLUPS), 77% memory bandwidth utilization, and a sustained performance of 4.7 PFlops. We also demonstrate a series of computational experiments on extremely large-scale fluid flows, as examples of real-world applications, to validate the correctness and performance of our work. The results show that SunwayLB is a practical solution for industrial applications.
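For readers unfamiliar with LBM, the following sketch shows a standard D2Q9 stream-collide step with BGK collision, which is the per-cell work that SunwayLB scales to trillions of lattice cells; it is a textbook-style illustration and makes no attempt to reflect SunwayLB's actual kernels or optimizations.
```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their standard weights.
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, ux, uy):
    cu = 3.0 * (C[:, 0, None, None] * ux + C[:, 1, None, None] * uy)
    usq = 1.5 * (ux**2 + uy**2)
    return W[:, None, None] * rho * (1.0 + cu + 0.5 * cu**2 - usq)

def stream_collide(f, tau=0.6):
    # Streaming: shift each distribution along its lattice velocity (periodic domain).
    for i, (cx, cy) in enumerate(C):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    # Collision: relax towards the local equilibrium (BGK operator).
    rho = f.sum(axis=0)
    ux = (f * C[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * C[:, 1, None, None]).sum(axis=0) / rho
    f += (equilibrium(rho, ux, uy) - f) / tau
    return f

if __name__ == "__main__":
    nx, ny = 64, 64
    f = equilibrium(np.ones((nx, ny)), np.zeros((nx, ny)), np.zeros((nx, ny)))
    for _ in range(100):
        f = stream_collide(f)
    print("mass conserved:", np.isclose(f.sum(), nx * ny))
```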
{"title":"SunwayLB: Enabling Extreme-Scale Lattice Boltzmann Method Based Computing Fluid Dynamics Simulations on Sunway TaihuLight","authors":"Zhao Liu, Xuesen Chu, Xiaojing Lv, Hongsong Meng, Shupeng Shi, Wenji Han, Jingheng Xu, H. Fu, Guangwen Yang","doi":"10.1109/IPDPS.2019.00065","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00065","url":null,"abstract":"The Lattice Boltzmann Method (LBM) is a relatively new class of Computational Fluid Dynamics methods. In this paper, we report our work on SunwayLB, which enables LBM based solutions aiming for industrial applications. We propose several techniques to boost the simulation speed and improve the scalability of SunwayLB, including a customized multi-level domain decomposition and data sharing scheme, a carefully orchestrated strategy to fuse kernels with different performance constraints for a more balanced workload, and optimization strategies for assembly code, which bring up to 137x speedup. Based on these optimization schemes, we manage to perform the largest direct numerical simulation which involves up to 5.6 trillion lattice cells, achieving 11,245 billion cell updates per second (GLUPS), 77% memory bandwidth utilization and a sustained performance of 4.7 PFlops. We also demonstrate a series of computational experiments for extreme-large scale fluid flow, as examples of real-world applications, to check the validity and performance of our work. The results show that SunwayLB is competent for a practical solution for industrial applications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130571202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NCQ-Aware I/O Scheduling for Conventional Solid State Drives
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00062
Haoqiang Fan, Song Wu, Shadi Ibrahim, Ximing Chen, Hai Jin, Jiang Xiao, Haibing Guan
While current fairness-driven I/O schedulers are successful in allocating an equal time/resource share to concurrent workloads, they ignore the I/O request queueing and reordering in the storage device layer, such as Native Command Queueing (NCQ). As a result, requests from different workloads do not have an equal chance to enter the NCQ (an NCQ conflict), and fairness is violated. We address this issue by providing the first systematic empirical analysis of how NCQ affects I/O fairness and SSD utilization, and accordingly propose an NCQ-aware I/O scheduling scheme, NASS. The basic idea of NASS is to carefully control the request dispatch of workloads to relieve NCQ conflicts and improve NCQ utilization. NASS builds on two core components: an evaluation model to quantify important features of a workload, and a dispatch control algorithm to set the appropriate request dispatch for running workloads. We integrate NASS into four state-of-the-art I/O schedulers and evaluate its effectiveness using widely used benchmarks and real-world applications. The results show that with NASS, I/O schedulers achieve 11-23% better fairness while improving device utilization by 9-29%.
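The sketch below illustrates, in a very simplified form, the kind of dispatch control NASS is built around: each backlogged workload receives a share of the device-queue budget per dispatch window, so a heavy workload cannot monopolize the NCQ. The class, the budget rule, and all names are our own assumptions, not the NASS algorithm.
```python
from collections import deque

# Toy dispatch-control loop in the spirit of NCQ-aware scheduling (not NASS itself):
# per dispatch window, the device-queue budget is split across backlogged workloads.

class DispatchController:
    def __init__(self, ncq_depth, workloads):
        self.ncq_depth = ncq_depth
        self.queues = {w: deque() for w in workloads}

    def submit(self, workload, request):
        self.queues[workload].append(request)

    def dispatch_window(self):
        """Pick up to ncq_depth requests, splitting the budget across workloads."""
        backlogged = [w for w, q in self.queues.items() if q]
        if not backlogged:
            return []
        budget = max(1, self.ncq_depth // len(backlogged))
        window = []
        for w in backlogged:
            for _ in range(min(budget, len(self.queues[w]))):
                window.append((w, self.queues[w].popleft()))
        return window[:self.ncq_depth]

if __name__ == "__main__":
    ctrl = DispatchController(ncq_depth=32, workloads=["A", "B"])
    for i in range(100):
        ctrl.submit("A", f"A-req{i}")        # heavy workload
    for i in range(10):
        ctrl.submit("B", f"B-req{i}")        # light workload
    window = ctrl.dispatch_window()
    print({w: sum(1 for ww, _ in window if ww == w) for w in ("A", "B")})
```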
{"title":"NCQ-Aware I/O Scheduling for Conventional Solid State Drives","authors":"Haoqiang Fan, Song Wu, Shadi Ibrahim, Ximing Chen, Hai Jin, Jiang Xiao, Haibing Guan","doi":"10.1109/IPDPS.2019.00062","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00062","url":null,"abstract":"While current fairness-driven I/O schedulers are successful in allocating equal time/resource share to concurrent workloads, they ignore the I/O request queueing or reordering in storage device layer, such as Native Command Queueing (NCQ). As a result, requests of different workloads cannot have an equal chance to enter NCQ (NCQ conflict) and fairness is violated. We address this issue by providing the first systematic empirical analysis on how NCQ affects I/O fairness and SSD utilization and accordingly proposing a NCQ-aware I/O scheduling scheme, NASS. The basic idea of NASS is to elaborately control the request dispatch of workloads to relieve NCQ conflict and improve NCQ utilization. NASS builds on two core components: an evaluation model to quantify important features of the workload, and a dispatch control algorithm to set the appropriate request dispatch of running workloads. We integrate NASS into four state-of-the-art I/O schedulers and evaluate its effectiveness using widely used benchmarks and real world applications. Results show that with NASS, I/O schedulers can achieve 11-23% better fairness and at the same time improve device utilization by 9-29%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131743459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Approach for Parallel Loading and Pre-Processing of Unstructured Meshes Stored in Spatially Scattered Fashion
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00084
Ondrej Meca, L. Ríha, T. Brzobohatý
This paper presents a workflow for the parallel loading of database files containing sequentially stored unstructured meshes, a format generally not amenable to efficient parallel reading. In such a file, consecutive elements are not spatially co-located, and their respective nodes sit at unknown positions in the file. This makes parallel loading challenging, since adjacent elements end up on different MPI processes and their respective nodes on unknown MPI processes. These two facts lead to high communication overhead and very poor scalability if not addressed properly. In the standard approach, a sequentially stored mesh is sequentially converted to a particular parallel format accepted by a solver, which represents a significant bottleneck. Our proposed algorithm demonstrates that this bottleneck can be overcome, since it is able to (i) efficiently recreate an arbitrarily stored sequential mesh in the distributed memory of a supercomputer without gathering the information onto a single MPI rank, and (ii) prepare the mesh for massively parallel solvers.
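The essence of such a workflow can be illustrated with a small mpi4py sketch: every rank first reads a contiguous block of the file, then ranks exchange node requests all-to-all so that no single rank ever gathers the whole mesh. The file layout, data structures, and names below are illustrative assumptions, not the paper's implementation.
```python
# Two-phase parallel mesh loading sketch (illustrative; not the paper's algorithm):
# phase 1, every rank reads a contiguous block of elements; phase 2, ranks exchange
# node requests all-to-all so the mesh is never gathered on a single rank.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Toy "file": after the block read, node i is owned by rank i // nodes_per_rank.
nodes_per_rank = 1000
owned = np.arange(rank * nodes_per_rank, (rank + 1) * nodes_per_rank)
coords = {int(n): (float(n), 0.0, 0.0) for n in owned}      # node id -> xyz

# Phase 1: this rank's block of elements references nodes scattered anywhere.
rng = np.random.default_rng(rank)
needed = rng.integers(0, size * nodes_per_rank, 64)

# Phase 2: bucket the requests by owning rank and exchange them all-to-all.
requests = [[int(n) for n in needed if n // nodes_per_rank == r] for r in range(size)]
incoming = comm.alltoall(requests)                 # who asks this rank for what
replies = [[coords[n] for n in req] for req in incoming]
answers = comm.alltoall(replies)                   # coordinates come back

local_coords = {n: xyz for req, ans in zip(requests, answers) for n, xyz in zip(req, ans)}
if rank == 0:
    print(f"rank 0 resolved {len(local_coords)} node coordinates without a global gather")
```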
{"title":"An Approach for Parallel Loading and Pre-Processing of Unstructured Meshes Stored in Spatially Scattered Fashion","authors":"Ondrej Meca, L. Ríha, T. Brzobohatý","doi":"10.1109/IPDPS.2019.00084","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00084","url":null,"abstract":"This paper presents a workflow for parallel loading of database files containing sequentially stored unstructured meshes that are not considered to be efficiently read in parallel. In such a file consecutive elements are not spatially located and their respective nodes are at unknown positions in the file. This makes parallel loading challenging since adjacent elements are on different MPI processes, and their respective nodes are on unknown MPI processes. These two facts lead to a high communication overhead and very poor scalability if not addressed properly. In a standard approach, a sequentially stored mesh is sequentially converted to a particular parallel format accepted by a solver. This represents a significant bottleneck. Our proposed algorithm demonstrates that this bottleneck can be overcome, since it is able to (i) efficiently recreate an arbitrary stored sequential mesh in the distributed memory of a supercomputer without gathering the information into a single MPI rank, and (ii) prepare the mesh for massively parallel solvers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133271016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aladdin: Optimized Maximum Flow Management for Shared Production Clusters
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00078
Heng Wu, Wen-bo Zhang, Yuanjia Xu, Hao Xiang, Tao Huang, Haiyang Ding, Zhenguo Zhang
The rise in popularity of long-lived applications (LLAs), such as deep learning and latency-sensitive online Web services, has brought new challenges for cluster schedulers in shared production environments. Scheduling LLAs requires support for complex placement constraints (e.g., running multiple containers of an application on different machines) and larger degrees of parallelism to enable global optimization. However, existing schedulers usually suffer from severe constraint violations, high latency, and low resource efficiency. This paper describes Aladdin, a novel cluster scheduler that maximizes resource efficiency while avoiding constraint violations: (i) it proposes a multidimensional and nonlinear capacity function to support constraint expressions; (ii) it applies an optimized maximum flow algorithm to improve resource efficiency. Experiments with an Alibaba workload trace from a 10,000-machine cluster show that Aladdin reduces violated constraints by as much as 20%. Meanwhile, it improves resource efficiency by 50% compared with state-of-the-art schedulers.
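To illustrate how placement can be cast as a flow problem, the sketch below builds a small flow network with networkx: containers draw unit flow from a source, edges to machines encode which placements the constraints allow, and machine capacities bound free slots. The graph shape and capacities are our own simplification, not Aladdin's capacity function or optimized max-flow algorithm.
```python
import networkx as nx

# Minimal flow-network formulation of constrained container placement
# (our own illustration, not Aladdin's exact model).

def place_containers(containers, machines, allowed, free_slots):
    G = nx.DiGraph()
    for c in containers:
        G.add_edge("src", c, capacity=1)             # each container is placed once
    for c, ms in allowed.items():
        for m in ms:                                 # constraints (e.g., anti-affinity)
            G.add_edge(c, m, capacity=1)             # are encoded by which edges exist
    for m in machines:
        G.add_edge(m, "sink", capacity=free_slots[m])
    value, flow = nx.maximum_flow(G, "src", "sink")
    placement = {c: m for c in containers for m, f in flow[c].items() if f > 0}
    return value, placement

if __name__ == "__main__":
    containers = ["app1-c1", "app1-c2", "app2-c1"]
    machines = ["m1", "m2"]
    # app1's two containers must land on different machines (anti-affinity).
    allowed = {"app1-c1": ["m1"], "app1-c2": ["m2"], "app2-c1": ["m1", "m2"]}
    free = {"m1": 2, "m2": 1}
    placed, plan = place_containers(containers, machines, allowed, free)
    # All 3 containers fit: app2-c1 is forced onto m1 because m2's slot goes to app1-c2.
    print(placed, plan)
```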
{"title":"Aladdin: Optimized Maximum Flow Management for Shared Production Clusters","authors":"Heng Wu, Wen-bo Zhang, Yuanjia Xu, Hao Xiang, Tao Huang, Haiyang Ding, Zhenguo Zhang","doi":"10.1109/IPDPS.2019.00078","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00078","url":null,"abstract":"The rise in popularity of long-lived applications (LLAs), such as deep learning and latency-sensitive online Web services, has brought new challenges for cluster schedulers in shared production environments. Scheduling LLAs needs to support complex placement constraints (e.g., to run multiple containers of an application on different machines) and larger degrees of parallelism to provide global optimization. But existing schedulers usually suffer severe constraint violations, high latency and low resource efficiency. This paper describes Aladdin, a novel cluster scheduler that can maximize resource efficiency while avoiding constraint violations: (i) it proposes a multidimensional and nonlinear capacity function to support constraint expressions; (ii) it applies an optimized maximum flow algorithm to improve resource efficiency. Experiments with an Alibaba workload trace from a 10,000-machine cluster show that Aladdin can reduce violated constraints by as mush as 20%. Meanwhile, it improves resource efficiency by 50% compared with state-of-the-art schedulers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127854326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FastJoin: A Skewness-Aware Distributed Stream Join System
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00111
Shunjie Zhou, Fan Zhang, Hanhua Chen, Hai Jin, B. Zhou
In the big data era, many applications, such as stock trading and online advertisement analysis, need to perform fast and accurate join operations on large-scale real-time data streams. To achieve high throughput and low latency, distributed stream join systems explore efficient stream partitioning strategies to execute the complex stream join procedure in parallel. Existing systems mainly deploy two kinds of partitioning strategies, i.e., random partitioning and hash partitioning. The random partitioning strategy partitions one data stream uniformly while broadcasting all the tuples of the other stream. This simple strategy may incur many unnecessary computations for low-selectivity stream joins. The hash partitioning strategy maps the tuples of both streams according to their join attributes. However, hash partitioning suffers from a serious load imbalance problem caused by skewed attribute distributions, which are common in real-world data. The skewed load can seriously degrade system performance. In this paper, we carefully model the load skewness problem in distributed join systems. We identify the key tuples that lead to heavy load skewness and propose an efficient key selection algorithm, GreedyFit, to find them. We design a lightweight tuple migration strategy to solve the load imbalance problem in real time and implement a new distributed stream join system, FastJoin. Experimental results using real-world data show that FastJoin significantly improves system performance in terms of throughput and latency compared to state-of-the-art stream join systems.
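A minimal way to see the skew problem and one routing remedy is sketched below: keys that dominate a sample of the stream are detected as hot and spread over several workers instead of being hashed to a single one (for a join, the matching tuples of the other stream would then have to be replicated to those workers). The threshold, the round-robin routing, and all names are illustrative assumptions; GreedyFit and FastJoin's migration strategy are more involved.
```python
from collections import Counter
import itertools

# Toy skew-aware partitioner: hot keys detected from a sample are spread over
# several workers, so one overloaded hash bucket does not stall the whole join.

def detect_hot_keys(sample, num_workers, threshold=2.0):
    counts = Counter(k for k, _ in sample)
    avg = len(sample) / num_workers
    return {k for k, c in counts.items() if c > threshold * avg}

def route(tuple_, hot_keys, num_workers, rr=itertools.count()):
    # rr is a module-lifetime round-robin counter shared across calls.
    key, _ = tuple_
    if key in hot_keys:
        return next(rr) % num_workers        # spread the hot key round-robin
    return hash(key) % num_workers           # ordinary hash partitioning

if __name__ == "__main__":
    stream = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]
    hot = detect_hot_keys(stream[:200], num_workers=4)
    loads = Counter(route(t, hot, 4) for t in stream)
    print("hot keys:", hot, "per-worker load:", dict(loads))
```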
{"title":"FastJoin: A Skewness-Aware Distributed Stream Join System","authors":"Shunjie Zhou, Fan Zhang, Hanhua Chen, Hai Jin, B. Zhou","doi":"10.1109/IPDPS.2019.00111","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00111","url":null,"abstract":"In the bigdata era, many applications are required to perform quick and accurate join operations on large-scale real-time data streams, such as stock trading and online advertisement analysis. To achieve high throughput and low latency, distributed stream join systems explore efficient stream partitioning strategies to execute the complex stream join procedure in parallel. Existing systems mainly deploy two kinds of partitioning strategies, i.e., random partitioning and hash partitioning. Random partitioning strategy partitions one data stream uniformly while broadcasting all the tuples of the other data stream. This simple strategy may incur lots of unnecessary computations for low-selectivity stream join. Hash partitioning strategy maps all the tuples of the two data streams according to their attributes for joining. However, hash partitioning strategy suffers from a serious load imbalance problem caused by the skew distribution of the attributes, which is common in real-world data. The skewed load may seriously affect the system performance. In this paper, we carefully model the load skewness problem in distributed join systems. We explore the key tuples which lead to the heavy load skewness, and propose an efficient key selection algorithm, GreedyFit to find out these key tuples. We design a lightweight tuple migration strategy to solve the load imbalance problem in real-time and implement a new distributed stream join system, FastJoin. Experimental results using real-world data show that FastJoin can significantly improve the system performance in terms of throughput and latency compared to the state-of-the-art stream join systems.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128520962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incremental Graph Processing for On-line Analytics
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00108
Scott Sallinen, R. Pearce, M. Ripeanu
Modern data generation is enormous; we now capture events at increasingly fine granularity and require processing at rates approaching real time. For graph analytics, this explosion in data volumes and processing demands has not been matched by improved algorithmic or infrastructure techniques. Instead of exploring solutions that keep up with the velocity of the generated data, most of today's systems focus on analyzing individually built historic snapshots. Modern graph analytics pipelines must evolve to become viable at massive scale and move away from static, post-processing scenarios to support on-line analysis. This paper presents our progress towards a system that analyzes dynamic incremental graphs and is responsive at single-change granularity. We present an algorithmic structure based on the principles of recursive updates and monotonic convergence, and a set of incremental graph algorithms that can be implemented on top of this structure. We also present the middleware required to support graph analytics at fine, event-level granularity. We envision that graph topology changes are processed asynchronously, concurrently, and independently (without shared state), converging an algorithm's state (e.g., single-source shortest-path distances, connectivity analysis labeling) to its deterministic answer. The expected long-term impact of this work is to enable a transition away from offline graph analytics, allowing knowledge to be extracted from networked systems in real time.
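The flavor of a recursively updating, monotonically converging algorithm can be seen in the small single-source shortest-path sketch below: when an edge is inserted, only vertices whose distance can improve are re-relaxed, so the state converges to the same answer a from-scratch computation would produce. This is our own illustration, not the paper's system.
```python
import heapq

# Incremental SSSP update with monotonic relaxation: distances only decrease,
# and only the affected region of the graph is touched on each edge insertion.

def insert_edge(graph, dist, u, v, w):
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, [])
    if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
        dist[v] = dist[u] + w
        heap = [(dist[v], v)]                 # propagate the improvement outward
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist[x]:
                continue                      # stale heap entry
            for y, wy in graph[x]:
                if d + wy < dist.get(y, float("inf")):
                    dist[y] = d + wy
                    heapq.heappush(heap, (dist[y], y))

if __name__ == "__main__":
    graph = {"s": []}
    dist = {"s": 0.0}
    for u, v, w in [("s", "a", 4), ("a", "b", 3), ("s", "b", 10), ("s", "b", 5)]:
        insert_edge(graph, dist, u, v, w)
        print(f"after +({u},{v},{w}):", dist)
```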
{"title":"Incremental Graph Processing for On-line Analytics","authors":"Scott Sallinen, R. Pearce, M. Ripeanu","doi":"10.1109/IPDPS.2019.00108","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00108","url":null,"abstract":"Modern data generation is enormous; we now capture events at increasingly fine granularity, and require processing at rates approaching real-time. For graph analytics, this explosion in data volumes and processing demands has not been matched by improved algorithmic or infrastructure techniques. Instead of exploring solutions to keep up with the velocity of the generated data, most of today's systems focus on analyzing individually built historic snapshots. Modern graph analytics pipelines must evolve to become viable at massive scale, and move away from static, post-processing scenarios to support on-line analysis. This paper presents our progress towards a system that analyzes dynamic incremental graphs, responsive at single-change granularity. We present an algorithmic structure using principles of recursive updates and monotonic convergence, and a set of incremental graph algorithms that can be implemented based on this structure. We also present the required middleware to support graph analytics at fine, event-level granularity. We envision that graph topology changes are processed asynchronously, concurrently, and independently (without shared state), converging an algorithm's state (e.g. single-source shortest path distances, connectivity analysis labeling) to its deterministic answer. The expected long-term impact of this work is to enable a transition away from offline graph analytics, allowing knowledge to be extracted from networked systems in real-time.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131601565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecting Racetrack Memory Preshift through Pattern-Based Prediction Mechanisms
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00037
Adrian Colaso, P. Prieto, Pablo Abad Fidalgo, J. Gregorio, Valentin Puente
Racetrack Memories (RM) are a promising spintronic technology able to provide multi-bit storage in a single, tape-like cell through a ferromagnetic nanowire with multiple domains. This technology offers superior density, non-volatility, and low static power compared to CMOS memories. These features have attracted great interest in adopting RM as a replacement for RAM technologies, from main memory (DRAM) to, possibly, the on-chip cache hierarchy (SRAM). One of the main drawbacks of this technology is the serialized access to the bits stored in each domain, resulting in unpredictable access times. An appropriate header management policy can potentially reduce the number of shift operations required to access the correct position. Simple policies, such as leaving the read/write head on the last domain accessed (or on the next one), provide enough improvement when data accesses exhibit a certain level of locality. However, in cases with much lower locality, a more accurate header management policy is desirable. In this paper, we explore the use of hardware prefetching policies to implement the header management policy. By predicting the length and direction of the next displacement, it is possible to reduce shift operations and thereby improve memory access time. The results of our experiments show that, with an appropriate header management policy, our proposal reduces the average shift latency by up to 50% in the L2 and LLC, improving average memory access time by up to 10%.
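A minimal version of the preshift idea is sketched below: a per-track predictor records the stride between consecutive accesses and speculatively moves the head to the predicted next domain, so a regular access pattern pays almost no shifts. The predictor, the access pattern, and all names are our own assumptions; the paper's pattern-based prediction mechanisms are more elaborate.
```python
# Toy stride-based preshift predictor for a racetrack-like tape of domains:
# after each access, the head is speculatively moved to the predicted next
# domain so that the following access pays fewer shift operations.

class PreshiftTrack:
    def __init__(self, num_domains):
        self.num_domains = num_domains
        self.head = 0
        self.last_access = 0
        self.stride = 0
        self.total_shifts = 0

    def access(self, domain):
        self.total_shifts += abs(domain - self.head)   # shifts paid by this access
        self.head = domain
        self.stride = domain - self.last_access
        self.last_access = domain
        # Preshift: move the head to the predicted next domain ahead of time.
        predicted = min(max(domain + self.stride, 0), self.num_domains - 1)
        self.head = predicted

if __name__ == "__main__":
    pattern = [0, 4, 8, 12, 16, 20, 24, 28]            # regular stride of 4
    for name, preshift in (("no preshift", False), ("preshift", True)):
        t = PreshiftTrack(64)
        for d in pattern:
            t.access(d)
            if not preshift:
                t.head = d                              # undo the speculative move
        print(name, "shifts:", t.total_shifts)
```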
{"title":"Architecting Racetrack Memory Preshift through Pattern-Based Prediction Mechanisms","authors":"Adrian Colaso, P. Prieto, Pablo Abad Fidalgo, J. Gregorio, Valentin Puente","doi":"10.1109/IPDPS.2019.00037","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00037","url":null,"abstract":"Racetrack Memories (RM) are a promising spintronic technology able to provide multi-bit storage in a single cell (tape-like) through a ferromagnetic nanowire with multiple domains. This technology offers superior density, non-volatility and low static power compared to CMOS memories. These features have attracted great interest in the adoption of RM as a replacement of RAM technology, from Main memory (DRAM) to maybe on-chip cache hierarchy (SRAM). One of the main drawbacks of this technology is the serialized access to the bits stored in each domain, resulting in unpredictable access time. An appropriate header management policy can potentially reduce the number of shift operations required to access the correct position. Simple policies such as leaving read/write head on the last domain accessed (or on the next) provide enough improvement in the presence of a certain level of locality on data access. However, in those cases with much lower locality, a more accurate behavior from the header management policy would be desirable. In this paper, we explore the utilization of hardware prefetching policies to implement the header management policy. \"Predicting\" the length and direction of the next displacement, it is possible to reduce shift operations, improving memory access time. The results of our experiments show that, with an appropriate header, our proposal reduces average shift latency by up to 50% in L2 and LLC, improving average memory access time by up to 10%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115620789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Placement of In-memory Checkpoints Under Heterogeneous Failure Likelihoods
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00098
Zaeem Hussain, T. Znati, R. Melhem
In-memory checkpointing has grown in popularity over the years because it significantly reduces the time needed to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of a catastrophic failure, i.e., the failure of a node together with the node containing its checkpoint, is minimized. Using failure logs spanning 5 years from a 49,152-node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures compared with placement schemes that are oblivious to the heterogeneity in node failure likelihoods.
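The trade-off the paper studies can be illustrated with a toy model: nodes are paired as mutual checkpoint buddies, failures are independent, and a catastrophic event is the failure of both nodes of a pair within the same window. The sorted pairing below (most failure-prone with most reliable) is only an intuitive heuristic for this simplified model and is not claimed to be the paper's proven-optimal placement.
```python
import random

# Toy model of checkpoint-buddy placement under heterogeneous failure probabilities.

def catastrophic_probability(pairs, p):
    """P(at least one pair loses both nodes), assuming independent failures."""
    prob_ok = 1.0
    for i, j in pairs:
        prob_ok *= 1.0 - p[i] * p[j]
    return 1.0 - prob_ok

def random_pairing(nodes):
    nodes = nodes[:]
    random.shuffle(nodes)
    return list(zip(nodes[0::2], nodes[1::2]))

def sorted_pairing(nodes, p):
    # Pair the most failure-prone node with the most reliable one, and so on.
    by_risk = sorted(nodes, key=lambda n: p[n])
    return list(zip(by_risk, reversed(by_risk)))[: len(nodes) // 2]

if __name__ == "__main__":
    random.seed(1)
    p = {n: random.uniform(1e-4, 5e-2) for n in range(1000)}  # per-window failure prob
    nodes = list(p)
    print("random pairing:", catastrophic_probability(random_pairing(nodes), p))
    print("sorted pairing:", catastrophic_probability(sorted_pairing(nodes, p), p))
```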
{"title":"Optimal Placement of In-memory Checkpoints Under Heterogeneous Failure Likelihoods","authors":"Zaeem Hussain, T. Znati, R. Melhem","doi":"10.1109/IPDPS.2019.00098","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00098","url":null,"abstract":"In-memory checkpointing has increased in popularity over the years because it significantly improves the time to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of occurrence of a catastrophic failure, i.e. failure of a node as well as the node containing its checkpoint, is minimized. Using the failure logs spread over 5 years of a 49,152 node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, and are guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures when compared with placement schemes that are oblivious of the heterogeneity in nodes based on their failure likelihoods.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124746958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}