Northup: Divide-and-Conquer Programming in Systems with Heterogeneous Memories and Processors
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00043
Shuai Che, Jieming Yin
In recent years we have seen rapid development on two frontiers: emerging memory technologies and accelerator architectures. Future memory systems are becoming deeper and more heterogeneous. Adopting NVM and die-stacked DRAM on each HPC node is an emerging trend, while GPUs and many-core processors are already widely deployed in today's supercomputers. However, software for programming and managing a system that combines heterogeneous memories and processors is still at an early stage of development. How to exploit such a deep memory hierarchy and heterogeneous processors with minimal programming effort is an important open problem. In this paper, we propose Northup, a programming and runtime framework that uses a divide-and-conquer approach to map an application efficiently onto heterogeneous systems. The proposed solution provides a portable layer that abstracts the system architecture, offering the flexibility to easily integrate new memories and processor nodes. We show that Northup's out-of-core execution with SSD is, on average, only 17% slower than in-memory processing for the evaluated applications.
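To make the divide-and-conquer idea concrete, the sketch below (our own illustration, not Northup's API; all function and variable names are hypothetical) processes a dataset that is too large for fast memory by staging it from slower storage in chunks and combining per-chunk partial results, which is the basic pattern behind out-of-core execution with an SSD.
```python
import numpy as np
import tempfile, os

# Minimal out-of-core divide-and-conquer sketch (not Northup's actual API):
# a large array staged on SSD is processed in chunks that fit in fast memory,
# and per-chunk partial results are combined afterwards.

def process_chunk(chunk):
    # Stand-in for the compute kernel that would be offloaded to a GPU or many-core device.
    return float(np.sum(chunk * chunk))

def out_of_core_reduce(path, n_elems, chunk_elems):
    """Divide the file into chunks, conquer each chunk, combine the results."""
    partials = []
    with open(path, "rb") as f:
        for start in range(0, n_elems, chunk_elems):
            count = min(chunk_elems, n_elems - start)
            chunk = np.fromfile(f, dtype=np.float64, count=count)
            partials.append(process_chunk(chunk))
    return sum(partials)

if __name__ == "__main__":
    n = 1_000_000
    data = np.random.rand(n)
    with tempfile.NamedTemporaryFile(delete=False) as tmp:
        data.tofile(tmp.name)
    try:
        print(out_of_core_reduce(tmp.name, n, chunk_elems=100_000))
    finally:
        os.unlink(tmp.name)
```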
{"title":"Northup: Divide-and-Conquer Programming in Systems with Heterogeneous Memories and Processors","authors":"Shuai Che, Jieming Yin","doi":"10.1109/IPDPS.2019.00043","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00043","url":null,"abstract":"In recent years we have seen rapid development in both frontiers of emerging memory technologies and accelerator architectures. Future memory systems are becoming deeper and more heterogeneous. Adopting NVM and die-stacked DRAM on each HPC node is a new trend of development. On the other hand, GPUs and many-core processors have been widely deployed in today's supercomputers. However, software for programming and managing a system that consists of heterogeneous memories and processors is still in its very early stage of development. How to exploit such deep memory hierarchy and heterogeneous processors with minimal programming effort is an important issue to address. In this paper, we propose Northup, a programming and runtime framework, using a divide-and-conquer approach to map an application efficiently to heterogeneous systems. The proposed solution presents a portable layer that abstracts the system architecture, providing flexibility to support easy integration of new memories and processor nodes. We show that Northup out-of-core execution with SSD is only an average of 17% slower than in-memory processing for the evaluated applications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126116018","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous?
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00113
Yujing Ma, Florin Rusu, Martin Torres
There is increased interest, in both industry and academia, in building data analytics frameworks with advanced algebraic capabilities. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors: as multi-threaded routines for multi-core CPUs and as highly parallel kernels executed on GPUs. Stochastic gradient descent (SGD), the most popular optimization method for model training, is implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well known, there is an intense debate on which of the many SGD variants is better in practice. In this paper, we perform a comprehensive experimental study of parallel SGD for training machine learning models. We consider the impact of three factors (computing architecture: multi-core CPU or GPU; synchronous or asynchronous model updates; and data sparsity) on three measures: hardware efficiency, statistical efficiency, and time to convergence. We draw several interesting findings from our experiments with logistic regression (LR), support vector machines (SVM), and deep neural networks (MLP) on five real datasets. As expected, the GPU always outperforms the parallel CPU for synchronous SGD. The gap is, however, only 2-5X for simple models, and below 7X even for fully connected deep networks. For asynchronous SGD, the CPU is undoubtedly the optimal solution, outperforming the GPU in time to convergence even when the GPU has a speedup of 10X or more. The choice between synchronous GPU and asynchronous CPU is not straightforward and depends on the task and the characteristics of the data. Thus, the CPU should not be easily discarded for machine learning workloads. We hope that our insights provide a useful guide for applying parallel SGD in practice and, more importantly, for choosing the appropriate computing architecture.
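As a reference point for the two update strategies compared in the paper, the toy sketch below contrasts synchronous mini-batch SGD with asynchronous, Hogwild-style SGD for logistic regression. It is a minimal NumPy illustration under our own simplifying assumptions (a shared weight vector updated without locks by Python threads), not the authors' implementation.
```python
import numpy as np
from threading import Thread

def grad(w, X, y):
    # Logistic-regression gradient: X^T (sigmoid(Xw) - y) / n.
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y) / len(y)

def sync_sgd(X, y, lr=0.1, epochs=50, batch=64):
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        idx = np.random.permutation(len(y))
        for s in range(0, len(y), batch):
            b = idx[s:s + batch]
            w -= lr * grad(w, X[b], y[b])        # one global update per mini-batch
    return w

def async_sgd(X, y, lr=0.1, epochs=50, workers=4):
    w = np.zeros(X.shape[1])                     # shared state, updated without locks
    def worker():
        for _ in range(epochs * len(y) // workers):
            i = np.random.randint(len(y))
            w[:] -= lr * grad(w, X[i:i + 1], y[i:i + 1])   # in-place, Hogwild-style
    threads = [Thread(target=worker) for _ in range(workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return w

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = rng.normal(size=(2000, 10))
    y = (X @ rng.normal(size=10) > 0).astype(float)
    for name, fn in [("sync", sync_sgd), ("async", async_sgd)]:
        w = fn(X, y)
        print(name, "accuracy:", round(float(np.mean(((X @ w) > 0) == y)), 3))
```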
{"title":"Stochastic Gradient Descent on Modern Hardware: Multi-core CPU or GPU? Synchronous or Asynchronous?","authors":"Yujing Ma, Florin Rusu, Martin Torres","doi":"10.1109/IPDPS.2019.00113","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00113","url":null,"abstract":"There is an increased interest in building data analytics frameworks with advanced algebraic capabilities both in industry and academia. Many of these frameworks, e.g., TensorFlow, implement their compute-intensive primitives in two flavors—as multi-thread routines for multi-core CPUs and as highly-parallel kernels executed on GPU. Stochastic gradient descent (SGD) is the most popular optimization method for model training implemented extensively on modern data analytics platforms. While the data-intensive properties of SGD are well-known, there is an intense debate on which of the many SGD variants is better in practice. In this paper, we perform a comprehensive experimental study of parallel SGD for training machine learning models. We consider the impact of three factors – computing architecture (multi-core CPU or GPU), synchronous or asynchronous model updates, and data sparsity – on three measures—hardware efficiency, statistical efficiency, and time to convergence. We draw several interesting findings from our experiments with logistic regression (LR), support vector machines (SVM), and deep neural nets (MLP) on five real datasets. As expected, GPU always outperforms parallel CPU for synchronous SGD. The gap is, however, only 2-5X for simple models, and below 7X even for fully-connected deep nets. For asynchronous SGD, CPU is undoubtedly the optimal solution, outperforming GPU in time to convergence even when the GPU has a speedup of 10X or more. The choice between synchronous GPU and asynchronous CPU is not straightforward and depends on the task and the characteristics of the data. Thus, CPU should not be easily discarded for machine learning workloads. We hope that our insights provide a useful guide for applying parallel SGD in practice and – more importantly – choosing the appropriate computing architecture.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"101 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121874871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SunwayLB: Enabling Extreme-Scale Lattice Boltzmann Method Based Computing Fluid Dynamics Simulations on Sunway TaihuLight
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00065
Zhao Liu, Xuesen Chu, Xiaojing Lv, Hongsong Meng, Shupeng Shi, Wenji Han, Jingheng Xu, H. Fu, Guangwen Yang
The Lattice Boltzmann Method (LBM) is a relatively new class of Computational Fluid Dynamics methods. In this paper, we report our work on SunwayLB, which enables LBM-based solutions aimed at industrial applications. We propose several techniques to boost the simulation speed and improve the scalability of SunwayLB, including a customized multi-level domain decomposition and data-sharing scheme, a carefully orchestrated strategy to fuse kernels with different performance constraints for a more balanced workload, and optimization strategies for assembly code, which together bring up to a 137x speedup. Based on these optimization schemes, we manage to perform the largest direct numerical simulation, involving up to 5.6 trillion lattice cells, achieving 11,245 billion cell updates per second (GLUPS), 77% memory bandwidth utilization, and a sustained performance of 4.7 PFlops. We also demonstrate a series of computational experiments on extremely large-scale fluid flows, as examples of real-world applications, to validate the correctness and performance of our work. The results show that SunwayLB is a practical solution for industrial applications.
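For readers unfamiliar with LBM, the following sketch shows a standard D2Q9 stream-collide step with BGK collision, which is the per-cell work that SunwayLB scales to trillions of lattice cells; it is a textbook-style illustration and makes no attempt to reflect SunwayLB's actual kernels or optimizations.
```python
import numpy as np

# D2Q9 lattice: 9 discrete velocities and their standard weights.
C = np.array([[0, 0], [1, 0], [0, 1], [-1, 0], [0, -1],
              [1, 1], [-1, 1], [-1, -1], [1, -1]])
W = np.array([4/9] + [1/9] * 4 + [1/36] * 4)

def equilibrium(rho, ux, uy):
    cu = 3.0 * (C[:, 0, None, None] * ux + C[:, 1, None, None] * uy)
    usq = 1.5 * (ux**2 + uy**2)
    return W[:, None, None] * rho * (1.0 + cu + 0.5 * cu**2 - usq)

def stream_collide(f, tau=0.6):
    # Streaming: shift each distribution along its lattice velocity (periodic domain).
    for i, (cx, cy) in enumerate(C):
        f[i] = np.roll(np.roll(f[i], cx, axis=0), cy, axis=1)
    # Collision: relax towards the local equilibrium (BGK operator).
    rho = f.sum(axis=0)
    ux = (f * C[:, 0, None, None]).sum(axis=0) / rho
    uy = (f * C[:, 1, None, None]).sum(axis=0) / rho
    f += (equilibrium(rho, ux, uy) - f) / tau
    return f

if __name__ == "__main__":
    nx, ny = 64, 64
    f = equilibrium(np.ones((nx, ny)), np.zeros((nx, ny)), np.zeros((nx, ny)))
    for _ in range(100):
        f = stream_collide(f)
    print("mass conserved:", np.isclose(f.sum(), nx * ny))
```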
{"title":"SunwayLB: Enabling Extreme-Scale Lattice Boltzmann Method Based Computing Fluid Dynamics Simulations on Sunway TaihuLight","authors":"Zhao Liu, Xuesen Chu, Xiaojing Lv, Hongsong Meng, Shupeng Shi, Wenji Han, Jingheng Xu, H. Fu, Guangwen Yang","doi":"10.1109/IPDPS.2019.00065","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00065","url":null,"abstract":"The Lattice Boltzmann Method (LBM) is a relatively new class of Computational Fluid Dynamics methods. In this paper, we report our work on SunwayLB, which enables LBM based solutions aiming for industrial applications. We propose several techniques to boost the simulation speed and improve the scalability of SunwayLB, including a customized multi-level domain decomposition and data sharing scheme, a carefully orchestrated strategy to fuse kernels with different performance constraints for a more balanced workload, and optimization strategies for assembly code, which bring up to 137x speedup. Based on these optimization schemes, we manage to perform the largest direct numerical simulation which involves up to 5.6 trillion lattice cells, achieving 11,245 billion cell updates per second (GLUPS), 77% memory bandwidth utilization and a sustained performance of 4.7 PFlops. We also demonstrate a series of computational experiments for extreme-large scale fluid flow, as examples of real-world applications, to check the validity and performance of our work. The results show that SunwayLB is competent for a practical solution for industrial applications.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130571202","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
NCQ-Aware I/O Scheduling for Conventional Solid State Drives
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00062
Haoqiang Fan, Song Wu, Shadi Ibrahim, Ximing Chen, Hai Jin, Jiang Xiao, Haibing Guan
While current fairness-driven I/O schedulers are successful in allocating an equal time/resource share to concurrent workloads, they ignore the I/O request queueing and reordering in the storage device layer, such as Native Command Queueing (NCQ). As a result, requests from different workloads do not have an equal chance to enter the NCQ (an NCQ conflict), and fairness is violated. We address this issue by providing the first systematic empirical analysis of how NCQ affects I/O fairness and SSD utilization, and accordingly propose an NCQ-aware I/O scheduling scheme, NASS. The basic idea of NASS is to carefully control the request dispatch of workloads to relieve NCQ conflicts and improve NCQ utilization. NASS builds on two core components: an evaluation model to quantify important features of a workload, and a dispatch control algorithm to set the appropriate request dispatch for running workloads. We integrate NASS into four state-of-the-art I/O schedulers and evaluate its effectiveness using widely used benchmarks and real-world applications. The results show that with NASS, I/O schedulers achieve 11-23% better fairness while improving device utilization by 9-29%.
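The sketch below illustrates, in a very simplified form, the kind of dispatch control NASS is built around: each backlogged workload receives a share of the device-queue budget per dispatch window, so a heavy workload cannot monopolize the NCQ. The class, the budget rule, and all names are our own assumptions, not the NASS algorithm.
```python
from collections import deque

# Toy dispatch-control loop in the spirit of NCQ-aware scheduling (not NASS itself):
# per dispatch window, the device-queue budget is split across backlogged workloads.

class DispatchController:
    def __init__(self, ncq_depth, workloads):
        self.ncq_depth = ncq_depth
        self.queues = {w: deque() for w in workloads}

    def submit(self, workload, request):
        self.queues[workload].append(request)

    def dispatch_window(self):
        """Pick up to ncq_depth requests, splitting the budget across workloads."""
        backlogged = [w for w, q in self.queues.items() if q]
        if not backlogged:
            return []
        budget = max(1, self.ncq_depth // len(backlogged))
        window = []
        for w in backlogged:
            for _ in range(min(budget, len(self.queues[w]))):
                window.append((w, self.queues[w].popleft()))
        return window[:self.ncq_depth]

if __name__ == "__main__":
    ctrl = DispatchController(ncq_depth=32, workloads=["A", "B"])
    for i in range(100):
        ctrl.submit("A", f"A-req{i}")        # heavy workload
    for i in range(10):
        ctrl.submit("B", f"B-req{i}")        # light workload
    window = ctrl.dispatch_window()
    print({w: sum(1 for ww, _ in window if ww == w) for w in ("A", "B")})
```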
{"title":"NCQ-Aware I/O Scheduling for Conventional Solid State Drives","authors":"Haoqiang Fan, Song Wu, Shadi Ibrahim, Ximing Chen, Hai Jin, Jiang Xiao, Haibing Guan","doi":"10.1109/IPDPS.2019.00062","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00062","url":null,"abstract":"While current fairness-driven I/O schedulers are successful in allocating equal time/resource share to concurrent workloads, they ignore the I/O request queueing or reordering in storage device layer, such as Native Command Queueing (NCQ). As a result, requests of different workloads cannot have an equal chance to enter NCQ (NCQ conflict) and fairness is violated. We address this issue by providing the first systematic empirical analysis on how NCQ affects I/O fairness and SSD utilization and accordingly proposing a NCQ-aware I/O scheduling scheme, NASS. The basic idea of NASS is to elaborately control the request dispatch of workloads to relieve NCQ conflict and improve NCQ utilization. NASS builds on two core components: an evaluation model to quantify important features of the workload, and a dispatch control algorithm to set the appropriate request dispatch of running workloads. We integrate NASS into four state-of-the-art I/O schedulers and evaluate its effectiveness using widely used benchmarks and real world applications. Results show that with NASS, I/O schedulers can achieve 11-23% better fairness and at the same time improve device utilization by 9-29%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131743459","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Approach for Parallel Loading and Pre-Processing of Unstructured Meshes Stored in Spatially Scattered Fashion
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00084
Ondrej Meca, L. Ríha, T. Brzobohatý
This paper presents a workflow for the parallel loading of database files containing sequentially stored unstructured meshes, a format generally not amenable to efficient parallel reading. In such a file, consecutive elements are not spatially co-located, and their respective nodes sit at unknown positions in the file. This makes parallel loading challenging, since adjacent elements end up on different MPI processes and their respective nodes on unknown MPI processes. These two facts lead to high communication overhead and very poor scalability if not addressed properly. In the standard approach, a sequentially stored mesh is sequentially converted to a particular parallel format accepted by a solver, which represents a significant bottleneck. Our proposed algorithm demonstrates that this bottleneck can be overcome, since it is able to (i) efficiently recreate an arbitrarily stored sequential mesh in the distributed memory of a supercomputer without gathering the information onto a single MPI rank, and (ii) prepare the mesh for massively parallel solvers.
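The essence of such a workflow can be illustrated with a small mpi4py sketch: every rank first reads a contiguous block of the file, then ranks exchange node requests all-to-all so that no single rank ever gathers the whole mesh. The file layout, data structures, and names below are illustrative assumptions, not the paper's implementation.
```python
# Two-phase parallel mesh loading sketch (illustrative; not the paper's algorithm):
# phase 1, every rank reads a contiguous block of elements; phase 2, ranks exchange
# node requests all-to-all so the mesh is never gathered on a single rank.
from mpi4py import MPI
import numpy as np

comm = MPI.COMM_WORLD
rank, size = comm.Get_rank(), comm.Get_size()

# Toy "file": after the block read, node i is owned by rank i // nodes_per_rank.
nodes_per_rank = 1000
owned = np.arange(rank * nodes_per_rank, (rank + 1) * nodes_per_rank)
coords = {int(n): (float(n), 0.0, 0.0) for n in owned}      # node id -> xyz

# Phase 1: this rank's block of elements references nodes scattered anywhere.
rng = np.random.default_rng(rank)
needed = rng.integers(0, size * nodes_per_rank, 64)

# Phase 2: bucket the requests by owning rank and exchange them all-to-all.
requests = [[int(n) for n in needed if n // nodes_per_rank == r] for r in range(size)]
incoming = comm.alltoall(requests)                 # who asks this rank for what
replies = [[coords[n] for n in req] for req in incoming]
answers = comm.alltoall(replies)                   # coordinates come back

local_coords = {n: xyz for req, ans in zip(requests, answers) for n, xyz in zip(req, ans)}
if rank == 0:
    print(f"rank 0 resolved {len(local_coords)} node coordinates without a global gather")
```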
{"title":"An Approach for Parallel Loading and Pre-Processing of Unstructured Meshes Stored in Spatially Scattered Fashion","authors":"Ondrej Meca, L. Ríha, T. Brzobohatý","doi":"10.1109/IPDPS.2019.00084","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00084","url":null,"abstract":"This paper presents a workflow for parallel loading of database files containing sequentially stored unstructured meshes that are not considered to be efficiently read in parallel. In such a file consecutive elements are not spatially located and their respective nodes are at unknown positions in the file. This makes parallel loading challenging since adjacent elements are on different MPI processes, and their respective nodes are on unknown MPI processes. These two facts lead to a high communication overhead and very poor scalability if not addressed properly. In a standard approach, a sequentially stored mesh is sequentially converted to a particular parallel format accepted by a solver. This represents a significant bottleneck. Our proposed algorithm demonstrates that this bottleneck can be overcome, since it is able to (i) efficiently recreate an arbitrary stored sequential mesh in the distributed memory of a supercomputer without gathering the information into a single MPI rank, and (ii) prepare the mesh for massively parallel solvers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133271016","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Aladdin: Optimized Maximum Flow Management for Shared Production Clusters
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00078
Heng Wu, Wen-bo Zhang, Yuanjia Xu, Hao Xiang, Tao Huang, Haiyang Ding, Zhenguo Zhang
The rise in popularity of long-lived applications (LLAs), such as deep learning and latency-sensitive online Web services, has brought new challenges for cluster schedulers in shared production environments. Scheduling LLAs requires support for complex placement constraints (e.g., running multiple containers of an application on different machines) and larger degrees of parallelism to enable global optimization. However, existing schedulers usually suffer from severe constraint violations, high latency, and low resource efficiency. This paper describes Aladdin, a novel cluster scheduler that maximizes resource efficiency while avoiding constraint violations: (i) it proposes a multidimensional and nonlinear capacity function to support constraint expressions; (ii) it applies an optimized maximum flow algorithm to improve resource efficiency. Experiments with an Alibaba workload trace from a 10,000-machine cluster show that Aladdin reduces violated constraints by as much as 20%. Meanwhile, it improves resource efficiency by 50% compared with state-of-the-art schedulers.
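To illustrate how placement can be cast as a flow problem, the sketch below builds a small flow network with networkx: containers draw unit flow from a source, edges to machines encode which placements the constraints allow, and machine capacities bound free slots. The graph shape and capacities are our own simplification, not Aladdin's capacity function or optimized max-flow algorithm.
```python
import networkx as nx

# Minimal flow-network formulation of constrained container placement
# (our own illustration, not Aladdin's exact model).

def place_containers(containers, machines, allowed, free_slots):
    G = nx.DiGraph()
    for c in containers:
        G.add_edge("src", c, capacity=1)             # each container is placed once
    for c, ms in allowed.items():
        for m in ms:                                 # constraints (e.g., anti-affinity)
            G.add_edge(c, m, capacity=1)             # are encoded by which edges exist
    for m in machines:
        G.add_edge(m, "sink", capacity=free_slots[m])
    value, flow = nx.maximum_flow(G, "src", "sink")
    placement = {c: m for c in containers for m, f in flow[c].items() if f > 0}
    return value, placement

if __name__ == "__main__":
    containers = ["app1-c1", "app1-c2", "app2-c1"]
    machines = ["m1", "m2"]
    # app1's two containers must land on different machines (anti-affinity).
    allowed = {"app1-c1": ["m1"], "app1-c2": ["m2"], "app2-c1": ["m1", "m2"]}
    free = {"m1": 2, "m2": 1}
    placed, plan = place_containers(containers, machines, allowed, free)
    # All 3 containers fit: app2-c1 is forced onto m1 because m2's slot goes to app1-c2.
    print(placed, plan)
```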
{"title":"Aladdin: Optimized Maximum Flow Management for Shared Production Clusters","authors":"Heng Wu, Wen-bo Zhang, Yuanjia Xu, Hao Xiang, Tao Huang, Haiyang Ding, Zhenguo Zhang","doi":"10.1109/IPDPS.2019.00078","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00078","url":null,"abstract":"The rise in popularity of long-lived applications (LLAs), such as deep learning and latency-sensitive online Web services, has brought new challenges for cluster schedulers in shared production environments. Scheduling LLAs needs to support complex placement constraints (e.g., to run multiple containers of an application on different machines) and larger degrees of parallelism to provide global optimization. But existing schedulers usually suffer severe constraint violations, high latency and low resource efficiency. This paper describes Aladdin, a novel cluster scheduler that can maximize resource efficiency while avoiding constraint violations: (i) it proposes a multidimensional and nonlinear capacity function to support constraint expressions; (ii) it applies an optimized maximum flow algorithm to improve resource efficiency. Experiments with an Alibaba workload trace from a 10,000-machine cluster show that Aladdin can reduce violated constraints by as mush as 20%. Meanwhile, it improves resource efficiency by 50% compared with state-of-the-art schedulers.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127854326","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
FastJoin: A Skewness-Aware Distributed Stream Join System
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00111
Shunjie Zhou, Fan Zhang, Hanhua Chen, Hai Jin, B. Zhou
In the big data era, many applications, such as stock trading and online advertisement analysis, need to perform fast and accurate join operations on large-scale real-time data streams. To achieve high throughput and low latency, distributed stream join systems explore efficient stream partitioning strategies to execute the complex stream join procedure in parallel. Existing systems mainly deploy two kinds of partitioning strategies, i.e., random partitioning and hash partitioning. The random partitioning strategy partitions one data stream uniformly while broadcasting all the tuples of the other stream. This simple strategy may incur many unnecessary computations for low-selectivity stream joins. The hash partitioning strategy maps the tuples of both streams according to their join attributes. However, hash partitioning suffers from a serious load imbalance problem caused by skewed attribute distributions, which are common in real-world data. The skewed load can seriously degrade system performance. In this paper, we carefully model the load skewness problem in distributed join systems. We identify the key tuples that lead to heavy load skewness and propose an efficient key selection algorithm, GreedyFit, to find them. We design a lightweight tuple migration strategy to solve the load imbalance problem in real time and implement a new distributed stream join system, FastJoin. Experimental results using real-world data show that FastJoin significantly improves system performance in terms of throughput and latency compared to state-of-the-art stream join systems.
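A minimal way to see the skew problem and one routing remedy is sketched below: keys that dominate a sample of the stream are detected as hot and spread over several workers instead of being hashed to a single one (for a join, the matching tuples of the other stream would then have to be replicated to those workers). The threshold, the round-robin routing, and all names are illustrative assumptions; GreedyFit and FastJoin's migration strategy are more involved.
```python
from collections import Counter
import itertools

# Toy skew-aware partitioner: hot keys detected from a sample are spread over
# several workers, so one overloaded hash bucket does not stall the whole join.

def detect_hot_keys(sample, num_workers, threshold=2.0):
    counts = Counter(k for k, _ in sample)
    avg = len(sample) / num_workers
    return {k for k, c in counts.items() if c > threshold * avg}

def route(tuple_, hot_keys, num_workers, rr=itertools.count()):
    # rr is a module-lifetime round-robin counter shared across calls.
    key, _ = tuple_
    if key in hot_keys:
        return next(rr) % num_workers        # spread the hot key round-robin
    return hash(key) % num_workers           # ordinary hash partitioning

if __name__ == "__main__":
    stream = [("hot", i) for i in range(900)] + [(f"k{i}", i) for i in range(100)]
    hot = detect_hot_keys(stream[:200], num_workers=4)
    loads = Counter(route(t, hot, 4) for t in stream)
    print("hot keys:", hot, "per-worker load:", dict(loads))
```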
{"title":"FastJoin: A Skewness-Aware Distributed Stream Join System","authors":"Shunjie Zhou, Fan Zhang, Hanhua Chen, Hai Jin, B. Zhou","doi":"10.1109/IPDPS.2019.00111","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00111","url":null,"abstract":"In the bigdata era, many applications are required to perform quick and accurate join operations on large-scale real-time data streams, such as stock trading and online advertisement analysis. To achieve high throughput and low latency, distributed stream join systems explore efficient stream partitioning strategies to execute the complex stream join procedure in parallel. Existing systems mainly deploy two kinds of partitioning strategies, i.e., random partitioning and hash partitioning. Random partitioning strategy partitions one data stream uniformly while broadcasting all the tuples of the other data stream. This simple strategy may incur lots of unnecessary computations for low-selectivity stream join. Hash partitioning strategy maps all the tuples of the two data streams according to their attributes for joining. However, hash partitioning strategy suffers from a serious load imbalance problem caused by the skew distribution of the attributes, which is common in real-world data. The skewed load may seriously affect the system performance. In this paper, we carefully model the load skewness problem in distributed join systems. We explore the key tuples which lead to the heavy load skewness, and propose an efficient key selection algorithm, GreedyFit to find out these key tuples. We design a lightweight tuple migration strategy to solve the load imbalance problem in real-time and implement a new distributed stream join system, FastJoin. Experimental results using real-world data show that FastJoin can significantly improve the system performance in terms of throughput and latency compared to the state-of-the-art stream join systems.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128520962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Incremental Graph Processing for On-line Analytics
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00108
Scott Sallinen, R. Pearce, M. Ripeanu
Modern data generation is enormous; we now capture events at increasingly fine granularity and require processing at rates approaching real time. For graph analytics, this explosion in data volumes and processing demands has not been matched by improved algorithmic or infrastructure techniques. Instead of exploring solutions that keep up with the velocity of the generated data, most of today's systems focus on analyzing individually built historic snapshots. Modern graph analytics pipelines must evolve to become viable at massive scale and move away from static, post-processing scenarios to support on-line analysis. This paper presents our progress towards a system that analyzes dynamic incremental graphs and is responsive at single-change granularity. We present an algorithmic structure based on the principles of recursive updates and monotonic convergence, and a set of incremental graph algorithms that can be implemented on top of this structure. We also present the middleware required to support graph analytics at fine, event-level granularity. We envision that graph topology changes are processed asynchronously, concurrently, and independently (without shared state), converging an algorithm's state (e.g., single-source shortest-path distances, connectivity analysis labeling) to its deterministic answer. The expected long-term impact of this work is to enable a transition away from offline graph analytics, allowing knowledge to be extracted from networked systems in real time.
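The flavor of a recursively updating, monotonically converging algorithm can be seen in the small single-source shortest-path sketch below: when an edge is inserted, only vertices whose distance can improve are re-relaxed, so the state converges to the same answer a from-scratch computation would produce. This is our own illustration, not the paper's system.
```python
import heapq

# Incremental SSSP update with monotonic relaxation: distances only decrease,
# and only the affected region of the graph is touched on each edge insertion.

def insert_edge(graph, dist, u, v, w):
    graph.setdefault(u, []).append((v, w))
    graph.setdefault(v, [])
    if dist.get(u, float("inf")) + w < dist.get(v, float("inf")):
        dist[v] = dist[u] + w
        heap = [(dist[v], v)]                 # propagate the improvement outward
        while heap:
            d, x = heapq.heappop(heap)
            if d > dist[x]:
                continue                      # stale heap entry
            for y, wy in graph[x]:
                if d + wy < dist.get(y, float("inf")):
                    dist[y] = d + wy
                    heapq.heappush(heap, (dist[y], y))

if __name__ == "__main__":
    graph = {"s": []}
    dist = {"s": 0.0}
    for u, v, w in [("s", "a", 4), ("a", "b", 3), ("s", "b", 10), ("s", "b", 5)]:
        insert_edge(graph, dist, u, v, w)
        print(f"after +({u},{v},{w}):", dist)
```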
{"title":"Incremental Graph Processing for On-line Analytics","authors":"Scott Sallinen, R. Pearce, M. Ripeanu","doi":"10.1109/IPDPS.2019.00108","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00108","url":null,"abstract":"Modern data generation is enormous; we now capture events at increasingly fine granularity, and require processing at rates approaching real-time. For graph analytics, this explosion in data volumes and processing demands has not been matched by improved algorithmic or infrastructure techniques. Instead of exploring solutions to keep up with the velocity of the generated data, most of today's systems focus on analyzing individually built historic snapshots. Modern graph analytics pipelines must evolve to become viable at massive scale, and move away from static, post-processing scenarios to support on-line analysis. This paper presents our progress towards a system that analyzes dynamic incremental graphs, responsive at single-change granularity. We present an algorithmic structure using principles of recursive updates and monotonic convergence, and a set of incremental graph algorithms that can be implemented based on this structure. We also present the required middleware to support graph analytics at fine, event-level granularity. We envision that graph topology changes are processed asynchronously, concurrently, and independently (without shared state), converging an algorithm's state (e.g. single-source shortest path distances, connectivity analysis labeling) to its deterministic answer. The expected long-term impact of this work is to enable a transition away from offline graph analytics, allowing knowledge to be extracted from networked systems in real-time.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"7 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131601565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Architecting Racetrack Memory Preshift through Pattern-Based Prediction Mechanisms
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00037
Adrian Colaso, P. Prieto, Pablo Abad Fidalgo, J. Gregorio, Valentin Puente
Racetrack Memories (RM) are a promising spintronic technology able to provide multi-bit storage in a single, tape-like cell through a ferromagnetic nanowire with multiple domains. This technology offers superior density, non-volatility, and low static power compared to CMOS memories. These features have attracted great interest in adopting RM as a replacement for RAM technologies, from main memory (DRAM) to, possibly, the on-chip cache hierarchy (SRAM). One of the main drawbacks of this technology is the serialized access to the bits stored in each domain, resulting in unpredictable access times. An appropriate header management policy can potentially reduce the number of shift operations required to access the correct position. Simple policies, such as leaving the read/write head on the last domain accessed (or on the next one), provide enough improvement when data accesses exhibit a certain level of locality. However, in cases with much lower locality, a more accurate header management policy is desirable. In this paper, we explore the use of hardware prefetching policies to implement the header management policy. By predicting the length and direction of the next displacement, it is possible to reduce shift operations and thereby improve memory access time. The results of our experiments show that, with an appropriate header management policy, our proposal reduces the average shift latency by up to 50% in the L2 and LLC, improving average memory access time by up to 10%.
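A minimal version of the preshift idea is sketched below: a per-track predictor records the stride between consecutive accesses and speculatively moves the head to the predicted next domain, so a regular access pattern pays almost no shifts. The predictor, the access pattern, and all names are our own assumptions; the paper's pattern-based prediction mechanisms are more elaborate.
```python
# Toy stride-based preshift predictor for a racetrack-like tape of domains:
# after each access, the head is speculatively moved to the predicted next
# domain so that the following access pays fewer shift operations.

class PreshiftTrack:
    def __init__(self, num_domains):
        self.num_domains = num_domains
        self.head = 0
        self.last_access = 0
        self.stride = 0
        self.total_shifts = 0

    def access(self, domain):
        self.total_shifts += abs(domain - self.head)   # shifts paid by this access
        self.head = domain
        self.stride = domain - self.last_access
        self.last_access = domain
        # Preshift: move the head to the predicted next domain ahead of time.
        predicted = min(max(domain + self.stride, 0), self.num_domains - 1)
        self.head = predicted

if __name__ == "__main__":
    pattern = [0, 4, 8, 12, 16, 20, 24, 28]            # regular stride of 4
    for name, preshift in (("no preshift", False), ("preshift", True)):
        t = PreshiftTrack(64)
        for d in pattern:
            t.access(d)
            if not preshift:
                t.head = d                              # undo the speculative move
        print(name, "shifts:", t.total_shifts)
```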
{"title":"Architecting Racetrack Memory Preshift through Pattern-Based Prediction Mechanisms","authors":"Adrian Colaso, P. Prieto, Pablo Abad Fidalgo, J. Gregorio, Valentin Puente","doi":"10.1109/IPDPS.2019.00037","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00037","url":null,"abstract":"Racetrack Memories (RM) are a promising spintronic technology able to provide multi-bit storage in a single cell (tape-like) through a ferromagnetic nanowire with multiple domains. This technology offers superior density, non-volatility and low static power compared to CMOS memories. These features have attracted great interest in the adoption of RM as a replacement of RAM technology, from Main memory (DRAM) to maybe on-chip cache hierarchy (SRAM). One of the main drawbacks of this technology is the serialized access to the bits stored in each domain, resulting in unpredictable access time. An appropriate header management policy can potentially reduce the number of shift operations required to access the correct position. Simple policies such as leaving read/write head on the last domain accessed (or on the next) provide enough improvement in the presence of a certain level of locality on data access. However, in those cases with much lower locality, a more accurate behavior from the header management policy would be desirable. In this paper, we explore the utilization of hardware prefetching policies to implement the header management policy. \"Predicting\" the length and direction of the next displacement, it is possible to reduce shift operations, improving memory access time. The results of our experiments show that, with an appropriate header, our proposal reduces average shift latency by up to 50% in L2 and LLC, improving average memory access time by up to 10%.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"135 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115620789","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal Placement of In-memory Checkpoints Under Heterogeneous Failure Likelihoods
Pub Date: 2019-05-20 | DOI: 10.1109/IPDPS.2019.00098
Zaeem Hussain, T. Znati, R. Melhem
In-memory checkpointing has grown in popularity over the years because it significantly reduces the time needed to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of a catastrophic failure, i.e., the failure of a node together with the node containing its checkpoint, is minimized. Using failure logs spanning 5 years from a 49,152-node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures compared with placement schemes that are oblivious to the heterogeneity in node failure likelihoods.
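The trade-off the paper studies can be illustrated with a toy model: nodes are paired as mutual checkpoint buddies, failures are independent, and a catastrophic event is the failure of both nodes of a pair within the same window. The sorted pairing below (most failure-prone with most reliable) is only an intuitive heuristic for this simplified model and is not claimed to be the paper's proven-optimal placement.
```python
import random

# Toy model of checkpoint-buddy placement under heterogeneous failure probabilities.

def catastrophic_probability(pairs, p):
    """P(at least one pair loses both nodes), assuming independent failures."""
    prob_ok = 1.0
    for i, j in pairs:
        prob_ok *= 1.0 - p[i] * p[j]
    return 1.0 - prob_ok

def random_pairing(nodes):
    nodes = nodes[:]
    random.shuffle(nodes)
    return list(zip(nodes[0::2], nodes[1::2]))

def sorted_pairing(nodes, p):
    # Pair the most failure-prone node with the most reliable one, and so on.
    by_risk = sorted(nodes, key=lambda n: p[n])
    return list(zip(by_risk, reversed(by_risk)))[: len(nodes) // 2]

if __name__ == "__main__":
    random.seed(1)
    p = {n: random.uniform(1e-4, 5e-2) for n in range(1000)}  # per-window failure prob
    nodes = list(p)
    print("random pairing:", catastrophic_probability(random_pairing(nodes), p))
    print("sorted pairing:", catastrophic_probability(sorted_pairing(nodes, p), p))
```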
{"title":"Optimal Placement of In-memory Checkpoints Under Heterogeneous Failure Likelihoods","authors":"Zaeem Hussain, T. Znati, R. Melhem","doi":"10.1109/IPDPS.2019.00098","DOIUrl":"https://doi.org/10.1109/IPDPS.2019.00098","url":null,"abstract":"In-memory checkpointing has increased in popularity over the years because it significantly improves the time to take a checkpoint. It is usually accomplished by placing all or part of a processor's checkpoint into the local memory of a remote node within the cluster. If, however, the checkpointed node and the node containing its checkpoint both fail in quick succession, recovery using in-memory checkpoints becomes impossible. In this paper, we explore the problem of placing in-memory checkpoints among nodes whose individual failure likelihoods are not identical. We provide theoretical results on the optimal way to place in-memory checkpoints such that the probability of occurrence of a catastrophic failure, i.e. failure of a node as well as the node containing its checkpoint, is minimized. Using the failure logs spread over 5 years of a 49,152 node supercomputer, we show that checkpoint placement schemes that utilize knowledge of node failure likelihoods, and are guided by the theoretical results we provide, can significantly reduce the total number of such catastrophic failures when compared with placement schemes that are oblivious of the heterogeneity in nodes based on their failure likelihoods.","PeriodicalId":403406,"journal":{"name":"2019 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"14 43 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2019-05-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124746958","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}