BD-CATS: big data clustering at trillion particle scale
Md. Mostofa Ali Patwary, S. Byna, N. Satish, N. Sundaram, Z. Lukic, V. Roytershteyn, Michael J. Anderson, Yushu Yao, Prabhat, P. Dubey
Modern cosmology and plasma physics codes are now capable of simulating trillions of particles on petascale systems. Each timestep output from such simulations is on the order of tens of terabytes. Summarizing and analyzing raw particle data is challenging, and scientists often focus on density structures, whether in real 3D space or in a higher-dimensional phase space. In this work, we develop a highly scalable version of the clustering algorithm DBSCAN and apply it to the largest datasets produced by state-of-the-art codes. Our system, called BD-CATS, is the first capable of performing end-to-end analysis at trillion-particle scale (including loading the data, geometric partitioning, computing kd-trees, performing clustering analysis, and storing the results). We show analysis of 1.4 trillion particles from a plasma physics simulation and a 10,240³-particle cosmological simulation, utilizing ~100,000 cores in 30 minutes. BD-CATS is helping infer mechanisms behind particle acceleration in plasma physics and holds promise for qualitatively superior clustering in cosmology. Both of these results were previously intractable at the trillion-particle scale.
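As context for the scaling work, a minimal single-node sketch of the DBSCAN logic being parallelized is shown below, using a kd-tree for the eps-range neighborhood queries just as the pipeline above does. The parameter names (eps, min_pts) are the standard DBSCAN ones; this is an illustration of the algorithm, not BD-CATS's distributed implementation.

```python
import numpy as np
from scipy.spatial import cKDTree

def dbscan(points, eps, min_pts):
    tree = cKDTree(points)               # kd-tree accelerates the eps-range queries
    labels = np.full(len(points), -1)    # -1 = unassigned (noise unless claimed later)
    cluster = 0
    for i in range(len(points)):
        if labels[i] != -1:
            continue                     # already assigned to some cluster
        seeds = tree.query_ball_point(points[i], eps)
        if len(seeds) < min_pts:
            continue                     # not a core point; stays noise unless a
                                         # later expansion claims it as a border point
        labels[i] = cluster
        queue = list(seeds)
        while queue:                     # grow the cluster from its core points
            j = queue.pop()
            if labels[j] == -1:
                labels[j] = cluster
                nbrs = tree.query_ball_point(points[j], eps)
                if len(nbrs) >= min_pts: # j is itself a core point: keep expanding
                    queue.extend(nbrs)
        cluster += 1
    return labels
```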
{"title":"BD-CATS: big data clustering at trillion particle scale","authors":"Md. Mostofa Ali Patwary, S. Byna, N. Satish, N. Sundaram, Z. Lukic, V. Roytershteyn, Michael J. Anderson, Yushu Yao, Prabhat, P. Dubey","doi":"10.1145/2807591.2807616","DOIUrl":"https://doi.org/10.1145/2807591.2807616","url":null,"abstract":"Modern cosmology and plasma physics codes are now capable of simulating trillions of particles on petascale systems. Each timestep output from such simulations is on the order of 10s of TBs. Summarizing and analyzing raw particle data is challenging, and scientists often focus on density structures, whether in the real 3D space, or a high-dimensional phase space. In this work, we develop a highly scalable version of the clustering algorithm Dbscan, and apply it to the largest datasets produced by state-of-the-art codes. Our system, called Bd-Cats, is the first one capable of performing end-to-end analysis at trillion particle scale (including: loading the data, geometric partitioning, computing kd-trees, performing clustering analysis, and storing the results). We show analysis of 1.4 trillion particles from a plasma physics simulation, and a 10,2403 particle cosmological simulation, utilizing ~100,000 cores in 30 minutes. Bd-Cats is helping infer mechanisms behind particle acceleration in plasma physics and holds promise for qualitatively superior clustering in cosmology. Both of these results were previously intractable at the trillion particle scale.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134271167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enterprise: breadth-first graph traversal on GPUs
Hang Liu, H. Howie Huang
The Breadth-First Search (BFS) algorithm serves as the foundation for many graph-processing applications and analytics workloads. While Graphics Processing Units (GPUs) offer massive parallelism, achieving high-performance BFS on GPUs entails efficient scheduling of a large number of GPU threads and effective utilization of the GPU memory hierarchy. In this paper, we present Enterprise, a new GPU-based BFS system that combines three techniques to remove potential performance bottlenecks: (1) streamlined GPU thread scheduling that constructs a frontier queue without contention from concurrent threads, containing no duplicate frontiers and optimized for both top-down and bottom-up BFS; (2) GPU workload balancing that classifies frontiers by out-degree to utilize the full spectrum of GPU parallel granularity, which significantly increases thread-level parallelism; and (3) GPU-based BFS direction optimization that quantifies the effect of hub vertices on direction switching and selectively caches a small set of critical hub vertices in the limited GPU shared memory to reduce expensive random data accesses. We have evaluated Enterprise on a large variety of graphs with different GPU devices. Enterprise achieves up to 76 billion traversed edges per second (TEPS) on a single NVIDIA Kepler K40 and up to 122 billion TEPS on two GPUs, which ranks No. 45 on the November 2014 Graph 500 list. Enterprise is also highly energy-efficient, ranking No. 1 in the GreenGraph 500 (small data category) and delivering 446 million TEPS per watt.
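Technique (3) builds on direction-optimizing BFS. As a concrete illustration, here is a plain-Python sketch of that top-down/bottom-up control flow; the alpha threshold follows the well-known heuristic of Beamer et al. and is not Enterprise's exact policy, and the set-based duplicate filter stands in for its contention-free, duplicate-free GPU frontier queue.

```python
# Assumes an undirected graph given as an adjacency list (list of lists).
def hybrid_bfs(adj, src, alpha=14.0):
    n = len(adj)
    depth = [-1] * n
    depth[src] = 0
    frontier, level = [src], 0
    while frontier:
        frontier_edges = sum(len(adj[v]) for v in frontier)
        unvisited = [v for v in range(n) if depth[v] == -1]
        remaining_edges = sum(len(adj[v]) for v in unvisited)
        nxt = []
        if frontier_edges * alpha > remaining_edges:
            # Bottom-up: each unvisited vertex looks for any parent in the
            # frontier, stopping at the first hit instead of scanning all edges.
            in_frontier = set(frontier)
            for v in unvisited:
                if any(u in in_frontier for u in adj[v]):
                    depth[v] = level + 1
                    nxt.append(v)
        else:
            # Top-down: frontier vertices push to unvisited neighbors; 'seen'
            # keeps the next frontier free of duplicates.
            seen = set()
            for u in frontier:
                for v in adj[u]:
                    if depth[v] == -1 and v not in seen:
                        seen.add(v)
                        depth[v] = level + 1
                        nxt.append(v)
        frontier = nxt
        level += 1
    return depth
```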
{"title":"Enterprise: breadth-first graph traversal on GPUs","authors":"Hang Liu, H. Howie Huang","doi":"10.1145/2807591.2807594","DOIUrl":"https://doi.org/10.1145/2807591.2807594","url":null,"abstract":"The Breadth-First Search (BFS) algorithm serves as the foundation for many graph-processing applications and analytics workloads. While Graphics Processing Unit (GPU) offers massive parallelism, achieving high-performance BFS on GPUs entails efficient scheduling of a large number of GPU threads and effective utilization of GPU memory hierarchy. In this paper, we present Enterprise, a new GPU-based BFS system that combines three techniques to remove potential performance bottlenecks: (1) streamlined GPU threads scheduling through constructing a frontier queue without contention from concurrent threads, yet containing no duplicated frontiers and optimized for both top-down and bottom-up BFS. (2) GPU workload balancing that classifies the frontiers based on different out-degrees to utilize the full spectrum of GPU parallel granularity, which significantly increases thread-level parallelism; and (3) GPU based BFS direction optimization quantifies the effect of hub vertices on direction-switching and selectively caches a small set of critical hub vertices in the limited GPU shared memory to reduce expensive random data accesses. We have evaluated Enterprise on a large variety of graphs with different GPU devices. Enterprise achieves up to 76 billion traversed edges per second (TEPS) on a single NVIDIA Kepler K40, and up to 122 billion TEPS on two GPUs that ranks No. 45 in the Graph 500 on November 2014. Enterprise is also very energy-efficient as No. 1 in the GreenGraph 500 (small data category), delivering 446 million TEPS per watt.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132538228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CilkSpec: optimistic concurrency for Cilk
Shaizeen Aga, S. Krishnamoorthy, S. Narayanasamy
Recursive parallel programming models such as Cilk strive to simplify the task of parallel programming by enabling a simple divide-and-conquer programming model. This model is effective in recursively partitioning work into smaller parts and combining their results. However, recursive work partitioning can impose stricter constraints on concurrency than are implied by the true dependencies in a program. In this paper, we present a speculation-based approach to alleviate the concurrency constraints imposed by such recursive parallel programs. We design a runtime infrastructure that supports speculative execution and a predictor that accurately learns and identifies opportunities to relax extraneous concurrency constraints. Experimental evaluation demonstrates that speculative relaxation of concurrency constraints can deliver gains of up to 1.6x on 30 cores over baseline Cilk.
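The predictor can be pictured as per-sync-site history. The toy model below (all names are illustrative assumptions, not CilkSpec's API) predicts that a sync point is safe to speculate past after a streak of conflict-free executions, and resets when a misspeculation is detected:

```python
class SyncPredictor:
    """Toy per-sync-site predictor: speculate past a sync only after a
    streak of conflict-free executions at that site."""
    def __init__(self, threshold=4):
        self.clean_runs = {}          # sync site -> consecutive conflict-free runs
        self.threshold = threshold

    def predict_relax(self, site):
        return self.clean_runs.get(site, 0) >= self.threshold

    def record(self, site, conflicted):
        # A detected conflict (misspeculation) resets the streak; a real
        # runtime would also roll back and re-execute the affected strand.
        self.clean_runs[site] = 0 if conflicted else self.clean_runs.get(site, 0) + 1
```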
{"title":"CilkSpec: optimistic concurrency for Cilk","authors":"Shaizeen Aga, S. Krishnamoorthy, S. Narayanasamy","doi":"10.1145/2807591.2807597","DOIUrl":"https://doi.org/10.1145/2807591.2807597","url":null,"abstract":"Recursive parallel programming models such as Cilk strive to simplify the task of parallel programming by enabling a simple divide-and-conquer programming model. This model is effective in recursively partitioning work into smaller parts and combining their results. However, recursive work partitioning can impose additional constraints on concurrency than is implied by the true dependencies in a program. In this paper, we present a speculation-based approach to alleviate the concurrency constraints imposed by such recursive parallel programs. We design a runtime infrastructure that supports speculative execution and a predictor to accurately learn and identify opportunities to relax extraneous concurrency constraints. Experimental evaluation demonstrates that speculative relaxation of concurrency constraints can deliver gains of up to 1.6x on 30 cores over baseline Cilk.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116354961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HydraDB: a resilient RDMA-driven key-value middleware for in-memory cluster computing
Yandong Wang, Li Zhang, Jian Tan, Min Li, Yuqing Gao, X. Guerin, Xiaoqiao Meng, S. Meng
In this paper, we describe our experiences and lessons learned from building a general-purpose in-memory key-value middleware called HydraDB. HydraDB synthesizes a collection of state-of-the-art techniques, including continuous fault tolerance, Remote Direct Memory Access (RDMA), and multicore awareness, to deliver a high-throughput, low-latency access service in a reliable manner for cluster computing applications. The uniqueness of HydraDB lies mainly in its design commitment to fully exploit the RDMA protocol to comprehensively optimize various aspects of a general-purpose key-value store, including latency-critical operations, read enhancement, and data replication for high-availability service. At the same time, HydraDB strives to use multicore systems efficiently, so that data manipulation on the servers does not curb the potential of RDMA. Many teams in our organization have adopted HydraDB to improve the execution of their cluster computing frameworks, including Hadoop, Spark, Sensemaking analytics, and Call Record Processing. In addition, our performance evaluation with a variety of YCSB workloads shows that HydraDB can outperform several existing in-memory key-value stores by an order of magnitude. Our detailed performance evaluation further corroborates our design choices.
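To make the one-sided-read idea behind RDMA key-value stores concrete, here is a seqlock-style sketch in plain Python. It illustrates the general technique, not HydraDB's actual format or API: the client computes the slot location itself, reads it without involving the server CPU, and uses a per-slot version counter to detect and retry reads that race with a concurrent update. Slot layout and sizes are illustrative assumptions.

```python
import struct

SLOT = 64                                  # fixed-size slot: 8B version + payload
N_SLOTS = 1024
memory = bytearray(SLOT * N_SLOTS)         # stands in for server-registered memory

def rdma_read(offset, length):
    # Placeholder for a one-sided RDMA read: the server CPU is not involved.
    return bytes(memory[offset:offset + length])

def kv_get(key):
    off = (hash(key) % N_SLOTS) * SLOT     # client computes the location itself
    while True:
        blob = rdma_read(off, SLOT)
        (ver,) = struct.unpack_from("<Q", blob, 0)
        if ver % 2 == 0:                   # even version => no writer mid-update
            (ver2,) = struct.unpack_from("<Q", rdma_read(off, 8), 0)
            if ver2 == ver:                # unchanged across the read => consistent
                return blob[8:]
        # odd or changed version: a concurrent update raced with us; retry

def kv_put(key, value):                    # server-side update (simplified);
    off = (hash(key) % N_SLOTS) * SLOT     # value must fit in SLOT - 8 bytes
    (ver,) = struct.unpack_from("<Q", memory, off)
    struct.pack_into("<Q", memory, off, ver + 1)   # odd: update in progress
    memory[off + 8:off + 8 + len(value)] = value
    struct.pack_into("<Q", memory, off, ver + 2)   # even: update complete
```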
{"title":"HydraDB: a resilient RDMA-driven key-value middleware for in-memory cluster computing","authors":"Yandong Wang, Li Zhang, Jian Tan, Min Li, Yuqing Gao, X. Guerin, Xiaoqiao Meng, S. Meng","doi":"10.1145/2807591.2807614","DOIUrl":"https://doi.org/10.1145/2807591.2807614","url":null,"abstract":"In this paper, we describe our experiences and lessons learned from building a general-purpose in-memory key-value middleware, called HydraDB. HydraDB synthesizes a collection of state-of-the-art techniques, including continuous fault-tolerance, Remote Direct Memory Access (RDMA), as well as awareness for multicore systems, etc, to deliver a high-throughput, low-latency access service in a reliable manner for cluster computing applications. The uniqueness of HydraDB mainly lies in its design commitment to fully exploit the RDMA protocol to comprehensively optimize various aspects of a general-purpose key-value store, including latency-critical operations, read enhancement, and data replications for high-availability service, etc. At the same time, HydraDB strives to efficiently utilize multicore systems to prevent data manipulation on the servers from curbing the potential of RDMA. Many teams in our organization have adopted HydraDB to improve the execution of their cluster computing frameworks, including Hadoop, Spark, Sensemaking analytics, and Call Record Processing. In addition, our performance evaluation with a variety of YCSB workloads also shows that HydraDB can substantially outperform several existing in-memory key-value stores by an order of magnitude. Our detailed performance evaluation further corroborates our design choices.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130958396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Performance optimization for the k-nearest neighbors kernel on x86 architectures
Chenhan D. Yu, Jianyu Huang, W. Austin, Bo Xiao, G. Biros
Nearest neighbor search is a cornerstone problem in computational geometry, non-parametric statistics, and machine learning. For N points, exhaustive search requires quadratic work, but many fast algorithms reduce the complexity of exact and approximate searches. The kernel common to all these algorithms (the kNN kernel) solves many small problems exactly using exhaustive search. We propose an efficient implementation and performance analysis of the kNN kernel on x86 architectures. By fusing the distance calculation with the neighbor selection, we are able to make full use of the available memory throughput. We present an analysis of the algorithm and explain parameter selection. We perform an experimental study varying the size of the problem, the dimension of the dataset, and the number of nearest neighbors. Overall we observe significant speedups. For example, when searching for 16 neighbors in a dataset of 1.6 million points in 64 dimensions, our kernel is over 4 times faster than existing methods.
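The fusion the abstract describes can be seen in miniature below: expanding ||x − q||² = ||x||² − 2x·q + ||q||² turns the distance computation into a matrix multiply, and a partial selection replaces a full sort. This numpy version is a reference sketch, not the paper's hand-tuned x86 kernel.

```python
import numpy as np

def knn(points, queries, k):
    # points: (N, d), queries: (M, d); returns the k nearest point indices per query.
    p2 = (points * points).sum(axis=1)            # ||x||^2 for every point
    d2 = p2[None, :] - 2.0 * queries @ points.T   # ||q||^2 omitted: constant per row
    idx = np.argpartition(d2, k, axis=1)[:, :k]   # O(N) selection instead of a sort
    rows = np.arange(queries.shape[0])[:, None]
    order = np.argsort(d2[rows, idx], axis=1)     # sort only the k survivors
    return np.take_along_axis(idx, order, axis=1)
```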
{"title":"Performance optimization for the k-nearest neighbors kernel on x86 architectures","authors":"Chenhan D. Yu, Jianyu Huang, W. Austin, Bo Xiao, G. Biros","doi":"10.1145/2807591.2807601","DOIUrl":"https://doi.org/10.1145/2807591.2807601","url":null,"abstract":"Nearest neighbor search is a cornerstone problem in computational geometry, non-parametric statistics, and machine learning. For N points, exhaustive search requires quadratic work, but many fast algorithms reduce the complexity for exact and approximate searches. The common kernel (kNN kernel) in all these algorithms solves many small-size problems exactly using exhaustive search. We propose an efficient implementation and performance analysis for the kNN kernel on x86 architectures. By fusing the distance calculation with the neighbor selection, we are able to utilize memory throughput. We present an analysis of the algorithm and explain parameter selection. We perform an experimental study varying the size of the problem, the dimension of the dataset, and the number of nearest neighbors. Overall we observe significant speedups. For example, when searching for 16 neighbors in a point dataset with 1.6 million points in 64 dimensions, our kernel is over 4 times faster than existing methods.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"242 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131995593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An input-adaptive and in-place approach to dense tensor-times-matrix multiply
Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, R. Vuduc
This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (TTM) of arbitrary dimension. Whereas conventional implementations of TTM rely on explicitly converting the input tensor operand into a matrix---in order to be able to use any available and fast general matrix-matrix multiply (GEMM) implementation---our framework's strategy is to carry out the TTM in-place, avoiding this copy. As the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the TTM's inputs. When compared to widely used single-node TTM implementations available in the Tensor Toolbox and the Cyclops Tensor Framework (CTF), InTensLi's in-place and input-adaptive TTM implementations achieve 4× and 13× speedups, showing GEMM-like performance on a variety of input sizes.
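For reference, the operation under discussion is the mode-n product Y = X ×ₙ U. The numpy sketch below expresses the contraction directly rather than unfolding X into a matrix first, which is the spirit (though not the code) of InTensLi's copy-avoiding formulation:

```python
import numpy as np

def ttm(X, U, mode):
    # X: dense tensor; U: (J, I_mode) matrix. Contracts X's 'mode' axis with U.
    Y = np.tensordot(U, X, axes=(1, mode))  # the result's mode axis comes out first
    return np.moveaxis(Y, 0, mode)          # move it back into place

# Example: a 3-way tensor, multiplied along mode 1.
X = np.random.rand(30, 40, 50)
U = np.random.rand(8, 40)
Y = ttm(X, U, mode=1)
assert Y.shape == (30, 8, 50)
```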
{"title":"An input-adaptive and in-place approach to dense tensor-times-matrix multiply","authors":"Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, R. Vuduc","doi":"10.1145/2807591.2807671","DOIUrl":"https://doi.org/10.1145/2807591.2807671","url":null,"abstract":"This paper describes a novel framework, called I<scp>n</scp>T<scp>ens</scp>L<scp>i</scp> (\"intensely\"), for producing fast single-node implementations of dense tensor-times-matrix multiply (T<scp>tm</scp>) of arbitrary dimension. Whereas conventional implementations of T<scp>tm</scp> rely on explicitly converting the input tensor operand into a matrix---in order to be able to use any available and fast general matrix-matrix multiply (G<scp>emm</scp>) implementation---our framework's strategy is to carry out the T<scp>tm</scp> <i>in-place</i>, avoiding this copy. As the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the T<scp>tm</scp>'s inputs. When compared to widely used single-node T<scp>tm</scp> implementations that are available in the Tensor Toolbox and Cyclops Tensor Framework (C<scp>tf</scp>), In-TensLi's in-place and input-adaptive T<scp>tm</scp> implementations achieve 4× and 13× speedups, showing Gemm-like performance on a variety of input sizes.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114287709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Bridging OpenCL and CUDA: a comparative analysis and translation
Junghyun Kim, Thanh Tuan Dao, Jaehoon Jung, Jinyoung Joo, Jaejin Lee
Heterogeneous systems are widening their user base, and heterogeneous computing is becoming popular in supercomputing. OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and the two share almost the same platform model, they are not compatible with each other. In this paper, we present the similarities and differences between them and propose an automatic translation framework that works in both directions: OpenCL to CUDA and CUDA to OpenCL. We describe the features that make translation from one to the other difficult and provide our solutions. We show that our translator achieves comparable performance between the original and translated applications in both directions. Since each programming model has a wide user base and a large code base, our translation framework helps extend the code base of each programming model and unifies the efforts to develop applications for heterogeneous systems.
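Many of the correspondences such a translator relies on are direct; a few well-known ones are tabulated below. This is an illustrative subset only, and the hard cases the paper addresses are precisely the features that do not map one-to-one.

```python
# Direct OpenCL -> CUDA correspondences (1-D case shown for the index queries).
OPENCL_TO_CUDA = {
    "get_global_id(0)":  "blockIdx.x * blockDim.x + threadIdx.x",
    "get_local_id(0)":   "threadIdx.x",
    "get_group_id(0)":   "blockIdx.x",
    "get_local_size(0)": "blockDim.x",
    "barrier(CLK_LOCAL_MEM_FENCE)": "__syncthreads()",
    "__kernel":          "__global__",
    "__local":           "__shared__",
    "__global":          "",   # CUDA pointer parameters default to global memory
}
```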
{"title":"Bridging OpenCL and CUDA: a comparative analysis and translation","authors":"Junghyun Kim, Thanh Tuan Dao, Jaehoon Jung, Jinyoung Joo, Jaejin Lee","doi":"10.1145/2807591.2807621","DOIUrl":"https://doi.org/10.1145/2807591.2807621","url":null,"abstract":"Heterogeneous systems are widening their user-base, and heterogeneous computing is becoming popular in supercomputing. Among others, OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and they have almost the same platform model, they are not compatible with each other. In this paper, we present similarities and differences between them and propose an automatic translation framework for both OpenCL to CUDA and CUDA to OpenCL. We describe features that make it difficult to translate from one to the other and provide our solution. We show that our translator achieves comparable performance between the original and target applications in both directions. Since each programming model separately has a wide user-base and large code-base, our translation framework is useful to extend the code-base for each programming model and unifies the efforts to develop applications for heterogeneous systems.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123249691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Memory access patterns: the missing piece of the multi-GPU puzzle
Tal Ben-Nun, Ely Levy, A. Barak, Erik Rubin
With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency, high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions, and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to run efficiently on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as on real-world applications in deep learning and multivariate analysis.
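The core idea can be sketched as follows: given the memory access pattern of a kernel (here, a radius-r 1-D stencil), one can derive both the per-GPU partition and the halo cells that must be exchanged, rather than having the programmer code the boundary logic by hand. The function below is an illustration under that assumption, not the MAPS-Multi API.

```python
def partition_stencil(n, num_gpus, radius):
    """Return (start, end, halo_lo, halo_hi) index ranges per GPU for a
    1-D array of n cells accessed by a radius-`radius` stencil."""
    chunk = (n + num_gpus - 1) // num_gpus
    parts = []
    for g in range(num_gpus):
        start, end = g * chunk, min((g + 1) * chunk, n)
        halo_lo = (max(start - radius, 0), start)  # cells owned by the left GPU
        halo_hi = (end, min(end + radius, n))      # cells owned by the right GPU
        parts.append((start, end, halo_lo, halo_hi))
    return parts
```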
{"title":"Memory access patterns: the missing piece of the multi-GPU puzzle","authors":"Tal Ben-Nun, Ely Levy, A. Barak, Erik Rubin","doi":"10.1145/2807591.2807611","DOIUrl":"https://doi.org/10.1145/2807591.2807611","url":null,"abstract":"With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to efficiently run on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that the performance of MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as real-world applications in deep learning and multivariate analysis.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"03 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127182183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The in-silico lab-on-a-chip: petascale and high-throughput simulations of microfluidics at cell resolution
D. Rossinelli, Yu-Hang Tang, K. Lykov, D. Alexeev, M. Bernaschi, P. Hadjidoukas, M. Bisson, W. Joubert, Christian Conti, G. Karniadakis, M. Fatica, I. Pivkin, P. Koumoutsakos
We present simulations of blood and cancer cell separation in complex microfluidic channels with subcellular resolution, demonstrating unprecedented time to solution and performing at 65.5% of the available 39.4 PetaInstructions/s on the 18,688 nodes of the Titan supercomputer. These simulations outperform the current state of the art by one to three orders of magnitude in the number of simulated cells and computational elements. The computational setup emulates the conditions and the geometric complexity of microfluidic experiments, and our results reproduce the experimental findings. The simulations provide sub-micron resolution while accessing time scales relevant to engineering designs. We demonstrate an improvement of up to 45× over competing state-of-the-art solvers, thus establishing the frontiers of simulation by particle-based methods. Our simulations redefine the role of computational science in the development of microfluidics -- a technology that is becoming as important to medicine as integrated circuits have been to computers.
{"title":"The in-silico lab-on-a-chip: petascale and high-throughput simulations of microfluidics at cell resolution","authors":"D. Rossinelli, Yu-Hang Tang, K. Lykov, D. Alexeev, M. Bernaschi, P. Hadjidoukas, M. Bisson, W. Joubert, Christian Conti, G. Karniadakis, M. Fatica, I. Pivkin, P. Koumoutsakos","doi":"10.1145/2807591.2807677","DOIUrl":"https://doi.org/10.1145/2807591.2807677","url":null,"abstract":"We present simulations of blood and cancer cell separation in complex microfluidic channels with subcellular resolution, demonstrating unprecedented time to solution, performing at 65.5% of the available 39.4 PetaInstructions/s in the 18, 688 nodes of the Titan supercomputer. These simulations outperform by one to three orders of magnitude the current state of the art in terms of numbers of simulated cells and computational elements. The computational setup emulates the conditions and the geometric complexity of microfluidic experiments and our results reproduce the experimental findings. These simulations provide sub-micron resolution while accessing time scales relevant to engineering designs. We demonstrate an improvement of up to 45X over competing state-of-the-art solvers, thus establishing the frontiers of simulations by particle based methods. Our simulations redefine the role of computational science for the development of microfluidics -- a technology that is becoming as important to medicine as integrated circuits have been to computers.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2013 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127385313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fault tolerant MapReduce-MPI for HPC clusters
Yanfei Guo, Wesley Bland, P. Balaji, Xiaobo Zhou
Building MapReduce applications using the Message Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, due to the lack of native fault-tolerance support in MPI and the incompatibility between the MapReduce fault-tolerance model and HPC schedulers, it is very hard to provide a fault-tolerant MapReduce runtime for HPC clusters. We propose and develop FT-MRMPI, the first fault-tolerant MapReduce framework on MPI for HPC clusters. We discover a unique way to perform failure detection and recovery by exploiting current MPI semantics and the proposed user-level failure mitigation (ULFM) extensions. We design and develop a checkpoint/restart model for fault-tolerant MapReduce in MPI. We further tailor the detect/resume model to conserve work for more efficient fault tolerance. Experimental results on a 256-node HPC cluster show that FT-MRMPI effectively masks failures and reduces job completion time by 39%.
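A minimal sketch of the checkpoint/restart idea in a map task follows (plain Python for clarity, not FT-MRMPI's MPI implementation; the checkpoint path and the record-count granularity are illustrative assumptions): progress is persisted periodically, and a restarted task resumes from the last checkpoint instead of reprocessing its whole input split.

```python
import json, os

def run_map_task(split, map_fn, ckpt_path, interval=1000):
    done = 0
    if os.path.exists(ckpt_path):                 # restarted after a failure:
        with open(ckpt_path) as f:                # resume, don't reprocess
            done = json.load(f)["records_done"]
    for i in range(done, len(split)):
        map_fn(split[i])
        if (i + 1) % interval == 0:               # periodic checkpoint of progress
            with open(ckpt_path, "w") as f:
                json.dump({"records_done": i + 1}, f)
    if os.path.exists(ckpt_path):
        os.remove(ckpt_path)                      # task finished cleanly
```

A production version would write the checkpoint to a temporary file and rename it atomically so a crash mid-write cannot corrupt the recovery state.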
{"title":"Fault tolerant MapReduce-MPI for HPC clusters","authors":"Yanfei Guo, Wesley Bland, P. Balaji, Xiaobo Zhou","doi":"10.1145/2807591.2807617","DOIUrl":"https://doi.org/10.1145/2807591.2807617","url":null,"abstract":"Building MapReduce applications using the Message-Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, due to the lacking of native fault tolerance support in MPI and the incompatibility between the MapReduce fault tolerance model and HPC schedulers, it is very hard to provide a fault tolerant MapReduce runtime for HPC clusters. We propose and develop FT-MRMPI, the first fault tolerant MapReduce framework on MPI for HPC clusters. We discover a unique way to perform failure detection and recovery by exploiting the current MPI semantics and the new proposal of user-level failure mitigation. We design and develop the checkpoint/restart model for fault tolerant MapReduce in MPI. We further tailor the detect/resume model to conserve work for more efficient fault tolerance. The experimental results on a 256-node HPC cluster show that FT-MRMPI effectively masks failures and reduces the job completion time by 39%.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116441477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}