
Latest publications from SC15: International Conference for High Performance Computing, Networking, Storage and Analysis

BD-CATS: big data clustering at trillion particle scale
Md. Mostofa Ali Patwary, S. Byna, N. Satish, N. Sundaram, Z. Lukic, V. Roytershteyn, Michael J. Anderson, Yushu Yao, Prabhat, P. Dubey
Modern cosmology and plasma physics codes are now capable of simulating trillions of particles on petascale systems. Each timestep output from such simulations is on the order of 10s of TBs. Summarizing and analyzing raw particle data is challenging, and scientists often focus on density structures, whether in the real 3D space, or a high-dimensional phase space. In this work, we develop a highly scalable version of the clustering algorithm DBSCAN, and apply it to the largest datasets produced by state-of-the-art codes. Our system, called BD-CATS, is the first one capable of performing end-to-end analysis at trillion particle scale (including: loading the data, geometric partitioning, computing kd-trees, performing clustering analysis, and storing the results). We show analysis of 1.4 trillion particles from a plasma physics simulation, and a 10,240³ particle cosmological simulation, utilizing ~100,000 cores in 30 minutes. BD-CATS is helping infer mechanisms behind particle acceleration in plasma physics and holds promise for qualitatively superior clustering in cosmology. Both of these results were previously intractable at the trillion particle scale.
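
The clustering step can be illustrated with a minimal single-node DBSCAN sketch (a hedged illustration only: BD-CATS distributes this over ~100,000 cores using geometric partitioning and kd-trees, whereas the version below uses a brute-force neighbor search; `eps`, `min_pts`, and the toy data are illustrative choices, not values from the paper).

```python
import numpy as np

def dbscan(points, eps, min_pts):
    """Label points with cluster ids; -1 marks noise. Brute-force O(N^2) neighbor search."""
    n = len(points)
    labels = np.full(n, -1)            # -1 = noise / unassigned
    visited = np.zeros(n, dtype=bool)
    cluster = 0
    # Pairwise distances; BD-CATS replaces this with distributed kd-tree queries.
    dists = np.linalg.norm(points[:, None, :] - points[None, :, :], axis=-1)
    neighbors = [np.where(dists[i] <= eps)[0] for i in range(n)]

    for i in range(n):
        if visited[i]:
            continue
        visited[i] = True
        if len(neighbors[i]) < min_pts:
            continue                    # not a core point; stays noise unless reached later
        labels[i] = cluster             # grow a new cluster from this core point
        queue = list(neighbors[i])
        while queue:
            j = queue.pop()
            if not visited[j]:
                visited[j] = True
                if len(neighbors[j]) >= min_pts:
                    queue.extend(neighbors[j])   # j is also a core point; expand through it
            if labels[j] == -1:
                labels[j] = cluster
        cluster += 1
    return labels

# Toy usage: two well-separated blobs in 3D should yield two cluster labels.
rng = np.random.default_rng(0)
pts = np.vstack([rng.normal(0, 0.1, (50, 3)), rng.normal(5, 0.1, (50, 3))])
print(np.unique(dbscan(pts, eps=0.5, min_pts=5)))
```
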
{"title":"BD-CATS: big data clustering at trillion particle scale","authors":"Md. Mostofa Ali Patwary, S. Byna, N. Satish, N. Sundaram, Z. Lukic, V. Roytershteyn, Michael J. Anderson, Yushu Yao, Prabhat, P. Dubey","doi":"10.1145/2807591.2807616","DOIUrl":"https://doi.org/10.1145/2807591.2807616","url":null,"abstract":"Modern cosmology and plasma physics codes are now capable of simulating trillions of particles on petascale systems. Each timestep output from such simulations is on the order of 10s of TBs. Summarizing and analyzing raw particle data is challenging, and scientists often focus on density structures, whether in the real 3D space, or a high-dimensional phase space. In this work, we develop a highly scalable version of the clustering algorithm Dbscan, and apply it to the largest datasets produced by state-of-the-art codes. Our system, called Bd-Cats, is the first one capable of performing end-to-end analysis at trillion particle scale (including: loading the data, geometric partitioning, computing kd-trees, performing clustering analysis, and storing the results). We show analysis of 1.4 trillion particles from a plasma physics simulation, and a 10,2403 particle cosmological simulation, utilizing ~100,000 cores in 30 minutes. Bd-Cats is helping infer mechanisms behind particle acceleration in plasma physics and holds promise for qualitatively superior clustering in cosmology. Both of these results were previously intractable at the trillion particle scale.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"33 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134271167","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
Enterprise: breadth-first graph traversal on GPUs
Hang Liu, H. Howie Huang
The Breadth-First Search (BFS) algorithm serves as the foundation for many graph-processing applications and analytics workloads. While the Graphics Processing Unit (GPU) offers massive parallelism, achieving high-performance BFS on GPUs entails efficient scheduling of a large number of GPU threads and effective utilization of the GPU memory hierarchy. In this paper, we present Enterprise, a new GPU-based BFS system that combines three techniques to remove potential performance bottlenecks: (1) streamlined GPU thread scheduling, which constructs a frontier queue without contention from concurrent threads, contains no duplicated frontiers, and is optimized for both top-down and bottom-up BFS; (2) GPU workload balancing, which classifies frontiers by out-degree to utilize the full spectrum of GPU parallel granularity and significantly increases thread-level parallelism; and (3) GPU-based BFS direction optimization, which quantifies the effect of hub vertices on direction switching and selectively caches a small set of critical hub vertices in the limited GPU shared memory to reduce expensive random data accesses. We have evaluated Enterprise on a large variety of graphs with different GPU devices. Enterprise achieves up to 76 billion traversed edges per second (TEPS) on a single NVIDIA Kepler K40, and up to 122 billion TEPS on two GPUs, a result that ranked No. 45 on the Graph 500 list of November 2014. Enterprise is also very energy-efficient, ranking No. 1 on the GreenGraph 500 (small data category) and delivering 446 million TEPS per watt.
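
A CPU-side sketch of the top-down versus bottom-up direction switching that Enterprise optimizes on the GPU (a minimal illustration, assuming an undirected graph given as adjacency lists; the frontier-size threshold `alpha` is a made-up switching heuristic, not the paper's hub-vertex-based criterion).

```python
def bfs_direction_optimizing(adj, source, alpha=0.05):
    """Level-synchronous BFS that switches between top-down and bottom-up expansion.

    adj: list of neighbor lists for an undirected graph; alpha: frontier-size
    fraction of |V| beyond which the bottom-up direction is used.
    """
    n = len(adj)
    dist = [-1] * n
    dist[source] = 0
    frontier = [source]
    level = 0
    while frontier:
        next_frontier = []
        if len(frontier) < alpha * n:
            # Top-down: expand every frontier vertex over its out-edges.
            for u in frontier:
                for v in adj[u]:
                    if dist[v] == -1:
                        dist[v] = level + 1
                        next_frontier.append(v)
        else:
            # Bottom-up: every unvisited vertex scans its edges for a frontier parent.
            in_frontier = [False] * n
            for u in frontier:
                in_frontier[u] = True
            for v in range(n):
                if dist[v] == -1:
                    for u in adj[v]:
                        if in_frontier[u]:
                            dist[v] = level + 1
                            next_frontier.append(v)
                            break
        frontier = next_frontier
        level += 1
    return dist

# Toy usage: a path graph 0-1-2-3 yields distances [0, 1, 2, 3].
print(bfs_direction_optimizing([[1], [0, 2], [1, 3], [2]], source=0))
```
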
{"title":"Enterprise: breadth-first graph traversal on GPUs","authors":"Hang Liu, H. Howie Huang","doi":"10.1145/2807591.2807594","DOIUrl":"https://doi.org/10.1145/2807591.2807594","url":null,"abstract":"The Breadth-First Search (BFS) algorithm serves as the foundation for many graph-processing applications and analytics workloads. While Graphics Processing Unit (GPU) offers massive parallelism, achieving high-performance BFS on GPUs entails efficient scheduling of a large number of GPU threads and effective utilization of GPU memory hierarchy. In this paper, we present Enterprise, a new GPU-based BFS system that combines three techniques to remove potential performance bottlenecks: (1) streamlined GPU threads scheduling through constructing a frontier queue without contention from concurrent threads, yet containing no duplicated frontiers and optimized for both top-down and bottom-up BFS. (2) GPU workload balancing that classifies the frontiers based on different out-degrees to utilize the full spectrum of GPU parallel granularity, which significantly increases thread-level parallelism; and (3) GPU based BFS direction optimization quantifies the effect of hub vertices on direction-switching and selectively caches a small set of critical hub vertices in the limited GPU shared memory to reduce expensive random data accesses. We have evaluated Enterprise on a large variety of graphs with different GPU devices. Enterprise achieves up to 76 billion traversed edges per second (TEPS) on a single NVIDIA Kepler K40, and up to 122 billion TEPS on two GPUs that ranks No. 45 in the Graph 500 on November 2014. Enterprise is also very energy-efficient as No. 1 in the GreenGraph 500 (small data category), delivering 446 million TEPS per watt.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132538228","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 152
CilkSpec: optimistic concurrency for Cilk
Shaizeen Aga, S. Krishnamoorthy, S. Narayanasamy
Recursive parallel programming models such as Cilk strive to simplify the task of parallel programming by enabling a simple divide-and-conquer programming model. This model is effective at recursively partitioning work into smaller parts and combining their results. However, recursive work partitioning can impose additional concurrency constraints beyond those implied by the true dependencies in a program. In this paper, we present a speculation-based approach to alleviate the concurrency constraints imposed by such recursive parallel programs. We design a runtime infrastructure that supports speculative execution and a predictor that accurately learns and identifies opportunities to relax extraneous concurrency constraints. Experimental evaluation demonstrates that speculative relaxation of concurrency constraints can deliver gains of up to 1.6x on 30 cores over baseline Cilk.
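
For readers unfamiliar with the programming model, a minimal recursive divide-and-conquer reduction in the fork-join style that Cilk provides (sequential Python standing in for cilk_spawn/cilk_sync; this sketch only illustrates the recursion pattern whose implied join order CilkSpec speculatively relaxes, not the speculation mechanism itself, and the grain size is an arbitrary choice).

```python
def parallel_sum(xs, lo, hi, grain=1024):
    """Divide-and-conquer reduction in the Cilk style.

    In Cilk, the first recursive call would be spawned (cilk_spawn) and joined
    (cilk_sync) before the combine step; that join is a concurrency constraint
    beyond the true data dependence, since partial sums commute.
    """
    if hi - lo <= grain:
        return sum(xs[lo:hi])           # serial base case
    mid = (lo + hi) // 2
    left = parallel_sum(xs, lo, mid)    # cilk_spawn in Cilk
    right = parallel_sum(xs, mid, hi)   # runs concurrently with the spawned call
    return left + right                 # cilk_sync, then combine

print(parallel_sum(list(range(1_000_000)), 0, 1_000_000))
```
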
{"title":"CilkSpec: optimistic concurrency for Cilk","authors":"Shaizeen Aga, S. Krishnamoorthy, S. Narayanasamy","doi":"10.1145/2807591.2807597","DOIUrl":"https://doi.org/10.1145/2807591.2807597","url":null,"abstract":"Recursive parallel programming models such as Cilk strive to simplify the task of parallel programming by enabling a simple divide-and-conquer programming model. This model is effective in recursively partitioning work into smaller parts and combining their results. However, recursive work partitioning can impose additional constraints on concurrency than is implied by the true dependencies in a program. In this paper, we present a speculation-based approach to alleviate the concurrency constraints imposed by such recursive parallel programs. We design a runtime infrastructure that supports speculative execution and a predictor to accurately learn and identify opportunities to relax extraneous concurrency constraints. Experimental evaluation demonstrates that speculative relaxation of concurrency constraints can deliver gains of up to 1.6x on 30 cores over baseline Cilk.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"12 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116354961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 2
HydraDB: a resilient RDMA-driven key-value middleware for in-memory cluster computing
Yandong Wang, Li Zhang, Jian Tan, Min Li, Yuqing Gao, X. Guerin, Xiaoqiao Meng, S. Meng
In this paper, we describe our experiences and lessons learned from building a general-purpose in-memory key-value middleware called HydraDB. HydraDB synthesizes a collection of state-of-the-art techniques, including continuous fault tolerance, Remote Direct Memory Access (RDMA), and awareness of multicore systems, to deliver a high-throughput, low-latency access service in a reliable manner for cluster computing applications. The uniqueness of HydraDB mainly lies in its design commitment to fully exploit the RDMA protocol to comprehensively optimize various aspects of a general-purpose key-value store, including latency-critical operations, read enhancement, and data replication for high-availability service. At the same time, HydraDB strives to efficiently utilize multicore systems to prevent data manipulation on the servers from curbing the potential of RDMA. Many teams in our organization have adopted HydraDB to improve the execution of their cluster computing frameworks, including Hadoop, Spark, Sensemaking analytics, and Call Record Processing. In addition, our performance evaluation with a variety of YCSB workloads shows that HydraDB can substantially outperform several existing in-memory key-value stores by an order of magnitude. Our detailed performance evaluation further corroborates our design choices.
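
A toy sketch of the replicated put/get path of a sharded in-memory key-value store (plain Python dicts stand in for the RDMA-accessible server shards; this only illustrates the primary-plus-replica write pattern behind the high-availability claim, not HydraDB's RDMA protocol or its APIs, and the class and method names are invented for illustration).

```python
class ToyReplicatedKV:
    """Minimal sharded key-value store with one synchronous replica per shard."""

    def __init__(self, num_shards=4):
        self.primaries = [dict() for _ in range(num_shards)]
        self.replicas = [dict() for _ in range(num_shards)]

    def _shard(self, key):
        return hash(key) % len(self.primaries)

    def put(self, key, value):
        s = self._shard(key)
        self.primaries[s][key] = value   # write to the primary shard...
        self.replicas[s][key] = value    # ...and synchronously to its replica

    def get(self, key):
        s = self._shard(key)
        # Fall back to the replica if the key is missing from the primary shard.
        return self.primaries[s].get(key, self.replicas[s].get(key))

kv = ToyReplicatedKV()
kv.put("user:42", {"name": "alice"})
print(kv.get("user:42"))
```
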
{"title":"HydraDB: a resilient RDMA-driven key-value middleware for in-memory cluster computing","authors":"Yandong Wang, Li Zhang, Jian Tan, Min Li, Yuqing Gao, X. Guerin, Xiaoqiao Meng, S. Meng","doi":"10.1145/2807591.2807614","DOIUrl":"https://doi.org/10.1145/2807591.2807614","url":null,"abstract":"In this paper, we describe our experiences and lessons learned from building a general-purpose in-memory key-value middleware, called HydraDB. HydraDB synthesizes a collection of state-of-the-art techniques, including continuous fault-tolerance, Remote Direct Memory Access (RDMA), as well as awareness for multicore systems, etc, to deliver a high-throughput, low-latency access service in a reliable manner for cluster computing applications. The uniqueness of HydraDB mainly lies in its design commitment to fully exploit the RDMA protocol to comprehensively optimize various aspects of a general-purpose key-value store, including latency-critical operations, read enhancement, and data replications for high-availability service, etc. At the same time, HydraDB strives to efficiently utilize multicore systems to prevent data manipulation on the servers from curbing the potential of RDMA. Many teams in our organization have adopted HydraDB to improve the execution of their cluster computing frameworks, including Hadoop, Spark, Sensemaking analytics, and Call Record Processing. In addition, our performance evaluation with a variety of YCSB workloads also shows that HydraDB can substantially outperform several existing in-memory key-value stores by an order of magnitude. Our detailed performance evaluation further corroborates our design choices.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130958396","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 59
Performance optimization for the k-nearest neighbors kernel on x86 architectures
Chenhan D. Yu, Jianyu Huang, W. Austin, Bo Xiao, G. Biros
Nearest neighbor search is a cornerstone problem in computational geometry, non-parametric statistics, and machine learning. For N points, exhaustive search requires quadratic work, but many fast algorithms reduce the complexity for exact and approximate searches. The common kernel (kNN kernel) in all these algorithms solves many small-size problems exactly using exhaustive search. We propose an efficient implementation and performance analysis for the kNN kernel on x86 architectures. By fusing the distance calculation with the neighbor selection, we are able to utilize memory throughput. We present an analysis of the algorithm and explain parameter selection. We perform an experimental study varying the size of the problem, the dimension of the dataset, and the number of nearest neighbors. Overall we observe significant speedups. For example, when searching for 16 neighbors in a point dataset with 1.6 million points in 64 dimensions, our kernel is over 4 times faster than existing methods.
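
The fused distance-plus-selection idea can be sketched on a single node with numpy (a minimal illustration of the GEMM-friendly expansion ||x - q||² = ||x||² - 2 x·q + ||q||² followed by a partial selection of the k smallest entries; the blocking, vectorization, and x86-specific tuning from the paper are not modeled).

```python
import numpy as np

def knn_kernel(X, Q, k):
    """Exact k nearest neighbors of each query row in Q among the rows of X.

    Distances are computed through a matrix multiply (the GEMM-friendly
    expansion of the squared Euclidean distance), then only the k smallest
    entries per query are selected, never a full sort.
    """
    x_sq = np.sum(X * X, axis=1)                    # ||x||^2 for every reference point
    q_sq = np.sum(Q * Q, axis=1)                    # ||q||^2 for every query
    d2 = q_sq[:, None] - 2.0 * Q @ X.T + x_sq[None, :]
    idx = np.argpartition(d2, k - 1, axis=1)[:, :k]          # k smallest, unordered
    order = np.argsort(np.take_along_axis(d2, idx, axis=1), axis=1)
    return np.take_along_axis(idx, order, axis=1)             # sorted by distance

# Toy usage: 16 neighbors for 100 queries against 10,000 points in 64 dimensions.
rng = np.random.default_rng(0)
X = rng.standard_normal((10_000, 64))
Q = rng.standard_normal((100, 64))
print(knn_kernel(X, Q, k=16).shape)   # (100, 16)
```
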
{"title":"Performance optimization for the k-nearest neighbors kernel on x86 architectures","authors":"Chenhan D. Yu, Jianyu Huang, W. Austin, Bo Xiao, G. Biros","doi":"10.1145/2807591.2807601","DOIUrl":"https://doi.org/10.1145/2807591.2807601","url":null,"abstract":"Nearest neighbor search is a cornerstone problem in computational geometry, non-parametric statistics, and machine learning. For N points, exhaustive search requires quadratic work, but many fast algorithms reduce the complexity for exact and approximate searches. The common kernel (kNN kernel) in all these algorithms solves many small-size problems exactly using exhaustive search. We propose an efficient implementation and performance analysis for the kNN kernel on x86 architectures. By fusing the distance calculation with the neighbor selection, we are able to utilize memory throughput. We present an analysis of the algorithm and explain parameter selection. We perform an experimental study varying the size of the problem, the dimension of the dataset, and the number of nearest neighbors. Overall we observe significant speedups. For example, when searching for 16 neighbors in a point dataset with 1.6 million points in 64 dimensions, our kernel is over 4 times faster than existing methods.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"242 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131995593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 34
An input-adaptive and in-place approach to dense tensor-times-matrix multiply
Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, R. Vuduc
This paper describes a novel framework, called InTensLi ("intensely"), for producing fast single-node implementations of dense tensor-times-matrix multiply (TTM) of arbitrary dimension. Whereas conventional implementations of TTM rely on explicitly converting the input tensor operand into a matrix, in order to be able to use any available and fast general matrix-matrix multiply (GEMM) implementation, our framework's strategy is to carry out the TTM in-place, avoiding this copy. As the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the TTM's inputs. When compared to widely used single-node TTM implementations that are available in the Tensor Toolbox and Cyclops Tensor Framework (CTF), InTensLi's in-place and input-adaptive TTM implementations achieve 4× and 13× speedups, showing GEMM-like performance on a variety of input sizes.
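
The difference between the conventional unfold-then-GEMM approach and a direct contraction can be sketched with numpy (a hedged illustration: `np.tensordot` stands in for the in-place strategy only in the sense that the caller never forms the matricized tensor; InTensLi's actual loop structure and its empirical tuning model are not reproduced).

```python
import numpy as np

def ttm_unfold(T, M, mode):
    """Conventional mode-`mode` tensor-times-matrix: matricize, GEMM, fold back."""
    Tm = np.moveaxis(T, mode, 0)                    # bring the contracted mode to the front
    unfolded = Tm.reshape(T.shape[mode], -1)        # explicit matricization (a copy in general)
    Ym = M @ unfolded                               # the GEMM call
    Y = Ym.reshape((M.shape[0],) + Tm.shape[1:])    # fold the result back into a tensor
    return np.moveaxis(Y, 0, mode)

def ttm_direct(T, M, mode):
    """Contract directly against the chosen mode without an explicit unfolding step."""
    return np.moveaxis(np.tensordot(M, T, axes=(1, mode)), 0, mode)

# Toy usage: multiply a 30x40x50 tensor by a 20x40 matrix along mode 1.
T = np.random.rand(30, 40, 50)
M = np.random.rand(20, 40)
print(np.allclose(ttm_unfold(T, M, 1), ttm_direct(T, M, 1)))   # True
```
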
{"title":"An input-adaptive and in-place approach to dense tensor-times-matrix multiply","authors":"Jiajia Li, Casey Battaglino, Ioakeim Perros, Jimeng Sun, R. Vuduc","doi":"10.1145/2807591.2807671","DOIUrl":"https://doi.org/10.1145/2807591.2807671","url":null,"abstract":"This paper describes a novel framework, called I<scp>n</scp>T<scp>ens</scp>L<scp>i</scp> (\"intensely\"), for producing fast single-node implementations of dense tensor-times-matrix multiply (T<scp>tm</scp>) of arbitrary dimension. Whereas conventional implementations of T<scp>tm</scp> rely on explicitly converting the input tensor operand into a matrix---in order to be able to use any available and fast general matrix-matrix multiply (G<scp>emm</scp>) implementation---our framework's strategy is to carry out the T<scp>tm</scp> <i>in-place</i>, avoiding this copy. As the resulting implementations expose tuning parameters, this paper also describes a heuristic empirical model for selecting an optimal configuration based on the T<scp>tm</scp>'s inputs. When compared to widely used single-node T<scp>tm</scp> implementations that are available in the Tensor Toolbox and Cyclops Tensor Framework (C<scp>tf</scp>), In-TensLi's in-place and input-adaptive T<scp>tm</scp> implementations achieve 4× and 13× speedups, showing Gemm-like performance on a variety of input sizes.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"15 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114287709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
Bridging OpenCL and CUDA: a comparative analysis and translation
Junghyun Kim, Thanh Tuan Dao, Jaehoon Jung, Jinyoung Joo, Jaejin Lee
Heterogeneous systems are widening their user-base, and heterogeneous computing is becoming popular in supercomputing. Among others, OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and they have almost the same platform model, they are not compatible with each other. In this paper, we present similarities and differences between them and propose an automatic translation framework for both OpenCL to CUDA and CUDA to OpenCL. We describe features that make it difficult to translate from one to the other and provide our solution. We show that our translator achieves comparable performance between the original and target applications in both directions. Since each programming model separately has a wide user-base and large code-base, our translation framework is useful to extend the code-base for each programming model and unifies the efforts to develop applications for heterogeneous systems.
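
Much of the incompatibility comes down to systematic naming and indexing differences; a toy table-driven substitution illustrates the kind of one-to-one mappings such a translator must formalize (a deliberately naive sketch for the 1-D case, not the paper's framework, which also handles the harder, non-mechanical differences it describes).

```python
import re

# Well-known 1-D correspondences between OpenCL and CUDA kernel code.
OPENCL_TO_CUDA = {
    r"__kernel":                       "__global__",
    r"__global ":                      "",                 # CUDA global pointers carry no qualifier
    r"__local":                        "__shared__",
    r"get_global_id\(0\)":             "(blockIdx.x * blockDim.x + threadIdx.x)",
    r"get_local_id\(0\)":              "threadIdx.x",
    r"get_group_id\(0\)":              "blockIdx.x",
    r"get_local_size\(0\)":            "blockDim.x",
    r"barrier\(CLK_LOCAL_MEM_FENCE\)": "__syncthreads()",
}

def opencl_to_cuda(src):
    """Naive textual translation of a 1-D OpenCL kernel body to CUDA."""
    for pattern, replacement in OPENCL_TO_CUDA.items():
        src = re.sub(pattern, replacement, src)
    return src

opencl_src = """__kernel void scale(__global float* x, float a) {
    int i = get_global_id(0);
    x[i] = a * x[i];
}"""
print(opencl_to_cuda(opencl_src))
```
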
{"title":"Bridging OpenCL and CUDA: a comparative analysis and translation","authors":"Junghyun Kim, Thanh Tuan Dao, Jaehoon Jung, Jinyoung Joo, Jaejin Lee","doi":"10.1145/2807591.2807621","DOIUrl":"https://doi.org/10.1145/2807591.2807621","url":null,"abstract":"Heterogeneous systems are widening their user-base, and heterogeneous computing is becoming popular in supercomputing. Among others, OpenCL and CUDA are the most popular programming models for heterogeneous systems. Although OpenCL inherited many features from CUDA and they have almost the same platform model, they are not compatible with each other. In this paper, we present similarities and differences between them and propose an automatic translation framework for both OpenCL to CUDA and CUDA to OpenCL. We describe features that make it difficult to translate from one to the other and provide our solution. We show that our translator achieves comparable performance between the original and target applications in both directions. Since each programming model separately has a wide user-base and large code-base, our translation framework is useful to extend the code-base for each programming model and unifies the efforts to develop applications for heterogeneous systems.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123249691","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 22
Memory access patterns: the missing piece of the multi-GPU puzzle
Tal Ben-Nun, Ely Levy, A. Barak, Erik Rubin
With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to efficiently run on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that the performance of MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as real-world applications in deep learning and multivariate analysis.
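
The role of the access pattern can be illustrated with a host-side sketch of a 1-D stencil partitioned across devices (plain numpy arrays stand in for per-GPU buffers; the stencil radius is the "memory access pattern" that determines how wide the exchanged halos must be, a simplified stand-in for MAPS-Multi's inference rather than its API, with invented function and parameter names).

```python
import numpy as np

def multi_device_stencil(x, num_devices=4, radius=1, steps=10):
    """Averaging stencil on a 1-D array, partitioned contiguously across `num_devices`.

    The halo width equals the stencil radius: the kernel's access pattern dictates
    exactly how much boundary data neighboring partitions must exchange per step.
    """
    parts = np.array_split(x, num_devices)            # contiguous per-device partitions
    for _ in range(steps):
        padded = []
        for d, p in enumerate(parts):
            # Halo exchange: pull `radius` elements from each neighbor (clamped at the ends).
            left = parts[d - 1][-radius:] if d > 0 else p[:radius]
            right = parts[d + 1][:radius] if d < num_devices - 1 else p[-radius:]
            padded.append(np.concatenate([left, p, right]))
        new_parts = []
        for buf in padded:
            # The compute kernel: average over a window of width 2*radius + 1.
            window = np.stack([buf[i : len(buf) - 2 * radius + i] for i in range(2 * radius + 1)])
            new_parts.append(window.mean(axis=0))
        parts = new_parts
    return np.concatenate(parts)

print(multi_device_stencil(np.arange(32, dtype=float)).shape)   # (32,)
```
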
{"title":"Memory access patterns: the missing piece of the multi-GPU puzzle","authors":"Tal Ben-Nun, Ely Levy, A. Barak, Erik Rubin","doi":"10.1145/2807591.2807611","DOIUrl":"https://doi.org/10.1145/2807591.2807611","url":null,"abstract":"With the increased popularity of multi-GPU nodes in modern HPC clusters, it is imperative to develop matching programming paradigms for their efficient utilization. In order to take advantage of the local GPUs and the low-latency high-throughput interconnects that link them, programmers need to meticulously adapt parallel applications with respect to load balancing, boundary conditions and device synchronization. This paper presents MAPS-Multi, an automatic multi-GPU partitioning framework that distributes the workload based on the underlying memory access patterns. The framework consists of host- and device-level APIs that allow programs to efficiently run on a variety of GPU and multi-GPU architectures. The framework implements several layers of code optimization, device abstraction, and automatic inference of inter-GPU memory exchanges. The paper demonstrates that the performance of MAPS-Multi achieves near-linear scaling on fundamental computational operations, as well as real-world applications in deep learning and multivariate analysis.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"03 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127182183","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 48
The in-silico lab-on-a-chip: petascale and high-throughput simulations of microfluidics at cell resolution
D. Rossinelli, Yu-Hang Tang, K. Lykov, D. Alexeev, M. Bernaschi, P. Hadjidoukas, M. Bisson, W. Joubert, Christian Conti, G. Karniadakis, M. Fatica, I. Pivkin, P. Koumoutsakos
We present simulations of blood and cancer cell separation in complex microfluidic channels with subcellular resolution, demonstrating unprecedented time to solution, performing at 65.5% of the available 39.4 PetaInstructions/s in the 18,688 nodes of the Titan supercomputer. These simulations outperform by one to three orders of magnitude the current state of the art in terms of numbers of simulated cells and computational elements. The computational setup emulates the conditions and the geometric complexity of microfluidic experiments and our results reproduce the experimental findings. These simulations provide sub-micron resolution while accessing time scales relevant to engineering designs. We demonstrate an improvement of up to 45X over competing state-of-the-art solvers, thus establishing the frontiers of simulations by particle based methods. Our simulations redefine the role of computational science for the development of microfluidics -- a technology that is becoming as important to medicine as integrated circuits have been to computers.
{"title":"The in-silico lab-on-a-chip: petascale and high-throughput simulations of microfluidics at cell resolution","authors":"D. Rossinelli, Yu-Hang Tang, K. Lykov, D. Alexeev, M. Bernaschi, P. Hadjidoukas, M. Bisson, W. Joubert, Christian Conti, G. Karniadakis, M. Fatica, I. Pivkin, P. Koumoutsakos","doi":"10.1145/2807591.2807677","DOIUrl":"https://doi.org/10.1145/2807591.2807677","url":null,"abstract":"We present simulations of blood and cancer cell separation in complex microfluidic channels with subcellular resolution, demonstrating unprecedented time to solution, performing at 65.5% of the available 39.4 PetaInstructions/s in the 18, 688 nodes of the Titan supercomputer. These simulations outperform by one to three orders of magnitude the current state of the art in terms of numbers of simulated cells and computational elements. The computational setup emulates the conditions and the geometric complexity of microfluidic experiments and our results reproduce the experimental findings. These simulations provide sub-micron resolution while accessing time scales relevant to engineering designs. We demonstrate an improvement of up to 45X over competing state-of-the-art solvers, thus establishing the frontiers of simulations by particle based methods. Our simulations redefine the role of computational science for the development of microfluidics -- a technology that is becoming as important to medicine as integrated circuits have been to computers.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"2013 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127385313","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 31
Fault tolerant MapReduce-MPI for HPC clusters
Yanfei Guo, Wesley Bland, P. Balaji, Xiaobo Zhou
Building MapReduce applications using the Message-Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, due to the lack of native fault-tolerance support in MPI and the incompatibility between the MapReduce fault-tolerance model and HPC schedulers, it is very hard to provide a fault-tolerant MapReduce runtime for HPC clusters. We propose and develop FT-MRMPI, the first fault-tolerant MapReduce framework on MPI for HPC clusters. We discover a unique way to perform failure detection and recovery by exploiting the current MPI semantics and the new proposal of user-level failure mitigation. We design and develop the checkpoint/restart model for fault-tolerant MapReduce in MPI. We further tailor the detect/resume model to conserve work for more efficient fault tolerance. The experimental results on a 256-node HPC cluster show that FT-MRMPI effectively masks failures and reduces the job completion time by 39%.
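
The checkpoint/restart idea can be sketched outside of MPI with a per-task checkpoint directory (a minimal, hedged illustration: the word-count map function, the `./ckpt` directory, and the pickle format are illustrative choices, not FT-MRMPI's on-disk layout or its MPI failure-detection path).

```python
import os
import pickle

CKPT_DIR = "./ckpt"   # illustrative checkpoint location, not FT-MRMPI's layout

def map_word_count(record):
    """The map task: count words in one input record."""
    counts = {}
    for word in record.split():
        counts[word] = counts.get(word, 0) + 1
    return counts

def run_map_phase(records):
    """Run map tasks, checkpointing each result; on restart, finished tasks are skipped."""
    os.makedirs(CKPT_DIR, exist_ok=True)
    results = []
    for task_id, record in enumerate(records):
        ckpt = os.path.join(CKPT_DIR, f"map_{task_id}.pkl")
        if os.path.exists(ckpt):                      # task already completed before a failure
            with open(ckpt, "rb") as f:
                results.append(pickle.load(f))
            continue
        out = map_word_count(record)                  # do the actual map work
        with open(ckpt, "wb") as f:                   # checkpoint before reporting completion
            pickle.dump(out, f)
        results.append(out)
    return results

print(run_map_phase(["the quick brown fox", "the lazy dog", "the end"]))
```
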
{"title":"Fault tolerant MapReduce-MPI for HPC clusters","authors":"Yanfei Guo, Wesley Bland, P. Balaji, Xiaobo Zhou","doi":"10.1145/2807591.2807617","DOIUrl":"https://doi.org/10.1145/2807591.2807617","url":null,"abstract":"Building MapReduce applications using the Message-Passing Interface (MPI) enables us to exploit the performance of large HPC clusters for big data analytics. However, due to the lacking of native fault tolerance support in MPI and the incompatibility between the MapReduce fault tolerance model and HPC schedulers, it is very hard to provide a fault tolerant MapReduce runtime for HPC clusters. We propose and develop FT-MRMPI, the first fault tolerant MapReduce framework on MPI for HPC clusters. We discover a unique way to perform failure detection and recovery by exploiting the current MPI semantics and the new proposal of user-level failure mitigation. We design and develop the checkpoint/restart model for fault tolerant MapReduce in MPI. We further tailor the detect/resume model to conserve work for more efficient fault tolerance. The experimental results on a 256-node HPC cluster show that FT-MRMPI effectively masks failures and reduces the job completion time by 39%.","PeriodicalId":117494,"journal":{"name":"SC15: International Conference for High Performance Computing, Networking, Storage and Analysis","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2015-11-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116441477","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 26