
ACM Transactions on Parallel Computing: Latest Publications

Bandwidth-Optimal Random Shuffling for GPUs
IF 1.6 | Q2 Computer Science | Pub Date: 2021-06-11 | DOI: 10.1145/3505287
Rory Mitchell, Daniel Stokes, E. Frank, G. Holmes
Linear-time algorithms that are traditionally used to shuffle data on CPUs, such as the method of Fisher-Yates, are not well suited to implementation on GPUs due to inherent sequential dependencies, and existing parallel shuffling algorithms are unsuitable for GPU architectures because they incur a large number of read/write operations to high-latency global memory. To address this, we provide a method of generating pseudo-random permutations in parallel by fusing suitable pseudo-random bijective functions with stream compaction operations. Our algorithm, termed “bijective shuffle,” trades increased per-thread arithmetic operations for reduced global memory transactions. It is work-efficient, deterministic, and only requires a single global memory read and write per shuffle input, thus maximising use of global memory bandwidth. To empirically demonstrate the correctness of the algorithm, we develop a statistical test for the quality of pseudo-random permutations based on kernel space embeddings. Experimental results show that the bijective shuffle algorithm outperforms competing algorithms on GPUs, showing improvements of between one and two orders of magnitude and approaching peak device bandwidth.
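The fusion of a pseudo-random bijection with stream compaction can be illustrated with a short sequential sketch. This is not the paper's GPU implementation: the choice of a Feistel network as the bijective function, the keyed tuple hash as its round function, and all names and parameters here are illustrative assumptions.

```python
import random

def feistel_bijection(x, round_keys, half_bits):
    # A Feistel network is a bijection on [0, 2^(2*half_bits)) for any
    # round function; here the round function is a keyed tuple hash.
    mask = (1 << half_bits) - 1
    left, right = x >> half_bits, x & mask
    for k in round_keys:
        left, right = right, left ^ (hash((right, k)) & mask)
    return (left << half_bits) | right

def bijective_shuffle(items, seed=0):
    # 1) Choose a pseudo-random bijection on a power-of-two domain >= n.
    # 2) Stream-compact: keep, in order, only the images that fall in
    #    [0, n).  Both steps parallelize on a GPU; this sketch runs them
    #    sequentially.
    n = len(items)
    if n == 0:
        return []
    half_bits = max(1, (max(n - 1, 1).bit_length() + 1) // 2)
    rng = random.Random(seed)
    keys = [rng.getrandbits(32) for _ in range(4)]
    images = (feistel_bijection(x, keys, half_bits)
              for x in range(1 << (2 * half_bits)))
    return [items[y] for y in images if y < n]
```

Because a Feistel network is invertible regardless of its round function, every index in [0, n) appears exactly once among the compacted images, so the output is a genuine permutation of the input.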
Citations: 3
Engineering In-place (Shared-memory) Sorting Algorithms
IF 1.6 | Q2 Computer Science | Pub Date: 2020-09-28 | DOI: 10.1145/3505286
Michael Axtmann, Sascha Witt, Daniel Ferizovic, P. Sanders
We present new sequential and parallel sorting algorithms that now represent the fastest known techniques for a wide range of input sizes, input distributions, data types, and machines. Somewhat surprisingly, part of the speed advantage is due to the additional feature of the algorithms to work in-place, i.e., they do not need a significant amount of space beyond the input array. Previously, the in-place feature often implied performance penalties. Our main algorithmic contribution is a blockwise approach to in-place data distribution that is provably cache-efficient. We also parallelize this approach taking dynamic load balancing and memory locality into account. Our new comparison-based algorithm, In-place Parallel Super Scalar Samplesort (IPS4o), combines this technique with branchless decision trees. By taking cases with many equal elements into account and by adapting the distribution degree dynamically, we obtain a highly robust algorithm that outperforms the best previous in-place parallel comparison-based sorting algorithms by almost a factor of three. That algorithm also outperforms the best comparison-based competitors regardless of whether we consider in-place or not in-place, parallel or sequential settings. Another surprising result is that IPS4o even outperforms the best (in-place or not in-place) integer sorting algorithms in a wide range of situations. In many of the remaining cases (often involving near-uniform input distributions, small keys, or a sequential setting), our new In-place Parallel Super Scalar Radix Sort (IPS2Ra) turns out to be the best algorithm. Claims to have the – in some sense – “best” sorting algorithm can be found in many papers, which cannot all be true.
Therefore, we base our conclusions on an extensive experimental study involving a large part of the cross product of 21 state-of-the-art sorting codes, 6 data types, 10 input distributions, 4 machines, 4 memory allocation strategies, and input sizes varying over 7 orders of magnitude. This confirms the claims made about the robust performance of our algorithms while revealing major performance problems in many competitors outside the concrete set of measurements reported in the associated publications. This is particularly true for integer sorting algorithms giving one reason to prefer comparison-based algorithms for robust general-purpose sorting.
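The distribution step at the heart of samplesort can be sketched out-of-place in a few lines: pick splitters from a random sample, then classify each element with a balanced binary search over the splitters, which plays the role of IPS4o's branchless decision tree. This is a toy illustration, not the paper's in-place blockwise algorithm; the oversampling factor, bucket count, and equal-key guard are illustrative choices.

```python
import bisect
import random

def samplesort(data, k=4, oversample=8, seed=0):
    # Recursive out-of-place samplesort sketch.
    if len(data) <= 16:
        return sorted(data)
    rng = random.Random(seed)
    # Choose k-1 splitters from an oversampled random sample.
    sample = sorted(rng.choices(data, k=(k - 1) * oversample))
    splitters = [sample[i * oversample + oversample // 2]
                 for i in range(k - 1)]
    # Classify every element into one of k buckets; bisect over the
    # sorted splitters stands in for the branchless decision tree.
    buckets = [[] for _ in range(k)]
    for x in data:
        buckets[bisect.bisect_left(splitters, x)].append(x)
    # Guard against degenerate splitters (e.g. all keys equal), which
    # would otherwise leave the input unsplit and recurse forever.
    if max(len(b) for b in buckets) == len(data):
        return sorted(data)
    return [x for b in buckets for x in samplesort(b, k, oversample, seed)]
```

The guard hints at why the paper's explicit handling of many equal elements matters: without it, a sample-based partitioner degenerates on low-entropy keys.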
Citations: 16
Groute: Asynchronous Multi-GPU Programming Model with Applications to Large-scale Graph Processing
IF 1.6 | Q2 Computer Science | Pub Date: 2020-08-01 | DOI: 10.1145/3399730
Tal Ben-Nun, M. Sutton, Sreepathi Pai, K. Pingali
(Fig. 1: Multi-GPU node schematics.) Multi-GPU nodes connect their GPUs via a low-latency, high-throughput bus (see Figure 1). These interconnects allow parallel applications to exchange data efficiently and to take advantage of the combined computational power and memory size of the GPUs, but may vary substantially between node types. Multi-GPU nodes are usually programmed using one of two methods. In the simple approach, each GPU is managed separately, using one process per device [19, 26]. Alternatively, a Bulk Synchronous Parallel (BSP) [42] programming model is used, in which applications are executed in rounds, and each round consists of local computation followed by global communication [6, 33]. The first approach is subject to overhead from various sources, such as the operating system, and requires a message-passing interface for communication. The BSP model, however, can introduce unnecessary serialization at the global barriers that implement round-based execution. Both programming methods may result in under-utilization of multi-GPU platforms, particularly for irregular applications, which may suffer from load imbalance and may have unpredictable communication patterns. In principle, asynchronous programming models can reduce some of those problems, because unlike in round-based communication, processors can compute and communicate autonomously without waiting for other processors to reach global barriers.
However, there are few applications that exploit asynchronous execution, since their development requires an in-depth knowledge of the underlying architecture and communication network and involves performing intricate adaptations to the code. This article presents Groute, an asynchronous programming model and runtime environment [2] that can be used to develop a wide range of applications on multi-GPU systems. Based on concepts from low-level networking, Groute aims to overcome the programming complexity of asynchronous applications on multi-GPU and heterogeneous platforms. The communication constructs of Groute are simple, but they can be used to efficiently express programs that range from regular applications and BSP applications to nontrivial irregular algorithms. The asynchronous nature of the runtime environment also promotes load balancing, leading to better utilization of heterogeneous multi-GPU nodes. This article is an extended version of previously published work [7], where we explain the concepts in greater detail, consider newer multi-GPU topologies, and elaborate on the evaluated algorithms, as well as scalability considerations.
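The serialization cost that the abstract attributes to global barriers can be made concrete with a toy cost model. The workload matrix and both makespan formulas below are illustrative assumptions, not taken from the article; the model also ignores data dependencies and communication cost.

```python
def bsp_makespan(work):
    # work[p][r] = time processor p spends computing in round r.
    # A global barrier closes every round, so each round costs as
    # much as its slowest participant.
    return sum(max(times) for times in zip(*work))

def async_makespan(work):
    # Idealized asynchronous execution: processors never wait at a
    # barrier, so each one simply runs through its own total workload.
    return max(sum(times) for times in work)
```

For `work = [[3, 1], [1, 3]]` the barriered execution costs 3 + 3 = 6 time units, while the idealized asynchronous one finishes in 4, because each processor's light round overlaps the other's heavy round.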
Citations: 7
FEAST: A Lightweight Lock-free Concurrent Binary Search Tree
IF 1.6 | Q2 Computer Science | Pub Date: 2020-05-31 | DOI: 10.1145/3391438
Aravind Natarajan, Arunmoezhi Ramachandran, N. Mittal
We present a lock-free algorithm for concurrent manipulation of a binary search tree (BST) in an asynchronous shared memory system that supports search, insert, and delete operations. In addition to read and write instructions, our algorithm uses (single-word) compare-and-swap (CAS) and bit-test-and-set (BTS) read-modify-write (RMW) instructions, both of which are commonly supported by many modern processors including Intel 64 and AMD64. In contrast to most of the existing concurrent algorithms for a binary search tree, our algorithm is edge-based rather than node-based. When compared to other concurrent algorithms for a binary search tree, modify (insert and delete) operations in our algorithm (a) work on a smaller section of the tree, (b) execute fewer RMW instructions, or (c) use fewer dynamically allocated objects. In our experiments, our lock-free algorithm significantly outperformed all other algorithms for a concurrent binary search tree, especially when contention was high. We also describe modifications to our basic lock-free algorithm so that the amortized complexity of any operation in the modified algorithm can be bounded by the sum of the tree height and the point contention to within a constant factor while preserving the other desirable features of our algorithm.
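The retry pattern behind a CAS-based, edge-targeted insert can be sketched as follows. This is a minimal illustration only: a lock stands in for the hardware single-word CAS, and the sketch omits delete, search, and the BTS-based marking that the actual FEAST algorithm uses.

```python
import threading

class Ref:
    # One mutable cell with a compare-and-swap primitive.  The lock
    # simulates the atomic CAS instruction; real lock-free code would
    # issue the hardware CAS directly.
    def __init__(self, value=None):
        self._value = value
        self._lock = threading.Lock()

    def get(self):
        return self._value

    def cas(self, expected, new):
        with self._lock:
            if self._value is expected:
                self._value = new
                return True
            return False

class Node:
    def __init__(self, key):
        self.key = key
        self.left = Ref()   # edge-based: synchronization targets edges,
        self.right = Ref()  # not nodes

def insert(root, key):
    # Optimistic insert: walk to the null edge where the key belongs,
    # then try to CAS the new node onto that edge; if the CAS fails,
    # another thread claimed the edge first, so retry from the root.
    while True:
        node = root
        while True:
            if key == node.key:
                return False
            edge = node.left if key < node.key else node.right
            nxt = edge.get()
            if nxt is None:
                break
            node = nxt
        if edge.cas(None, Node(key)):
            return True

def inorder(node, out=None):
    if out is None:
        out = []
    if node is not None:
        inorder(node.left.get(), out)
        out.append(node.key)
        inorder(node.right.get(), out)
    return out
```

The key property illustrated is lock-free progress: a failed CAS means some other insert succeeded on that edge, so the system as a whole always advances.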
Citations: 6
CoREC: Scalable and Resilient In-memory Data Staging for In-situ Workflows
IF 1.6 | Q2 Computer Science | Pub Date: 2020-05-16 | DOI: 10.1145/3391448
Shaohua Duan, P. Subedi, Philip E. Davis, K. Teranishi, H. Kolla, Marc Gamell, M. Parashar
ACM Transactions on Parallel Computing, Vol. 7, No. 2, Article 12. Publication date: May 2020.
Citations: 5
ROC: A Reconfigurable Optical Computer for Simulating Physical Processes
IF 1.6 | Q2 Computer Science | Pub Date: 2020-04-02 | DOI: 10.1145/3380944
Jeff Anderson, Engin Kayraklioglu, Shuai Sun, Joseph Crandall, Y. Alkabani, Vikram K. Narayana, V. Sorger, T. El-Ghazawi
Due to the end of Moore’s law and Dennard scaling, we are entering a new era of processors. Computing systems are increasingly facing power and performance challenges due to both device- and circuit-related challenges with resistive and capacitive charging. Non-von Neumann architectures are needed to support future computations through innovative post-Moore’s law architectures. To enable these emerging architectures with high performance and at ultra-low power, both parallel computation and inter-node communication on-the-chip can be supported using photons. To this end, we introduce ROC, a reconfigurable optical computer that can solve partial differential equations (PDEs). PDE solvers form the basis for many traditional simulation problems in science and engineering that are currently performed on supercomputers. Instead of solving problems iteratively, the proposed engine uses a resistive mesh architecture to solve a PDE in a single iteration (one-shot). Instead of using actual electrical circuits, the physical underlying hardware emulates such structures using a silicon-photonics mesh that splits light into separate pathways, allowing it to add or subtract optical power analogous to programmable resistors. The time to obtain the PDE solution then only depends on the time-of-flight of a photon through the programmed mesh, which can be on the order of tens of picoseconds given the millimeter-compact integrated photonic circuit. Numerically validated experimental results show that, over multiple configurations, ROC can achieve several orders of magnitude improvement over state-of-the-art GPUs when speed, power, and size are taken into account. Further, it comes within approximately 90% precision of current numerical solvers. As such, ROC can be a viable reconfigurable, approximate computer with the potential for more precise results when replacing silicon-photonics building blocks with nanoscale photonic lumped-elements.
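The iterative-versus-one-shot contrast can be sketched in software: programming the resistive mesh corresponds to assembling the discrete Laplacian of the PDE, and reading the node voltages corresponds to a single direct linear solve instead of many relaxation sweeps. The grid size, boundary values, and pure-Python solver below are illustrative assumptions, not a model of the photonic hardware.

```python
def gauss_solve(A, b):
    # Plain Gaussian elimination with partial pivoting.
    n = len(b)
    M = [row[:] + [b[i]] for i, row in enumerate(A)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        x[r] = (M[r][n] - sum(M[r][k] * x[k]
                              for k in range(r + 1, n))) / M[r][r]
    return x

def laplace_one_shot(n, top=1.0):
    # "One-shot" solution of the discrete Laplace equation on an n x n
    # interior grid with Dirichlet boundaries (top edge held at `top`,
    # other edges at 0): assemble the 5-point-stencil system once and
    # solve it directly, rather than iterating to convergence.
    N = n * n
    A = [[0.0] * N for _ in range(N)]
    b = [0.0] * N
    idx = lambda i, j: i * n + j
    for i in range(n):
        for j in range(n):
            r = idx(i, j)
            A[r][r] = 4.0
            for di, dj in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                ni, nj = i + di, j + dj
                if 0 <= ni < n and 0 <= nj < n:
                    A[r][idx(ni, nj)] = -1.0
                elif ni < 0:       # neighbour lies on the top boundary
                    b[r] += top
    return gauss_solve(A, b)
```

Here the one-time cost of assembling and factoring the system is the software analogue of programming the mesh; afterwards, the answer is read out in a single step.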
Citations: 1
Scheduling Mutual Exclusion Accesses in Equal-Length Jobs
IF 1.6 Q2 Computer Science Pub Date : 2019-09-10 DOI: 10.1145/3342562
D. Kagaris, S. Dutta
A fundamental problem in parallel and distributed processing is the partial serialization that is imposed due to the need for mutually exclusive access to common resources. In this article, we investigate the problem of optimally scheduling (in terms of makespan) a set of jobs, where each job consists of the same number L of unit-duration tasks, and each task either accesses exclusively one resource from a given set of resources or accesses a fully shareable resource. We develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of jobs and for any number of resources for the case of L = 2. In the notation commonly used for job-shop scheduling problems, this result means that the problem J | d_ij = 1, n_j = 2 | C_max is polynomially solvable, adding to the polynomial solutions known for the problems J2 | n_j ≤ 2 | C_max and J2 | d_ij = 1 | C_max (whereas other closely related versions such as J2 | n_j ≤ 3 | C_max, J2 | d_ij ∈ {1,2} | C_max, J3 | n_j ≤ 2 | C_max, J3 | d_ij = 1 | C_max, and J | d_ij = 1, n_j ≤ 3 | C_max are all known to be NP-complete). For the general case L > 2 (i.e., for the job-shop problem J | d_ij = 1, n_j = L > 2 | C_max), we present a competitive heuristic and provide experimental comparisons with other heuristic versions and, when possible, with the ideal integer linear programming formulation.
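The job model can be made concrete with a toy unit-time simulator. The greedy list scheduler below only illustrates the problem setting (jobs of L sequential unit tasks, exclusive resources, one fully shareable resource); it is not the paper's optimal polynomial-time algorithm:

```python
def greedy_makespan(jobs, shareable="*"):
    """Greedy unit-time list scheduler: at each time step, run at most
    one pending task per exclusive resource (the shareable resource
    admits any number of tasks), respecting task order within each job.
    Illustrates the setting of J | d_ij = 1 | C_max; NOT the optimal
    algorithm from the paper."""
    progress = [0] * len(jobs)   # index of the next task in each job
    t = 0
    while any(progress[j] < len(jobs[j]) for j in range(len(jobs))):
        busy = set()             # exclusive resources claimed this step
        for j, job in enumerate(jobs):
            if progress[j] == len(job):
                continue
            r = job[progress[j]]
            if r == shareable:
                progress[j] += 1
            elif r not in busy:
                busy.add(r)
                progress[j] += 1
        t += 1
    return t

# Three jobs of L = 2 unit tasks over exclusive resources "a", "b"
# and the shareable resource "*".
print(greedy_makespan([("a", "b"), ("a", "*"), ("b", "a")]))  # prints 3
```

Here resource "a" is demanded by three tasks, so any schedule needs at least 3 steps; the greedy schedule happens to match that lower bound on this instance, but in general only the paper's algorithm guarantees optimality for L = 2.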
{"title":"Scheduling Mutual Exclusion Accesses in Equal-Length Jobs","authors":"D. Kagaris, S. Dutta","doi":"10.1145/3342562","DOIUrl":"https://doi.org/10.1145/3342562","url":null,"abstract":"A fundamental problem in parallel and distributed processing is the partial serialization that is imposed due to the need for mutually exclusive access to common resources. In this article, we investigate the problem of optimally scheduling (in terms of makespan) a set of jobs, where each job consists of the same number <i>L</i> of unit-duration tasks, and each task either accesses exclusively one resource from a given set of resources or accesses a fully shareable resource. We develop and establish the optimality of a fast polynomial-time algorithm to find a schedule with the shortest makespan for any number of jobs and for any number of resources for the case of <i>L</i> = 2. In the notation commonly used for job-shop scheduling problems, this result means that the problem <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> =2|<i>C</i><sub>max</sub> is polynomially solvable, adding to the polynomial solutions known for the problems <i>J</i>2 | <i>n</i><sub><i>j</i></sub> ≤ 2 | <i>C</i><sub>max</sub> and <i>J</i>2 | <i>d</i><sub><i>ij</i></sub> = 1 | <i>C</i><sub>max</sub> (whereas other closely related versions such as <i>J</i>2 | <i>n</i><sub><i>j</i></sub> ≤ 3 | <i>C</i><sub>max</sub>, <i>J</i>2 | <i>d</i><sub><i>ij</i></sub> ∈ { 1,2} | <i>C</i><sub>max</sub>, <i>J</i>3 | <i>n</i><sub><i>j</i></sub> ≤ 2 | <i>C</i><sub>max</sub>, <i>J</i>3 | <i>d</i><sub><i>ij</i></sub>=1 | <i>C</i><sub>max</sub>, and <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> ≤ 3| <i>C</i><sub>max</sub> are all known to be NP-complete). 
For the general case <i>L</i> > 2 (i.e., for the job-shop problem <i>J</i> |<i>d</i><sub><i>ij</i></sub>=1, <i>n</i><sub><i>j</i></sub> =<i>L</i>> 2| <i>C</i><sub>max</sub>), we present a competitive heuristic and provide experimental comparisons with other heuristic versions and, when possible, with the ideal integer linear programming formulation.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86220203","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 1
I/O Scheduling Strategy for Periodic Applications
IF 1.6 Q2 Computer Science Pub Date : 2019-09-10 DOI: 10.1145/3338510
G. Aupy, Ana Gainaru, Valentin Le Fèvre
With the ever-growing need for data in HPC applications, congestion at the I/O level becomes critical in supercomputers. Architectural enhancements such as burst buffers and pre-fetching have been added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads in the computation of applications. In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes once, during the job scheduling phase, a pattern that defines the I/O behavior for each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O node level and can be easily integrated into current job schedulers. We validate this model through extensive simulations and experiments on an HPC cluster by comparing it to state-of-the-art online solutions, showing that not only does our scheduler have the advantage of being decentralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions, improving the application dilation by up to 16% and the maximum system efficiency by up to 18%.
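A heavily simplified version of such a precomputed periodic pattern can be sketched as follows — assuming each application's I/O burst occupies the full shared bandwidth and must therefore get an exclusive, fixed window inside the period (the paper's actual pattern computation is more general):

```python
def schedule_io(period, io_times):
    """Assign each application a fixed, non-overlapping I/O window
    inside one scheduling period. io_times: per-application I/O
    duration per period. Returns a (start, end) window per
    application, or None if the windows do not fit in the period.
    A toy sketch of a periodic I/O pattern, not the paper's algorithm."""
    windows, t = [], 0
    for io in io_times:
        if t + io > period:
            return None          # pattern infeasible at this period
        windows.append((t, t + io))
        t += io
    return windows

# Three applications with I/O bursts of 3, 2, and 4 time units
# per period of length 10.
print(schedule_io(10, [3, 2, 4]))  # → [(0, 3), (3, 5), (5, 9)]
```

Once such a pattern is fixed at job-scheduling time, each application performs its I/O only inside its own window, so no online arbitration (and no extra congestion point) is needed at runtime — which is the core idea the abstract describes.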
{"title":"I/O Scheduling Strategy for Periodic Applications","authors":"G. Aupy, Ana Gainaru, Valentin Le Fèvre","doi":"10.1145/3338510","DOIUrl":"https://doi.org/10.1145/3338510","url":null,"abstract":"With the ever-growing need of data in HPC applications, the congestion at the I/O level becomes critical in supercomputers. Architectural enhancement such as burst buffers and pre-fetching are added to machines but are not sufficient to prevent congestion. Recent online I/O scheduling strategies have been put in place, but they add an additional congestion point and overheads in the computation of applications.\u0000 In this work, we show how to take advantage of the periodic nature of HPC applications to develop efficient periodic scheduling strategies for their I/O transfers. Our strategy computes once during the job scheduling phase a pattern that defines the I/O behavior for each application, after which the applications run independently, performing their I/O at the specified times. Our strategy limits the amount of congestion at the I/O node level and can be easily integrated into current job schedulers. 
We validate this model through extensive simulations and experiments on an HPC cluster by comparing it to state-of-the-art online solutions, showing that not only does our scheduler have the advantage of being de-centralized and thus overcoming the overhead of online schedulers, but also that it performs better than the other solutions, improving the application dilation up to 16% and the maximum system efficiency up to 18%.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84499785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
Modeling Universal Globally Adaptive Load-Balanced Routing
IF 1.6 Q2 Computer Science Pub Date : 2019-09-10 DOI: 10.1145/3349620
Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang
Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, multiple questions remain: For which traffic patterns is UGAL most suited? In addition, what determines the performance of the UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGAL based on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner working of UGAL and further our understanding of this type of routing.
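The per-packet decision that these throughput models capture is commonly stated as a comparison of estimated latencies (queue occupancy times hop count) between the minimal path and a randomly chosen non-minimal path. A sketch of that standard rule — the queue lengths, hop counts, and threshold here are illustrative, and this is not the paper's linear-programming model:

```python
def ugal_choice(q_min, hops_min, q_nonmin, hops_nonmin, threshold=0):
    """UGAL routing decision as commonly described: estimate the latency
    of each candidate path as queue occupancy x hop count, and take the
    minimal path unless the non-minimal one looks sufficiently cheaper
    (the threshold biases the choice toward minimal routing)."""
    if q_min * hops_min <= q_nonmin * hops_nonmin + threshold:
        return "minimal"
    return "nonminimal"

# A congested minimal path (queue 8, 2 hops) loses to a lightly loaded
# non-minimal path (queue 3, 4 hops): 16 > 12.
print(ugal_choice(8, 2, 3, 4))   # prints "nonminimal"
print(ugal_choice(2, 2, 3, 4))   # prints "minimal"
```

Under benign traffic the minimal queues stay short and UGAL behaves like minimal routing; under adversarial patterns the minimal queues grow and traffic spills onto non-minimal paths — the behavior the abstract's throughput models quantify per topology.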
{"title":"Modeling Universal Globally Adaptive Load-Balanced Routing","authors":"Md Atiqul Mollah, Wenqi Wang, Peyman Faizian, Md. Shafayat Rahman, Xin Yuan, S. Pakin, M. Lang","doi":"10.1145/3349620","DOIUrl":"https://doi.org/10.1145/3349620","url":null,"abstract":"Universal globally adaptive load-balanced (UGAL) routing has been proposed for various interconnection networks and has been deployed in a number of current-generation supercomputers. Although UGAL-based schemes have been extensively studied, most existing results are based on either simulation or measurement. Without a theoretical understanding of UGAL, multiple questions remain: For which traffic patterns is UGAL most suited? In addition, what determines the performance of the UGAL-based scheme on a particular network configuration? In this work, we develop a set of throughput models for UGALbased on linear programming. We show that the throughput models are valid across the torus, Dragonfly, and Slim Fly network topologies. Finally, we identify a robust model that can accurately and efficiently predict UGAL throughput for a set of representative traffic patterns across different topologies. Our models not only provide a mechanism to predict UGAL performance on large-scale interconnection networks but also reveal the inner working of UGAL and further our understanding of this type of routing.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"84908923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3
Scalable Deep Learning via I/O Analysis and Optimization
IF 1.6 Q2 Computer Science Pub Date : 2019-09-10 DOI: 10.1145/3331526
S. Pumma, Min Si, W. Feng, P. Balaji
Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.
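One generic way an I/O plugin can hide read latency behind training compute is to prefetch batches on a background thread through a bounded queue. This sketch illustrates that overlap idea only — it is not LMDBIO's actual mechanism, and the sleeps stand in for disk and compute time:

```python
import queue
import threading
import time

def reader(batches, q):
    """Background thread: stage batches while the trainer computes."""
    for b in batches:
        time.sleep(0.01)       # stand-in for a disk/LMDB read
        q.put(b)
    q.put(None)                # end-of-data sentinel

def train(batches, depth=2):
    """Overlap I/O and compute with a bounded prefetch queue of the
    given depth. With prefetching, each compute step runs while the
    next batch is being read, instead of waiting for it."""
    q = queue.Queue(maxsize=depth)
    threading.Thread(target=reader, args=(batches, q), daemon=True).start()
    seen = []
    while (b := q.get()) is not None:
        time.sleep(0.01)       # stand-in for the compute step
        seen.append(b)
    return seen

print(train(list(range(5))))   # → [0, 1, 2, 3, 4]
```

With 90% of training time going to I/O, as the abstract reports, this kind of overlap is exactly where large end-to-end gains come from: the queue keeps the compute step fed instead of stalled on reads.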
{"title":"Scalable Deep Learning via I/O Analysis and Optimization","authors":"S. Pumma, Min Si, W. Feng, P. Balaji","doi":"10.1145/3331526","DOIUrl":"https://doi.org/10.1145/3331526","url":null,"abstract":"Scalable deep neural network training has been gaining prominence because of the increasing importance of deep learning in a multitude of scientific and commercial domains. Consequently, a number of researchers have investigated techniques to optimize deep learning systems. Much of the prior work has focused on runtime and algorithmic enhancements to optimize the computation and communication. Despite these enhancements, however, deep learning systems still suffer from scalability limitations, particularly with respect to data I/O. This situation is especially true for training models where the computation can be effectively parallelized, leaving I/O as the major bottleneck. In fact, our analysis shows that I/O can take up to 90% of the total training time. Thus, in this article, we first analyze LMDB, the most widely used I/O subsystem of deep learning frameworks, to understand the causes of this I/O inefficiency. Based on our analysis, we propose LMDBIO—an optimized I/O plugin for scalable deep learning. LMDBIO includes six novel optimizations that together address the various shortcomings in existing I/O for deep learning. 
Our experimental results show that LMDBIO significantly outperforms LMDB in all cases and improves overall application performance by up to 65-fold on a 9,216-core system.","PeriodicalId":42115,"journal":{"name":"ACM Transactions on Parallel Computing","volume":null,"pages":null},"PeriodicalIF":1.6,"publicationDate":"2019-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87253170","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 23