
Latest publications: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)

Localized Fault Recovery for Nested Fork-Join Programs
Gokcen Kestor, S. Krishnamoorthy, Wenjing Ma
Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.75
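The localized-recovery idea, re-executing only the task subtrees a failed worker owned rather than restarting the whole fork-join computation, can be sketched with a toy simulation. The task tree, the ownership rule, and the names `Task`, `build_fib`, and `lost_subtrees` are illustrative assumptions, not the paper's algorithm:

```python
NUM_WORKERS = 2

class Task:
    def __init__(self, name, owner):
        self.name, self.owner = name, owner
        self.children = []
        self.result = None

def build_fib(n, owner):
    # Expand fib(n) into a nested fork-join task tree; the right child is
    # "stolen" by the next worker, mimicking work-stealing ownership.
    t = Task("fib(%d)" % n, owner)
    if n >= 2:
        t.children = [build_fib(n - 1, owner),
                      build_fib(n - 2, (owner + 1) % NUM_WORKERS)]
    return t

def execute(t):
    # Leaves count as 1, so fib(5) evaluates to 8 leaves.
    t.result = 1 if not t.children else sum(execute(c) for c in t.children)
    return t.result

def lost_subtrees(t, failed):
    # Localized recovery: collect only subtrees rooted at tasks the failed
    # worker owned; everything else keeps its results.
    if t.owner == failed:
        return [t]
    return [s for c in t.children for s in lost_subtrees(c, failed)]

root = build_fib(5, 0)
execute(root)
lost = lost_subtrees(root, 1)   # worker 1 fail-stops
for s in lost:
    execute(s)                  # replay only the lost subtrees
```

In this toy run only 4 of the 15 task-tree nodes are roots of replayed subtrees, which is the "work lost is proportional to recovery cost" property the abstract claims.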
Citations: 18
Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage
Jungwon Kim, Kittisak Sajjapongse, Seyong Lee, J. Vetter
A surprising development in recently announced HPC platforms is the addition of, sometimes massive amounts of, persistent (nonvolatile) memory (NVM) in order to increase memory capacity and compensate for plateauing I/O capabilities. However, there are no portable and scalable programming interfaces using aggregate NVM effectively. This paper introduces Papyrus: a new software system built to exploit emerging capability of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus mainly consists of Papyrus Virtual File System (VFS) and Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes without requiring the user to handle complex communication, synchronization, replication, and consistency model. We evaluate Papyrus on two HPC systems, including UTK Beacon and NERSC Cori, using real NVM storage devices.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.72
Citations: 8
HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems
Yonghong Yan, Jiawen Liu, K. Cameron, M. Umar
Heterogeneous computing systems, e.g., those with accelerators in addition to the host CPUs, offer accelerated performance for a variety of workloads. However, most parallel programming models require platform-dependent, time-consuming hand-tuning efforts for collectively using all the resources in a system to achieve efficient results. In this work, we explore the use of OpenMP parallel language extensions to empower users with the ability to design applications that automatically and simultaneously leverage CPUs and accelerators to further optimize use of available resources. We believe such automation will be key to ensuring codes adapt to increases in the number and diversity of accelerator resources for future computing systems. The proposed system combines language extensions to OpenMP, load-balancing algorithms and heuristics, and a runtime system for loop distribution across heterogeneous processing elements. We demonstrate the effectiveness of our automated approach to program on systems with multiple CPUs, GPUs, and MICs.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.99
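The kind of capacity-proportional loop distribution such a runtime performs can be sketched as follows. The function name and the throughput inputs are hypothetical, not HOMP's actual interface:

```python
def split_iterations(n, throughputs):
    # Give each device a share of the n loop iterations proportional to its
    # measured throughput (e.g., iterations/ms from a profiling pass);
    # leftover iterations from integer division go to the fastest device.
    total = sum(throughputs)
    counts = [n * t // total for t in throughputs]
    counts[throughputs.index(max(throughputs))] += n - sum(counts)
    return counts
```

For example, with one CPU and two GPUs measured at relative throughputs 1, 2, and 2, a 100-iteration loop would split as 20/40/40.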
Citations: 6
Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks
Wei Chen, J. Rao, Xiaobo Zhou
MapReduce applications, which require access to a large number of computing nodes, are commonly deployed in heterogeneous environments. The performance discrepancy between individual nodes in a heterogeneous cluster presents significant challenges to attaining good performance in MapReduce jobs. MapReduce implementations designed and optimized for homogeneous environments perform poorly on heterogeneous clusters. We attribute suboptimal performance in heterogeneous clusters to significant load imbalance between map tasks. We identify two MapReduce designs that hinder load balancing: (1) static binding between mappers and their data makes it difficult to exploit data redundancy for load balancing; (2) uniform map sizes are not optimal for nodes with heterogeneous performance. To address these issues, we propose FlexMap, a user-transparent approach that dynamically provisions map tasks to match distinct machine capacity in heterogeneous environments. We implemented FlexMap in Hadoop-2.6.0. Experimental results show that it reduces job completion time by as much as 40% compared to stock Hadoop and 30% compared to SkewTune.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.28
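The benefit of sizing map tasks to node capacity can be illustrated with a small dispatch simulation. This is a sketch under assumed node speeds and chunk sizes, not FlexMap's implementation:

```python
import heapq

def makespan(total, chunk_for, speeds):
    # Simulate greedy task dispatch: whenever a node becomes idle it grabs
    # the next map task; chunk_for(i) decides how much input node i gets.
    idle = [(0.0, i) for i in range(len(speeds))]
    heapq.heapify(idle)
    remaining, finish = total, 0.0
    while remaining > 0:
        t, i = heapq.heappop(idle)
        size = min(chunk_for(i), remaining)
        remaining -= size
        done = t + size / speeds[i]
        finish = max(finish, done)
        heapq.heappush(idle, (done, i))
    return finish

speeds = [1.0, 4.0]                                        # slow node, 4x faster node
uniform = makespan(100, lambda i: 50, speeds)              # fixed-size map tasks
elastic = makespan(100, lambda i: 20 * speeds[i], speeds)  # capacity-sized tasks
```

With uniform 50-unit tasks the slow node becomes the straggler (finishes at t=50); elastic sizing lets both nodes finish together at t=20.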
Citations: 11
NVIDIA Deep Learning Tutorial
J. Bernauer
Learn how hardware and software stacks enable not only quick prototyping, but also efficient large-scale production deployments. The tutorial will conclude with a discussion about hands-on deep learning training opportunities as well as free academic teaching materials and GPU cloud platforms for university faculty.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.7
Citations: 1
Container-Based Cloud Platform for Mobile Computation Offloading
Song Wu, Chao Niu, J. Rao, Hai Jin, Xiaohai Dai
With the explosive growth of smartphones and cloud computing, mobile cloud, which leverages cloud resource to boost the performance of mobile applications, becomes attractive. Many efforts have been made to improve the performance and reduce energy consumption of mobile devices by offloading computational codes to the cloud. However, the offloading cost caused by the cloud platform has been ignored for many years. In this paper, we propose Rattrap, a lightweight cloud platform which improves the offloading performance from cloud side. To achieve such goals, we analyze the characteristics of typical offloading workloads and design our platform solution accordingly. Rattrap develops a new runtime environment, Cloud Android Container, for mobile computation offloading, replacing heavyweight virtual machines (VMs). Our design exploits the idea of running operating systems with differential kernel features inside containers with driver extensions, which partially breaks the limitation of OS-level virtualization. With proposed resource sharing and code cache mechanism, Rattrap fundamentally improves the offloading performance. Our evaluation shows that Rattrap not only reduces the startup time of runtime environments and shows an average speedup of 16x, but also saves a large amount of system resources such as 75% memory footprint and at least 79% disk capacity. Moreover, Rattrap improves offloading response by as high as 63% over the cloud platform based on VM, and thus saving the battery life.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.47
Citations: 47
Application Level Reordering of Remote Direct Memory Access Operations
W. Lavrijsen, Costin Iancu
We present methods for the effective application level reordering of non-blocking RDMA operations. We supplement out-of-order hardware delivery mechanisms with heuristics to account for the CPU side overhead of communication and for differences in network latency: a runtime scheduler takes into account message sizes, destination and concurrency and reorders operations to improve overall communication throughput. Results are validated on InfiniBand and Cray Aries networks, for SPMD and hybrid (SPMD+OpenMP) programming models. We show up to 5× potential speedup, with 30-50% more typical, for synthetic message patterns in microbenchmarks. We also obtain up to 33% improvement in the communication stages in application settings. While the design space is complex, the resulting scheduler is simple, both internally and at the application level interfaces. It also provides performance portability across networks and programming models. We believe these techniques can be easily retrofitted within any application or runtime framework that uses one-sided communication, e.g. using GASNet, MPI 3.0 RMA or low level APIs such as IBVerbs.
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.98
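One simple reordering heuristic in this spirit, issuing small latency-bound messages first and round-robining across destinations so that no single endpoint's queue serializes injection, can be sketched as follows. The function and message list are illustrative, not the paper's scheduler:

```python
from collections import defaultdict, deque

def reorder(ops):
    # ops is a list of (destination, nbytes) for pending non-blocking puts.
    # 1) Sort by size so small, latency-sensitive messages go first.
    # 2) Round-robin across destinations to spread load over endpoints.
    by_dest = defaultdict(deque)
    for dest, nbytes in sorted(ops, key=lambda o: o[1]):
        by_dest[dest].append((dest, nbytes))
    order, queues = [], deque(by_dest.values())
    while queues:
        q = queues.popleft()
        order.append(q.popleft())
        if q:
            queues.append(q)
    return order

ops = [("A", 4096), ("A", 64), ("B", 64), ("A", 1024), ("B", 8192)]
issue_order = reorder(ops)
```

Here the two 64-byte messages are issued first, to different destinations, before the large bandwidth-bound transfers.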
Citations: 6
E^2MC: Entropy Encoding Based Memory Compression for GPUs
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.101
S. Lal, J. Lucas, B. Juurlink
Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role in scaling the performance of GPUs. Memory compression is a promising approach for improving memory bandwidth which can translate into higher performance and energy efficiency. However, compression is not free and its challenges need to be addressed, otherwise the benefits of compression may be offset by its overhead. We propose an entropy encoding based memory compression (E2MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. Furthermore, we address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. The average compression ratio of E2MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to no compression and 8% higher compared to the state of the art. Energy consumption and energy-delay product are reduced by 13% and 27%, respectively. Moreover, the compression ratio achieved by E2MC is close to the optimal compression ratio given by Shannon's source coding theorem.
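A plain software Huffman coder illustrates why entropy encoding pays off on skewed value distributions; E2MC's probability estimation, symbol-length selection, and low-latency hardware decoding are not modeled in this sketch:

```python
import heapq
from collections import Counter
from itertools import count

def huffman_code(data):
    # Build a Huffman code book {symbol: bitstring} from symbol frequencies.
    # Heap entries are [freq, unique_id, symbol, left, right]; the unique id
    # breaks frequency ties so nodes are never compared directly.
    uid = count()
    heap = [[f, next(uid), s, None, None] for s, f in Counter(data).items()]
    heapq.heapify(heap)
    if len(heap) == 1:                      # degenerate: one distinct symbol
        return {heap[0][2]: "0"}
    while len(heap) > 1:
        a = heapq.heappop(heap)
        b = heapq.heappop(heap)
        heapq.heappush(heap, [a[0] + b[0], next(uid), None, a, b])
    book = {}
    def walk(node, prefix):
        _, _, sym, left, right = node
        if sym is not None:
            book[sym] = prefix
        else:
            walk(left, prefix + "0")
            walk(right, prefix + "1")
    walk(heap[0], "")
    return book

data = b"aaaaaaaabbbbccd"   # skewed distribution: a x8, b x4, c x2, d x1
book = huffman_code(data)
bits = sum(len(book[s]) for s in data)
ratio = (8 * len(data)) / bits   # 120 raw bits vs 25 encoded bits = 4.8x
```

The more skewed the symbol distribution, the shorter the frequent symbols' codes and the higher the ratio, which is why the abstract's gains depend on good probability estimation.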
Citations: 20
A Scalable System Architecture to Addressing the Next Generation of Predictive Simulation Workflows with Coupled Compute and Data Intensive Applications
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.129
M. Seager
Trends in the emerging digital economy are pushing the virtual representation of products and services. Creating these digital twins requires a combination of real time data ingestion, simulation of physical products under real world conditions, service delivery optimization and data analytics as well as ML/DL anomaly detection and decision making. Quantification of Uncertainty in the simulations will also be a compute and data intensive workflow that will drive the simulation improvement cycle. Future high-end computing systems designs need to comprehend these types of complex workflows and provide a flexible framework for optimizing the design and operations under dynamic load conditions for them.
Citations: 0
MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers
Teng Wang, A. Moody, Yue Zhu, K. Mohror, Kento Sato, T. Islam, Weikuan Yu
Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across many processes. This is due to the need to track and organize file segments across the distributed burst buffers in a global index. Because this global index can be accessed concurrently by thousands or more processes in a scientific application, the scalability of metadata management is a severe performance-limiting factor. In this paper, we propose MetaKV: a key-value store that provides fast and scalable metadata management for HPC metadata workloads on distributed burst buffers. MetaKV complements the functionality of an existing key-value store with specialized metadata services that efficiently handle bursty and concurrent metadata workloads: compressed storage management, supervised block clustering, and log-ring based collective message reduction. Our experiments demonstrate that MetaKV outperforms the state-of-the-art key-value stores by a significant margin. It improves put and get metadata operations by as much as 2.66× and 6.29×, respectively, and the benefits of MetaKV increase with increasing metadata workload demand.
分布式突发缓冲区是一种很有前途的存储架构,用于处理百亿亿次计算的I/O工作负载。它们的总存储带宽随系统节点数线性增长。然而,尽管科学应用程序可以通过让每个进程写入其节点本地突发缓冲区来实现可扩展的写入带宽,但元数据挑战仍然是艰巨的,特别是对于跨多个进程共享的文件。这是由于需要在全局索引中跟踪和组织跨分布式突发缓冲区的文件段。因为这个全局索引可以被科学应用程序中的数千个或更多进程并发访问,所以元数据管理的可伸缩性是一个严重的性能限制因素。在本文中,我们提出了MetaKV:一个键值存储,为分布式突发缓冲区上的HPC元数据工作负载提供快速和可扩展的元数据管理。MetaKV用专门的元数据服务补充了现有键值存储的功能,这些服务可以有效地处理突发和并发的元数据工作负载:压缩存储管理、监督块集群和基于日志环的集体消息减少。我们的实验表明,MetaKV的性能明显优于最先进的键值存储。它将元数据的put和get操作分别提高了2.66倍和6.29倍,并且MetaKV的好处随着元数据工作负载需求的增加而增加。
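The core metadata problem the abstract describes can be illustrated with a minimal sketch: when many processes write segments of one shared file to their node-local burst buffers, a global index must map each byte range to the node that holds it. The class and method names below are illustrative assumptions, not the actual MetaKV API.

```python
import bisect

class SegmentIndex:
    """Toy global index mapping (offset, length) extents of one shared
    file to the burst-buffer node that holds the data. Real systems like
    MetaKV distribute and compress this index; here it is a single sorted
    list for clarity."""

    def __init__(self):
        self._starts = []  # sorted segment start offsets
        self._segs = []    # parallel list of (start, length, node_id)

    def put(self, offset, length, node_id):
        """Record that `node_id` holds bytes [offset, offset+length)."""
        i = bisect.bisect_left(self._starts, offset)
        self._starts.insert(i, offset)
        self._segs.insert(i, (offset, length, node_id))

    def get(self, offset):
        """Return the node holding the byte at `offset`, or None."""
        i = bisect.bisect_right(self._starts, offset) - 1
        if i >= 0:
            start, length, node = self._segs[i]
            if start <= offset < start + length:
                return node
        return None

# Each writer records its node-local segments; any reader resolves offsets.
idx = SegmentIndex()
idx.put(0, 4096, node_id=3)     # a rank on node 3 wrote bytes [0, 4096)
idx.put(4096, 4096, node_id=7)  # a rank on node 7 wrote bytes [4096, 8192)
print(idx.get(5000))  # → 7
```

The scalability bottleneck motivating the paper is exactly that thousands of processes may call `put` and `get` on such an index concurrently, which is why MetaKV adds specialized services (block clustering, collective message reduction) on top of a plain key-value store.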
{"title":"MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers","authors":"Teng Wang, A. Moody, Yue Zhu, K. Mohror, Kento Sato, T. Islam, Weikuan Yu","doi":"10.1109/IPDPS.2017.39","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.39","url":null,"abstract":"Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across many processes. This is due to the need to track and organize file segments across the distributed burst buffers in a global index. Because this global index can be accessed concurrently by thousands or more processes in a scientific application, the scalability of metadata management is a severe performance-limiting factor. In this paper, we propose MetaKV: a key-value store that provides fast and scalable metadata management for HPC metadata workloads on distributed burst buffers. MetaKV complements the functionality of an existing key-value store with specialized metadata services that efficiently handle bursty and concurrent metadata workloads: compressed storage management, supervised block clustering, and log-ring based collective message reduction. Our experiments demonstrate that MetaKV outperforms the state-of-the-art key-value stores by a significant margin. It improves put and get metadata operations by as much as 2.66× and 6.29×, respectively, and the benefits of MetaKV increase with increasing metadata workload demand.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129041877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17