
2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS): Latest Publications

Localized Fault Recovery for Nested Fork-Join Programs
Gokcen Kestor, S. Krishnamoorthy, Wenjing Ma
Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.
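The core idea, tracking which worker executed each stolen piece of the task space so that after a fail-stop failure exactly the lost work can be identified for re-execution, can be illustrated with a minimal sketch. This is an illustration only, not the paper's algorithm; `StealTracker` and the task names are invented:

```python
class StealTracker:
    """Record which worker executed each task so that, on a fail-stop
    failure, the lost portion of the task space can be identified."""

    def __init__(self):
        self.owner = {}  # task id -> worker that executed it

    def record(self, task_id, worker):
        self.owner[task_id] = worker

    def lost_tasks(self, failed_workers):
        # Only tasks run by failed workers need re-execution; the rest
        # of the task space is untouched (localized recovery).
        return sorted(t for t, w in self.owner.items() if w in failed_workers)

tracker = StealTracker()
for task, worker in [("root", 0), ("left", 1), ("left.a", 1), ("right", 2)]:
    tracker.record(task, worker)
print(tracker.lost_tasks({1}))  # → ['left', 'left.a']
```

The paper's contribution is maintaining this ownership information efficiently under work stealing, where tasks migrate between threads at runtime.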
{"title":"Localized Fault Recovery for Nested Fork-Join Programs","authors":"Gokcen Kestor, S. Krishnamoorthy, Wenjing Ma","doi":"10.1109/IPDPS.2017.75","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.75","url":null,"abstract":"Nested fork-join programs scheduled using work stealing can automatically balance load and adapt to changes in the execution environment. In this paper, we design an approach to efficiently recover from faults encountered by these programs. Specifically, we focus on localized recovery of the task space in the presence of fail-stop failures. We present an approach to efficiently track, under work stealing, the relationships between the work executed by various threads. This information is used to identify and schedule the tasks to be re-executed without interfering with normal task execution. The algorithm precisely computes the work lost, incurs minimal re-execution overhead, and can recover from an arbitrary number of failures. Experimental evaluation demonstrates low overheads in the absence of failures, recovery overheads on the same order as the lost work, and much lower recovery costs than alternative strategies.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132822982","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 18
Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage
Jungwon Kim, Kittisak Sajjapongse, Seyong Lee, J. Vetter
A surprising development in recently announced HPC platforms is the addition of, sometimes massive amounts of, persistent (nonvolatile) memory (NVM) in order to increase memory capacity and compensate for plateauing I/O capabilities. However, there are no portable and scalable programming interfaces using aggregate NVM effectively. This paper introduces Papyrus: a new software system built to exploit emerging capability of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus mainly consists of Papyrus Virtual File System (VFS) and Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes without requiring the user to handle complex communication, synchronization, replication, and consistency model. We evaluate Papyrus on two HPC systems, including UTK Beacon and NERSC Cori, using real NVM storage devices.
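At the TCL level, the paper describes containers whose elements are distributed across NVM nodes without exposing communication or placement to the user. A toy sketch of that idea (hypothetical names, not the Papyrus API) hashes each key to a shard standing in for an NVM node:

```python
class DistMap:
    """Toy distributed container: each key is hashed to one of several
    shards (standing in for NVM nodes); placement is hidden behind a
    plain put/get interface."""

    def __init__(self, num_nodes):
        self.shards = [{} for _ in range(num_nodes)]

    def _node(self, key):
        # Placement decision the user never sees.
        return hash(key) % len(self.shards)

    def put(self, key, value):
        self.shards[self._node(key)][key] = value

    def get(self, key):
        return self.shards[self._node(key)][key]

m = DistMap(num_nodes=4)
m.put("temperature", 300.0)
print(m.get("temperature"))  # → 300.0
```

A real implementation must additionally handle persistence, replication, and consistency across nodes, which is precisely the machinery Papyrus TCL hides behind this kind of interface.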
{"title":"Design and Implementation of Papyrus: Parallel Aggregate Persistent Storage","authors":"Jungwon Kim, Kittisak Sajjapongse, Seyong Lee, J. Vetter","doi":"10.1109/IPDPS.2017.72","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.72","url":null,"abstract":"A surprising development in recently announced HPC platforms is the addition of, sometimes massive amounts of, persistent (nonvolatile) memory (NVM) in order to increase memory capacity and compensate for plateauing I/O capabilities. However, there are no portable and scalable programming interfaces using aggregate NVM effectively. This paper introduces Papyrus: a new software system built to exploit emerging capability of NVM in HPC architectures. Papyrus (or Parallel Aggregate Persistent -YRU- Storage) is a novel programming system that provides features for scalable, aggregate, persistent memory in an extreme-scale system for typical HPC usage scenarios. Papyrus mainly consists of Papyrus Virtual File System (VFS) and Papyrus Template Container Library (TCL). Papyrus VFS provides a uniform aggregate NVM storage image across diverse NVM architectures. It enables Papyrus TCL to provide a portable and scalable high-level container programming interface whose data elements are distributed across multiple NVM nodes without requiring the user to handle complex communication, synchronization, replication, and consistency model. 
We evaluate Papyrus on two HPC systems, including UTK Beacon and NERSC Cori, using real NVM storage devices.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131848758","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 8
HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems
Yonghong Yan, Jiawen Liu, K. Cameron, M. Umar
Heterogeneous computing systems, e.g., those with accelerators in addition to the host CPUs, offer accelerated performance for a variety of workloads. However, most parallel programming models require platform-dependent, time-consuming hand-tuning efforts to use all the resources in a system collectively and efficiently. In this work, we explore OpenMP parallel language extensions that empower users to design applications which automatically and simultaneously leverage CPUs and accelerators to further optimize use of available resources. We believe such automation will be key to ensuring codes adapt to increases in the number and diversity of accelerator resources in future computing systems. The proposed system combines language extensions to OpenMP, load-balancing algorithms and heuristics, and a runtime system for loop distribution across heterogeneous processing elements. We demonstrate the effectiveness of our automated approach on systems with multiple CPUs, GPUs, and MICs.
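The loop-distribution idea, splitting a loop's iterations across heterogeneous devices in proportion to their relative throughput, can be sketched as below. This is a simplification under assumed relative-speed inputs, not HOMP's actual load-balancing algorithm:

```python
def split_iterations(n, speeds):
    """Split n loop iterations across devices in proportion to their
    relative speeds; leftover iterations go to the first device."""
    total = sum(speeds)
    counts = [n * s // total for s in speeds]
    counts[0] += n - sum(counts)  # hand the integer-division remainder to device 0
    return counts

# A CPU with relative speed 1 and a GPU with relative speed 3 split
# 100 iterations 25/75, so both finish at roughly the same time.
print(split_iterations(100, [1, 3]))  # → [25, 75]
```

In practice the relative speeds must be measured or modeled per loop, which is where HOMP's heuristics and runtime come in.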
{"title":"HOMP: Automated Distribution of Parallel Loops and Data in Highly Parallel Accelerator-Based Systems","authors":"Yonghong Yan, Jiawen Liu, K. Cameron, M. Umar","doi":"10.1109/IPDPS.2017.99","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.99","url":null,"abstract":"Heterogeneous computing systems, e.g., those with accelerators than the host CPUs, offer the accelerated performance for a variety of workloads. However, most parallel programming models require platform dependent, time-consuming hand-tuning efforts for collectively using all the resources in a system to achieve efficient results. In this work, we explore the use of OpenMP parallel language extensions to empower users with the ability to design applications that automatically and simultaneously leverage CPUs and accelerators to further optimize use of available resources. We believe such automation will be key to ensuring codes adapt to increases in the number and diversity of accelerator resources for future computing systems. The proposed system combines language extensions to OpenMP, load-balancing algorithms and heuristics, and a runtime system for loop distribution across heterogeneous processing elements. We demonstrate the effectiveness of our automated approach to program on systems with multiple CPUs, GPUs, and MICs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114423383","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Application Level Reordering of Remote Direct Memory Access Operations
W. Lavrijsen, Costin Iancu
We present methods for the effective application level reordering of non-blocking RDMA operations. We supplement out-of-order hardware delivery mechanisms with heuristics to account for the CPU side overhead of communication and for differences in network latency: a runtime scheduler takes into account message sizes, destination and concurrency and reorders operations to improve overall communication throughput. Results are validated on InfiniBand and Cray Aries networks, for SPMD and hybrid (SPMD+OpenMP) programming models. We show up to 5× potential speedup, with 30-50% more typical, for synthetic message patterns in microbenchmarks. We also obtain up to 33% improvement in the communication stages in application settings. While the design space is complex, the resulting scheduler is simple, both internally and at the application level interfaces. It also provides performance portability across networks and programming models. We believe these techniques can be easily retrofitted within any application or runtime framework that uses one-sided communication, e.g. using GASNet, MPI 3.0 RMA or low level APIs such as IBVerbs.
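A drastically simplified stand-in for such a scheduler (illustrative only; the paper's heuristics also account for CPU overhead and concurrency) groups pending operations by destination, issues smaller messages first, and interleaves destinations:

```python
from collections import defaultdict, deque

def schedule(ops):
    """Reorder non-blocking operations given as (dest, size) pairs:
    per destination, smaller messages first; then round-robin across
    destinations so no single endpoint is saturated."""
    by_dest = defaultdict(list)
    for dest, size in ops:
        by_dest[dest].append((dest, size))
    queues = [deque(sorted(by_dest[d], key=lambda x: x[1]))
              for d in sorted(by_dest)]
    out = []
    while any(queues):
        for q in queues:
            if q:
                out.append(q.popleft())
    return out

# The small message to dest 0 is issued before the large one.
print(schedule([(0, 8), (0, 2), (1, 4)]))  # → [(0, 2), (1, 4), (0, 8)]
```

The real design space includes when to flush, how many operations to hold back, and how to avoid delaying the critical path; this sketch only conveys the reordering principle.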
{"title":"Application Level Reordering of Remote Direct Memory Access Operations","authors":"W. Lavrijsen, Costin Iancu","doi":"10.1109/IPDPS.2017.98","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.98","url":null,"abstract":"We present methods for the effective application level reordering of non-blocking RDMA operations. We supplement out-of-order hardware delivery mechanisms with heuristics to account for the CPU side overhead of communication and for differences in network latency: a runtime scheduler takes into account message sizes, destination and concurrency and reorders operations to improve overall communication throughput. Results are validated on InfiniBand and Cray Aries networks, for SPMD and hybrid (SPMD+OpenMP) programming models. We show up to 5! potential speedup, with 30-50% more typical, for synthetic message patterns in microbenchmarks. We also obtain up to 33% improvement in the communication stages in application settings. While the design space is complex, the resulting scheduler is simple, both internally and at the application level interfaces. It also provides performance portability across networks and programming models. We believe these techniques can be easily retrofitted within any application or runtime framework that uses one-sided communication, e.g. using GASNet, MPI 3.0 RMA or low level APIs such as IBVerbs.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116174080","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
E^2MC: Entropy Encoding Based Memory Compression for GPUs
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.101
S. Lal, J. Lucas, B. Juurlink
Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role for scaling the performance of GPUs. Memory compression is a promising approach for improving memory bandwidth which can translate into higher performance and energy efficiency. However, compression is not free and its challenges need to be addressed, otherwise the benefits of compression may be offset by its overhead. We propose an entropy encoding based memory compression (E2MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. Furthermore, we address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. The average compression ratio of E2MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to no compression and 8% higher compared to the state of the art. Energy consumption and energy-delay product are reduced by 13% and 27%, respectively.
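The Huffman principle underlying E2MC assigns short codewords to frequent symbols, which is what yields compression. A minimal sketch computing per-symbol code lengths (standard textbook Huffman, not the paper's hardware-oriented variant):

```python
import heapq

def huffman_lengths(freqs):
    """Return the Huffman code length (in bits) for each symbol in
    freqs (symbol -> count): frequent symbols get shorter codes."""
    # Each heap entry: (subtree frequency, tie-breaker id, symbols in subtree).
    heap = [(f, i, [s]) for i, (s, f) in enumerate(sorted(freqs.items()))]
    heapq.heapify(heap)
    lengths = {s: 0 for s in freqs}
    uid = len(heap)
    while len(heap) > 1:
        f1, _, s1 = heapq.heappop(heap)
        f2, _, s2 = heapq.heappop(heap)
        for s in s1 + s2:          # every merge adds one bit to these symbols
            lengths[s] += 1
        heapq.heappush(heap, (f1 + f2, uid, s1 + s2))
        uid += 1
    return lengths

print(huffman_lengths({"a": 5, "b": 1, "c": 1}))  # → {'a': 1, 'b': 2, 'c': 2}
```

The paper's challenges (probability estimation for memory blocks, symbol length selection, low-latency hardware decompression) sit on top of this basic construction.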
{"title":"E^2MC: Entropy Encoding Based Memory Compression for GPUs","authors":"S. Lal, J. Lucas, B. Juurlink","doi":"10.1109/IPDPS.2017.101","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.101","url":null,"abstract":"Modern Graphics Processing Units (GPUs) provide much higher off-chip memory bandwidth than CPUs, but many GPU applications are still limited by memory bandwidth. Unfortunately, off-chip memory bandwidth is growing slower than the number of cores and has become a performance bottleneck. Thus, optimizations of effective memory bandwidth play a significant role for scaling the performance of GPUs. Memory compression is a promising approach for improving memory bandwidth which can translate into higher performance and energy efficiency. However, compression is not free and its challenges need to be addressed, otherwise the benefits of compression may be offset by its overhead. We propose an entropy encoding based memory compression (E2MC) technique for GPUs, which is based on the well-known Huffman encoding. We study the feasibility of entropy encoding for GPUs and show that it achieves higher compression ratios than state-of-the-art GPU compression techniques. Furthermore, we address the key challenges of probability estimation, choosing an appropriate symbol length for encoding, and decompression with low latency. The average compression ratio of E2MC is 53% higher than the state of the art. This translates into an average speedup of 20% compared to no compression and 8% higher compared to the state of the art. Energy consumption and energy-delayproduct are reduced by 13% and 27%, respectively. 
Moreover, the compression ratio achieved by E2MC is close to the optimal compression ratio given by Shannon’s source coding theorem.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123194738","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
A Scalable System Architecture to Addressing the Next Generation of Predictive Simulation Workflows with Coupled Compute and Data Intensive Applications
Pub Date: 2017-05-01, DOI: 10.1109/IPDPS.2017.129
M. Seager
Trends in the emerging digital economy are pushing the virtual representation of products and services. Creating these digital twins requires a combination of real-time data ingestion, simulation of physical products under real-world conditions, service delivery optimization and data analytics, as well as ML/DL anomaly detection and decision making. Quantification of uncertainty in the simulations will also be a compute- and data-intensive workflow that will drive the simulation improvement cycle. Future high-end computing system designs need to comprehend these types of complex workflows and provide a flexible framework for optimizing their design and operations under dynamic load conditions.
{"title":"A Scalable System Architecture to Addressing the Next Generation of Predictive Simulation Workflows with Coupled Compute and Data Intensive Applications","authors":"M. Seager","doi":"10.1109/IPDPS.2017.129","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.129","url":null,"abstract":"Trends in the emerging digital economy are pushing the virtual representation of products and services. Creating these digital twins requires a combination of real time data ingestion, simulation of physical products under real world conditions, service delivery optimization and data analytics as well as ML/DL anomaly detection and decision making. Quantification of Uncertainty in the simulations will also be a compute and data intensive workflow that will drive the simulation improvement cycle. Future high-end computing systems designs need to comprehend these types of complex workflows and provide a flexible framework for optimizing the design and operations under dynamic load conditions for them.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124850288","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Clustering Throughput Optimization on the GPU
M. Gowanlock, C. Rude, D. M. Blair, Justin D. Li, V. Pankratius
Large datasets in astronomy and geoscience often require clustering and visualizations of phenomena at different densities and scales in order to generate scientific insight. We examine the problem of maximizing clustering throughput for concurrent dataset clustering in spatial dimensions. We introduce a novel hybrid approach that uses GPUs in conjunction with multicore CPUs for algorithmic throughput optimizations. The key idea is to exploit the fast memory on the GPU for index searches and optimize I/O transfers in such a way that the low-bandwidth host-GPU bottleneck does not have a significant negative performance impact. To achieve this, we derive two distinct GPU kernels that exploit grid-based indexing schemes to improve clustering performance. To obviate limited GPU memory and enable large dataset clustering, our method is complemented by an efficient batching scheme for transfers between the host and GPU accelerator. This scheme is robust with respect to both sparse and dense data distributions and intelligently avoids buffer overflows that would otherwise degrade performance, all while minimizing the number of data transfers between the host and GPU. We evaluate our approaches on ionospheric total electron content datasets as well as intermediate-redshift galaxies from the Sloan Digital Sky Survey. Our hybrid approach yields a speedup of up to 50x over the sequential implementation on one of the experimental scenarios, which is respectable for I/O intensive clustering.
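The batching scheme, packing items into transfers that never exceed the GPU buffer so overflows are avoided by construction, can be sketched as below. This is an illustration of the principle, not the paper's implementation:

```python
def make_batches(item_sizes, capacity):
    """Greedily pack consecutive items into batches whose total size
    fits a fixed device-buffer capacity; returns lists of item indices.
    Packing up to the limit minimizes the number of host-GPU transfers."""
    batches, current, used = [], [], 0
    for i, size in enumerate(item_sizes):
        if current and used + size > capacity:
            batches.append(current)   # flush before the buffer would overflow
            current, used = [], 0
        current.append(i)
        used += size
    if current:
        batches.append(current)
    return batches

print(make_batches([4, 3, 2, 5], capacity=6))  # → [[0], [1, 2], [3]]
```

The paper's scheme additionally adapts to sparse versus dense data distributions, where per-item result sizes are hard to predict in advance.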
{"title":"Clustering Throughput Optimization on the GPU","authors":"M. Gowanlock, C. Rude, D. M. Blair, Justin D. Li, V. Pankratius","doi":"10.1109/IPDPS.2017.17","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.17","url":null,"abstract":"Large datasets in astronomy and geoscience often require clustering and visualizations of phenomena at different densities and scales in order to generate scientific insight. We examine the problem of maximizing clustering throughput for concurrent dataset clustering in spatial dimensions. We introduce a novel hybrid approach that uses GPUs in conjunction with multicore CPUs for algorithmic throughput optimizations. The key idea is to exploit the fast memory on the GPU for index searches and optimize I/O transfers in such a way that the low-bandwidth host-GPU bottleneck does not have a significant negative performance impact. To achieve this, we derive two distinct GPU kernels that exploit grid-based indexing schemes to improve clustering performance. To obviate limited GPU memory and enable large dataset clustering, our method is complemented by an efficient batching scheme for transfers between the host and GPU accelerator. This scheme is robust with respect to both sparse and dense data distributions and intelligently avoids buffer overflows that would otherwise degrade performance, all while minimizing the number of data transfers between the host and GPU. We evaluate our approaches on ionospheric total electron content datasets as well as intermediate-redshift galaxies from the Sloan Digital Sky Survey. 
Our hybrid approach yields a speedup of up to 50x over the sequential implementation on one of the experimental scenarios, which is respectable for I/O intensive clustering.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127835295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers
Teng Wang, A. Moody, Yue Zhu, K. Mohror, Kento Sato, T. Islam, Weikuan Yu
Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across many processes. This is due to the need to track and organize file segments across the distributed burst buffers in a global index. Because this global index can be accessed concurrently by thousands or more processes in a scientific application, the scalability of metadata management is a severe performance-limiting factor. In this paper, we propose MetaKV: a key-value store that provides fast and scalable metadata management for HPC metadata workloads on distributed burst buffers. MetaKV complements the functionality of an existing key-value store with specialized metadata services that efficiently handle bursty and concurrent metadata workloads: compressed storage management, supervised block clustering, and log-ring based collective message reduction. Our experiments demonstrate that MetaKV outperforms the state-of-the-art key-value stores by a significant margin. It improves put and get metadata operations by as much as 2.66× and 6.29×, respectively, and the benefits of MetaKV increase with increasing metadata workload demand.
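The global-index problem it solves, mapping byte ranges of a shared file to the burst-buffer node holding each segment, can be shown with a toy structure (illustrative names, not MetaKV's API; MetaKV's contribution is making this index scale under thousands of concurrent writers):

```python
class SegmentIndex:
    """Toy global index for a shared file written by many processes:
    each process registers the byte range it wrote and the node where
    that segment lives; readers look up which node holds an offset."""

    def __init__(self):
        self.segments = {}  # filename -> list of (start, end, node)

    def put(self, fname, start, end, node):
        self.segments.setdefault(fname, []).append((start, end, node))

    def lookup(self, fname, offset):
        for start, end, node in self.segments.get(fname, []):
            if start <= offset < end:   # half-open range [start, end)
                return node
        return None

idx = SegmentIndex()
idx.put("ckpt.dat", 0, 100, "node0")    # process on node0 wrote bytes [0, 100)
idx.put("ckpt.dat", 100, 200, "node1")  # process on node1 wrote bytes [100, 200)
print(idx.lookup("ckpt.dat", 150))  # → node1
```

A linear scan like this collapses at scale, which motivates MetaKV's compressed storage, block clustering, and collective message reduction.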
{"title":"MetaKV: A Key-Value Store for Metadata Management of Distributed Burst Buffers","authors":"Teng Wang, A. Moody, Yue Zhu, K. Mohror, Kento Sato, T. Islam, Weikuan Yu","doi":"10.1109/IPDPS.2017.39","DOIUrl":"https://doi.org/10.1109/IPDPS.2017.39","url":null,"abstract":"Distributed burst buffers are a promising storage architecture for handling I/O workloads for exascale computing. Their aggregate storage bandwidth grows linearly with system node count. However, although scientific applications can achieve scalable write bandwidth by having each process write to its node-local burst buffer, metadata challenges remain formidable, especially for files shared across many processes. This is due to the need to track and organize file segments across the distributed burst buffers in a global index. Because this global index can be accessed concurrently by thousands or more processes in a scientific application, the scalability of metadata management is a severe performance-limiting factor. In this paper, we propose MetaKV: a key-value store that provides fast and scalable metadata management for HPC metadata workloads on distributed burst buffers. MetaKV complements the functionality of an existing key-value store with specialized metadata services that efficiently handle bursty and concurrent metadata workloads: compressed storage management, supervised block clustering, and log-ring based collective message reduction. Our experiments demonstrate that MetaKV outperforms the state-of-the-art key-value stores by a significant margin. 
It improves put and get metadata operations by as much as 2.66× and 6.29×, respectively, and the benefits of MetaKV increase with increasing metadata workload demand.","PeriodicalId":209524,"journal":{"name":"2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2017-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129041877","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 17
PUNAS: A Parallel Ungapped-Alignment-Featured Seed Verification Algorithm for Next-Generation Sequencing Read Alignment
Yuandong Chan, Kai Xu, Haidong Lan, Weiguo Liu, Yongchao Liu, B. Schmidt
The progress of next-generation sequencing has a major impact on medical and genomic research. This technology can now produce billions of short DNA fragments (reads) in a single run. One of the most demanding computational problems used by almost every sequencing pipeline is short-read alignment; i.e. determining where each fragment originated from in the original genome. Most current solutions are based on a seed-and-extend approach, where promising candidate regions (seeds) are first identified and subsequently extended in order to verify whether a full high-scoring alignment actually exists in the vicinity of each seed. Seed verification is the main bottleneck in many state-of-the-art aligners and thus finding fast solutions is of high importance. We present a parallel ungapped-alignment-featured seed verification (PUNAS) algorithm, a fast filter for effectively removing the majority of false positive seeds, thus significantly accelerating the short-read alignment process. PUNAS is based on bit-parallelism and takes advantage of SIMD vector units of modern microprocessors. Our implementation employs a vectorize-and-scale approach supporting multi-core CPUs and many-core Knights Landing (KNL)-based Xeon Phi processors. Performance evaluation reveals that PUNAS is over three orders-of-magnitude faster than seed verification with the Smith-Waterman algorithm and around one order-of-magnitude faster than seed verification with the banded version of Myers bit-vector algorithm. Using a single thread it achieves a speedup of up to 7.3, 27.1, and 11.6 compared to the shifted Hamming distance filter on a SSE, AVX2, and AVX-512 based CPU/KNL, respectively. The speed of our framework further scales almost linearly with the number of cores. PUNAS is open-source software available at https://github.com/Xu-Kai/PUNASfilter.
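The bit-parallel flavor of ungapped seed verification can be conveyed with a small sketch: pack bases into 2 bits each, XOR the packed words, and count mismatching base positions. This is a toy scalar stand-in for the SIMD filter, not the PUNAS implementation:

```python
def ungapped_mismatches(read, ref):
    """Count mismatching bases between two equal-length DNA sequences
    using a 2-bit packed encoding and a single XOR over the packed words."""
    enc = {"A": 0, "C": 1, "G": 2, "T": 3}
    x = y = 0
    for a, b in zip(read, ref):
        x = (x << 2) | enc[a]
        y = (y << 2) | enc[b]
    diff = x ^ y            # nonzero 2-bit groups mark mismatching bases
    mismatches = 0
    while diff:
        if diff & 0b11:     # either bit of this base's pair differs
            mismatches += 1
        diff >>= 2
    return mismatches

print(ungapped_mismatches("ACGT", "ACCT"))  # → 1
```

A seed whose mismatch count exceeds the alignment threshold can be discarded without running the expensive gapped extension, which is the filtering role PUNAS plays in the pipeline.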
DOI: 10.1109/IPDPS.2017.35
Citations: 6
Addressing Performance Heterogeneity in MapReduce Clusters with Elastic Tasks
Wei Chen, J. Rao, Xiaobo Zhou
MapReduce applications, which require access to a large number of computing nodes, are commonly deployed in heterogeneous environments. The performance discrepancy between individual nodes in a heterogeneous cluster presents significant challenges to attaining good performance in MapReduce jobs. MapReduce implementations designed and optimized for homogeneous environments perform poorly on heterogeneous clusters. We attribute suboptimal performance in heterogeneous clusters to significant load imbalance between map tasks. We identify two MapReduce designs that hinder load balancing: (1) static binding between mappers and their data makes it difficult to exploit data redundancy for load balancing; (2) uniform map sizes are not optimal for nodes with heterogeneous performance. To address these issues, we propose FlexMap, a user-transparent approach that dynamically provisions map tasks to match distinct machine capacity in heterogeneous environments. We implemented FlexMap in Hadoop-2.6.0. Experimental results show that it reduces job completion time by as much as 40% compared to stock Hadoop and 30% compared to SkewTune.
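The idea behind elastic map tasks can be illustrated with a toy capacity-proportional split calculation: instead of uniform splits, each node's map input is sized in proportion to its measured throughput, so faster nodes receive more data and all nodes finish their map work at roughly the same time. This is a minimal sketch of the concept only; the function name and the capacity model are assumptions, not FlexMap's actual API.

```python
# Hypothetical sketch: size map splits in proportion to node capacity
# so heterogeneous nodes finish their map work at about the same time.

def elastic_splits(total_bytes, capacities):
    """Return per-node split sizes proportional to capacity.
    `capacities` maps node name -> relative throughput (e.g. MB/s)."""
    total_capacity = sum(capacities.values())
    splits = {
        node: total_bytes * cap // total_capacity
        for node, cap in capacities.items()
    }
    # Assign the integer-division remainder to the fastest node.
    fastest = max(capacities, key=capacities.get)
    splits[fastest] += total_bytes - sum(splits.values())
    return splits
```

For example, with a node three times faster than its peer, `elastic_splits(1000, {"a": 1, "b": 3})` assigns 250 bytes to `a` and 750 to `b`, in contrast to the 500/500 split a uniform scheduler would produce.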
DOI: 10.1109/IPDPS.2017.28
Citations: 11
Journal: 2017 IEEE International Parallel and Distributed Processing Symposium (IPDPS)