
Latest publications from SC15: International Conference for High Performance Computing, Networking, Storage and Analysis

Scalable sparse tensor decompositions in distributed memory systems
O. Kaya, B. Uçar
We investigate an efficient parallelization of the most common iterative sparse tensor decomposition algorithms on distributed memory systems. A key operation in each iteration of these algorithms is the matricized tensor times Khatri-Rao product (MTTKRP). This operation amounts to element-wise vector multiplications and reductions whose structure depends on the sparsity of the tensor. We investigate a fine-grain and a coarse-grain task definition for this operation, and propose hypergraph partitioning-based methods for these task definitions to achieve load balance as well as to reduce the communication requirements. We also design a distributed memory sparse tensor library, HyperTensor, which implements a well-known algorithm for the CANDECOMP/PARAFAC (CP) tensor decomposition using the task definitions and the associated partitioning methods. We use this library to test the proposed implementations of MTTKRP in the context of CP decomposition, and report scalability results up to 1024 MPI ranks. We observed up to 194-fold speedups using 512 MPI processes on well-known real-world data, and significantly better performance than a state-of-the-art implementation.
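To make the key operation concrete, here is a minimal sequential sketch of MTTKRP for a sparse tensor stored in COO form. This is illustrative only, not the HyperTensor implementation; all names are ours. For each nonzero, the matching rows of the other factor matrices are multiplied element-wise, scaled by the value, and accumulated into the output row selected by the nonzero's index along the target mode:

```python
import numpy as np

def mttkrp_coo(indices, values, factors, mode, rank):
    """MTTKRP for a sparse tensor in COO form: per-nonzero element-wise
    vector multiplications followed by a reduction into the output row."""
    out = np.zeros((factors[mode].shape[0], rank))
    for idx, val in zip(indices, values):
        row = np.full(rank, val)
        for m, f in enumerate(factors):
            if m != mode:
                row *= f[idx[m]]      # element-wise vector multiply
        out[idx[mode]] += row         # reduction into the output row
    return out

# tiny 2x2x2 tensor with two nonzeros
indices = [(0, 1, 0), (1, 0, 1)]
values = [2.0, 3.0]
rank = 2
A = np.ones((2, rank))
B = np.arange(4.0).reshape(2, rank)   # [[0, 1], [2, 3]]
C = np.ones((2, rank))
M = mttkrp_coo(indices, values, [A, B, C], mode=0, rank=rank)
```

In a distributed setting, the fine-grain versus coarse-grain task definitions in the paper amount to partitioning either the individual nonzero updates or whole output rows across MPI ranks.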
DOI: 10.1145/2807591.2807624 (published 2015-11-15)
Citations: 97
Energy-aware data transfer algorithms
I. Alan, Engin Arslan, T. Kosar
The amount of data moved over the Internet per year has already exceeded the exabyte scale and will soon hit the zettabyte range. To support this massive amount of data movement across the globe, the networking infrastructure as well as the source and destination nodes consume an immense amount of electric power, with an estimated cost measured in billions of dollars. Although a considerable amount of research has been done on power management techniques for the networking infrastructure, there has been little prior work on energy-aware data transfer algorithms that minimize the power consumed at the end systems. We introduce novel data transfer algorithms which aim to achieve high data transfer throughput while keeping the energy consumed during the transfers at minimal levels. Our experimental results show that our energy-aware data transfer algorithms can achieve up to 50% energy savings with the same or a higher level of data transfer throughput.
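The core trade-off, picking transfer parameters that keep throughput high while minimizing end-system energy, can be sketched as a simple selection rule. This is our own illustrative sketch under assumed measurements, not the algorithms from the paper; the field names and tolerance parameter are hypothetical:

```python
def pick_energy_aware(settings, tolerance=0.0):
    """Among candidate transfer-parameter settings, keep those whose
    measured throughput is within `tolerance` of the best, then choose
    the one that consumes the least energy."""
    best_tput = max(s["throughput"] for s in settings)
    eligible = [s for s in settings
                if s["throughput"] >= best_tput * (1 - tolerance)]
    return min(eligible, key=lambda s: s["energy"])

# hypothetical measurements: concurrency level, throughput (MB/s), energy (J)
candidates = [
    {"concurrency": 1, "throughput": 400, "energy": 120},
    {"concurrency": 4, "throughput": 900, "energy": 150},
    {"concurrency": 8, "throughput": 910, "energy": 210},
]
choice = pick_energy_aware(candidates, tolerance=0.05)
```

The point of the example is that pushing concurrency past the throughput knee (8 streams here) only burns extra energy, so an energy-aware policy backs off to the cheaper setting with near-peak throughput.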
DOI: 10.1145/2807591.2807628 (published 2015-11-15)
Citations: 26
Frugal ECC: efficient and versatile memory error protection through fine-grained compression
Jungrae Kim, Michael B. Sullivan, Seong-Lyong Gong, M. Erez
Because main memory is vulnerable to errors and failures, large-scale systems and critical servers utilize error checking and correcting (ECC) mechanisms to meet their reliability requirements. We propose a novel mechanism, Frugal ECC (FECC), that combines ECC with fine-grained compression to provide versatile protection that can be both stronger and lower-overhead than current schemes, without sacrificing performance. FECC compresses main memory at cache-block granularity, using any leftover space to store ECC information. Compressed data and its ECC information can then usually be read with a single access, even without redundant memory chips; insufficiently compressed blocks require additional storage and accesses. As examples, we present chipkill-correct ECCs on a non-ECC DIMM with x4 chips and the first true chipkill-correct ECC for x8 devices using an ECC DIMM. FECC relies on a new Coverage-oriented Compression that we developed specifically for the modest compression needs of ECC and for floating-point data.
DOI: 10.1145/2807591.2807659 (published 2015-11-15)
Citations: 43
Automatic sharing classification and timely push for cache-coherent systems
Malek Musleh, Vijay S. Pai
This paper proposes and evaluates Sharing/Timing Adaptive Push (STAP), a dynamic scheme for preemptively sending data from producers to consumers to minimize critical-path communication latency. STAP uses small hardware buffers to dynamically detect sharing patterns and timing requirements. The scheme applies to both intra-node and inter-socket directory-based shared memory networks. We integrate STAP into a MOESI cache-coherence (prefetching-enabled) protocol, using heuristics to detect different data sharing patterns, including broadcasts, producer/consumer, and migratory-data sharing. Using 15 benchmarks from the PARSEC and SPLASH-2 suites, we show that our scheme significantly reduces communication latency in NUMA systems and achieves an average 9% performance improvement, with at most 3% on-chip storage overhead.
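The pattern-detection step can be illustrated with a tiny history buffer that classifies a cache line from the writer/reader pairs it observes. This is a deliberately simplified sketch of the idea (a software analogue of the small hardware buffers), not STAP's actual heuristics; the class and labels are ours:

```python
from collections import deque

class SharingDetector:
    """Classify the sharing pattern of one cache line from a short
    history of (writer, reader) pairs, as a hardware buffer might."""
    def __init__(self, depth=4):
        self.history = deque(maxlen=depth)  # small, fixed-size buffer

    def observe(self, writer, reader):
        self.history.append((writer, reader))

    def classify(self):
        writers = {w for w, _ in self.history}
        readers = {r for _, r in self.history}
        if len(writers) == 1 and len(readers) == 1:
            return "producer/consumer"   # push to the single consumer
        if len(writers) == 1 and len(readers) > 1:
            return "broadcast"           # push to all observed readers
        return "migratory"               # ownership moves between cores

d = SharingDetector()
for _ in range(3):
    d.observe(writer=0, reader=1)
stable = d.classify()
d.observe(writer=0, reader=2)
widened = d.classify()
```

Once a line is classified, the push engine can forward fresh data to the predicted consumers before they miss on it, which is where the critical-path latency saving comes from.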
DOI: 10.1145/2807591.2807649 (published 2015-11-15)
Citations: 4
GraphBIG: understanding graph computing in the context of industrial solutions
Lifeng Nai, Yinglong Xia, Ilie Gabriel Tanase, Hyesoon Kim, Ching-Yung Lin
With the emergence of data science, graph computing is becoming a crucial tool for processing big connected data. Although efficient implementations of specific graph applications exist, the behavior of full-spectrum graph computing remains unknown. To understand graph computing, we must consider multiple graph computation types, graph frameworks, data representations, and various data sources in a holistic way. In this paper, we present GraphBIG, a benchmark suite inspired by the IBM System G project. To cover major graph computation types and data sources, GraphBIG selects representative data structures, workloads, and datasets from 21 real-world use cases across multiple application domains. We characterized GraphBIG on real machines and observed extremely irregular memory patterns and significantly diverse behavior across different computations. GraphBIG helps users understand the impact of modern graph computing on hardware architecture and enables future architecture and system research.
DOI: 10.1145/2807591.2807626 (published 2015-11-15)
Citations: 146
Understanding the propagation of transient errors in HPC applications
R. Ashraf, R. Gioiosa, Gokcen Kestor, R. Demara, Chen-Yong Cher, P. Bose
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined regarding how faults disseminate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.
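The basic experiment behind such frameworks, inject a single bit flip and count how many memory locations it has corrupted after some computation, can be sketched directly. This is our own minimal illustration of fault injection and propagation counting, not the paper's instrumentation:

```python
import struct

def flip_bit(x: float, bit: int) -> float:
    """Inject a transient error by flipping one bit of an IEEE-754 double."""
    (raw,) = struct.unpack("<Q", struct.pack("<d", x))
    (out,) = struct.unpack("<d", struct.pack("<Q", raw ^ (1 << bit)))
    return out

def corrupted_locations(clean, faulty):
    """Count memory locations whose value diverged after the injection."""
    return sum(1 for a, b in zip(clean, faulty) if a != b)

def stencil(v):
    # a toy 3-point computation through which the fault propagates
    return [v[i - 1] + v[i] + v[(i + 1) % len(v)] for i in range(len(v))]

clean = [1.0] * 8
faulty = clean[:]
faulty[3] = flip_bit(faulty[3], 52)   # flip the exponent LSB: 1.0 -> 0.5
n_bad = corrupted_locations(stencil(clean), stencil(faulty))
```

One flipped bit in one location corrupts three outputs after a single stencil step; iterating the computation shows the spreading rate the paper's models try to predict.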
DOI: 10.1145/2807591.2807670 (published 2015-11-15)
Citations: 81
Improving backfilling by using machine learning to predict running times
Éric Gaussier, David Glesser, Valentin Reis, D. Trystram
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever-increasing amount of data, they are characterized by uncertainty in parameters such as the job running times. The question raised in this work is: to what extent is it possible, and useful, to take predictions of job running times into account to improve global scheduling? We present a comprehensive study answering this question under the popular EASY backfilling policy. More precisely, we rely on classical machine learning methods and propose new cost functions well adapted to the problem. We then assess the proposed solutions through intensive simulations using several production logs. Finally, we propose a new scheduling algorithm that outperforms the popular EASY backfilling algorithm by 28% on the average bounded-slowdown objective.
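To see where predicted runtimes enter the picture, here is a heavily simplified sketch of the EASY backfill decision: a waiting job may start now if it fits in the free processors and, per its predicted runtime, finishes before the reservation held for the job at the head of the queue. Real EASY also admits jobs that use processors the head job's reservation does not need; this sketch omits that case, and the job fields and `predict` callback are ours:

```python
def easy_backfill(free_procs, head_reservation_time, now, waiting, predict):
    """Start every waiting job that fits in the free processors and whose
    predicted completion does not delay the head job's reservation."""
    started = []
    for job in waiting:
        fits = job["procs"] <= free_procs
        ends_in_time = now + predict(job) <= head_reservation_time
        if fits and ends_in_time:
            started.append(job["id"])
            free_procs -= job["procs"]
    return started

# 4 free processors; the queue head holds a reservation starting at t=10
waiting = [
    {"id": "a", "procs": 2, "pred": 5},
    {"id": "b", "procs": 4, "pred": 20},
    {"id": "c", "procs": 2, "pred": 8},
]
started = easy_backfill(free_procs=4, head_reservation_time=10, now=0,
                        waiting=waiting, predict=lambda j: j["pred"])
```

Since `ends_in_time` is evaluated on predictions rather than user-requested walltimes, better predictions directly translate into more (and safer) backfilling, which is the lever the paper exploits.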
DOI: 10.1145/2807591.2807646 (published 2015-11-15)
Citations: 88
Exploring network optimizations for large-scale graph analytics
Xinyu Que, Fabio Checconi, F. Petrini, Xing Liu, Daniele Buono
Graph analytics are arguably among the most demanding workloads for high-performance systems and interconnection networks. Graph applications often display all-to-all, fine-grained, high-rate communication patterns that expose the limits of the network protocol stacks. Load and communication imbalance generate hard-to-predict network hot spots, and unpredictable data distributions may require computational steering. In this paper we present a lightweight communication library, implemented "on the metal" of BlueGene/Q and POWER7 IH, which we have used to support large-scale graph algorithms on up to 96K processing nodes and 6 million threads. With this library we have explored several optimization techniques, including overlapped communication, non-blocking collectives, message aggregation, and in-network computation for special collective communication patterns such as parallel prefix. The experimental results show significant performance improvements, ranging from 5x to 10x, compared to equally optimized MPI implementations.
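Message aggregation, one of the optimizations listed above, amortizes per-message overhead by coalescing many fine-grained messages per destination and injecting them as one batch. The sketch below is our own illustration of the idea, not the paper's library; the class name and batch size are hypothetical:

```python
class Aggregator:
    """Coalesce fine-grained messages per destination and flush a batch
    when its buffer fills, trading latency for fewer network injections."""
    def __init__(self, send, batch=4):
        self.send, self.batch = send, batch
        self.buffers = {}

    def put(self, dest, msg):
        buf = self.buffers.setdefault(dest, [])
        buf.append(msg)
        if len(buf) >= self.batch:
            self.flush(dest)

    def flush(self, dest):
        if self.buffers.get(dest):
            self.send(dest, self.buffers.pop(dest))

sent = []
agg = Aggregator(lambda d, batch: sent.append((d, batch)), batch=3)
for i in range(7):
    agg.put(dest=i % 2, msg=i)   # alternate between two destinations
agg.flush(0)
agg.flush(1)
```

Seven fine-grained sends collapse into three network injections here; at graph-analytics message rates this kind of reduction is what keeps the protocol stack off the critical path.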
DOI: 10.1145/2807591.2807661 (published 2015-11-15)
Citations: 7
A work-efficient algorithm for parallel unordered depth-first search
Umut A. Acar, A. Charguéraud, Mike Rainey
Advances in processing power and memory technology have made multicore computers an important platform for high-performance graph-search (or graph-traversal) algorithms. Since the introduction of multicore, much progress has been made in improving parallel breadth-first search. However, less attention has been given to algorithms for unordered or loosely ordered traversals. We present a parallel algorithm for unordered depth-first search on graphs. We prove that the algorithm is work efficient in a realistic algorithmic model that accounts for important scheduling costs. This work-efficiency result applies to all graphs, including those with high diameter and high out-degree vertices. The algorithmic techniques behind this result include a new data structure for representing the frontier of vertices in depth-first search, a new amortization technique for controlling excess parallelism, and an adaptation of the lazy-splitting technique to depth-first search. We validate the theoretical results with an implementation and experiments. The experiments show that the algorithm performs well on a range of graphs and that it can lead to significant improvements over comparable algorithms.
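A per-worker frontier with a split operation is the central object here: a worker pops from one end for depth-first locality, and hands half of its pending vertices to an idle worker on request. The sketch below is a drastically simplified, sequential illustration of that interface, not the paper's data structure or its amortization scheme:

```python
class Frontier:
    """Per-worker frontier for unordered DFS: LIFO pop/push for locality,
    plus a split() that donates half the pending vertices to a thief."""
    def __init__(self, vertices):
        self.stack = list(vertices)

    def pop(self):
        return self.stack.pop()

    def push(self, v):
        self.stack.append(v)

    def split(self):
        mid = len(self.stack) // 2
        donated, self.stack = self.stack[:mid], self.stack[mid:]
        return Frontier(donated)

def unordered_dfs(graph, root):
    """Single-worker traversal; visitation order is depth-first-like but
    unordered, which is exactly what makes splitting legal."""
    visited, frontier = {root}, Frontier([root])
    while frontier.stack:
        v = frontier.pop()
        for w in graph[v]:
            if w not in visited:
                visited.add(w)
                frontier.push(w)
    return visited

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
reached = unordered_dfs(g, 0)
```

Because the traversal is unordered, donating the older half of the stack to a thief never violates correctness, only locality, which is the property the lazy-splitting technique exploits.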
DOI: 10.1145/2807591.2807651 (published 2015-11-15)
Citations: 20
Exploiting asynchrony from exact forward recovery for DUE in iterative solvers
Luc Jaulmes, Marc Casas, Miquel Moretó, E. Ayguadé, Jesús Labarta, M. Valero
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUE) by relying on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grained error detection, but become very powerful when the hardware is able to precisely detect errors. Relations straightforwardly extracted from the solver allow lost data to be recovered exactly. This method is free of the overheads of backward recoveries such as checkpointing, and does not compromise the mathematical convergence properties of the solver as restarting would. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES, and BiCGStab, and their preconditioned versions. We implement our resilience techniques on CG, considering scenarios from small (8 cores) to large (1024 cores) scales, and demonstrate very low overheads compared to state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by forcing them onto the critical path of the application. A trade-off exists between the two approaches, depending on the error rate the solver is suffering. Under realistic error rates, overlapping decreases overheads from 5.37% down to 3.59% for a non-preconditioned CG on 8 cores.
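One relation of the kind the recovery exploits is the CG residual invariant r = b - A x: if a page holding a block of the iterate x is lost, the invariant restricted to the lost rows yields a small linear system whose solution reproduces the missing entries exactly. The NumPy sketch below is our own illustration of that idea on a dense toy system, assuming the residual vector survives the error; function and variable names are ours:

```python
import numpy as np

def recover_x_block(A, b, r, x, lost):
    """Exact forward recovery of lost entries of the CG iterate x using
    the invariant r = b - A @ x, restricted to the lost rows."""
    keep = np.setdiff1d(np.arange(len(b)), lost)
    rhs = (b - r)[lost] - A[np.ix_(lost, keep)] @ x[keep]
    x = x.copy()
    x[lost] = np.linalg.solve(A[np.ix_(lost, lost)], rhs)
    return x

rng = np.random.default_rng(0)
A = rng.random((6, 6))
A = A + A.T + 6 * np.eye(6)       # symmetric, diagonally dominant system
b = rng.random(6)
x_k = rng.random(6)               # some CG iterate
r = b - A @ x_k                   # residual kept consistent by the solver

x_damaged = x_k.copy()
x_damaged[[1, 4]] = 0.0           # a "page" of x lost to a DUE
x_rec = recover_x_block(A, b, r, x_damaged, lost=np.array([1, 4]))
```

Because the recovery is exact rather than approximate, the solver's convergence theory is untouched, and since it only touches the lost rows it can run asynchronously, overlapped with the rest of the iteration, which is the trade-off the paper studies.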
DOI: 10.1145/2807591.2807599
Published: 2015-11-15, SC15: International Conference for High Performance Computing, Networking, Storage and Analysis
Cited: 33
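The core recovery idea in the abstract — rebuilding lost data exactly from relations the solver already maintains, for instance the CG invariant r = b − Ax — can be sketched as follows. This is an illustrative sketch, not the paper's implementation: the failure-injection point, the choice of which vector is lost, and the decision to reset the search direction after recovery are all assumptions made for the example.

```python
import numpy as np

def cg_with_forward_recovery(A, b, x0, tol=1e-10, max_iter=200, fail_at=50):
    """Conjugate Gradient with an illustrative exact forward recovery step.

    At iteration `fail_at` we simulate a detected-but-uncorrected error
    (DUE) that wipes the residual vector r, then rebuild it exactly from
    the solver invariant r = b - A x, instead of restarting the solve or
    rolling back to a checkpoint.
    """
    x = x0.copy()
    r = b - A @ x
    p = r.copy()
    rs_old = r @ r
    for k in range(max_iter):
        if k == fail_at:
            r = np.full_like(r, np.nan)  # simulated DUE: residual pages lost
            r = b - A @ x                # exact forward recovery from x (intact)
            rs_old = r @ r               # refresh the scalar that depends on r
            p = r.copy()                 # reset search direction (a simplification)
        Ap = A @ p
        alpha = rs_old / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs_old) * p
        rs_old = rs_new
    return x
```

Because the recovered residual is exact (not an approximation), the iteration continues with mathematically valid state, which is the property that lets the paper avoid checkpoint/restart overheads; resetting the direction vector here is a local simplification rather than part of the paper's scheme.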