We investigate an efficient parallelization of the most common iterative sparse tensor decomposition algorithms on distributed memory systems. A key operation in each iteration of these algorithms is the matricized tensor times Khatri-Rao product (MTTKRP). This operation amounts to element-wise vector multiplications and reductions whose structure depends on the sparsity of the tensor. We investigate fine- and coarse-grain task definitions for this operation, and propose hypergraph partitioning-based methods for these task definitions to achieve load balance as well as to reduce the communication requirements. We also design a distributed memory sparse tensor library, HyperTensor, which implements a well-known algorithm for the CANDECOMP/PARAFAC (CP) tensor decomposition using the task definitions and the associated partitioning methods. We use this library to test the proposed implementation of MTTKRP in the CP decomposition context, and report scalability results up to 1024 MPI ranks. We observed up to 194-fold speedups using 512 MPI processes on a well-known real-world dataset, and significantly better performance with respect to a state-of-the-art implementation.
"Scalable sparse tensor decompositions in distributed memory systems," by O. Kaya and B. Uçar. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807624.
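For concreteness, the MTTKRP kernel at the heart of these algorithms can be sketched for a third-order tensor in coordinate (COO) form. This is an illustrative pure-Python sketch under our own naming (`nnz`, `B`, `C`, `mttkrp_mode1`), not the HyperTensor implementation:

```python
# Mode-1 MTTKRP sketch for a sparse 3-way tensor stored as (i, j, k, value)
# coordinates. For each nonzero, a row of factor B and a row of factor C are
# multiplied element-wise and the result is reduced into a row of the output M.

def mttkrp_mode1(nnz, B, C, I, R):
    """nnz: iterable of (i, j, k, value); B, C: factor matrices as row lists.
    Returns M of shape I x R with M[i][r] += value * B[j][r] * C[k][r]."""
    M = [[0.0] * R for _ in range(I)]
    for i, j, k, v in nnz:
        for r in range(R):
            M[i][r] += v * B[j][r] * C[k][r]
    return M

# Tiny 2x2x2 example with two nonzeros and rank R = 2.
nnz = [(0, 1, 0, 2.0), (1, 0, 1, 3.0)]
B = [[1.0, 0.5], [2.0, 1.0]]
C = [[1.0, 1.0], [0.5, 2.0]]
M = mttkrp_mode1(nnz, B, C, I=2, R=2)  # -> [[4.0, 2.0], [1.5, 3.0]]
```

Roughly speaking, a fine-grain task definition treats individual nonzeros (iterations of the outer loop) as the units distributed across processes, while a coarse-grain one assigns whole output rows of M.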
The amount of data moved over the Internet per year has already exceeded the exabyte scale and will soon hit the zettabyte range. To support this massive data movement across the globe, the networking infrastructure as well as the source and destination nodes consume an immense amount of electric power, with an estimated cost measured in billions of dollars. Although a considerable amount of research has been done on power management techniques for the networking infrastructure, there has been little prior work on energy-aware data transfer algorithms that minimize the power consumed at the end systems. We introduce novel data transfer algorithms that aim to achieve high data transfer throughput while keeping the energy consumption during the transfers at minimal levels. Our experimental results show that our energy-aware data transfer algorithms can achieve up to 50% energy savings with the same or a higher level of data transfer throughput.
"Energy-aware data transfer algorithms," by I. Alan, Engin Arslan, and T. Kosar. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807628.
Jungrae Kim, Michael B. Sullivan, Seong-Lyong Gong, M. Erez
Because main memory is vulnerable to errors and failures, large-scale systems and critical servers utilize error checking and correcting (ECC) mechanisms to meet their reliability requirements. We propose a novel mechanism, Frugal ECC (FECC), that combines ECC with fine-grained compression to provide versatile protection that can be both stronger and lower overhead than current schemes, without sacrificing performance. FECC compresses main memory at cache-block granularity, using any leftover space to store ECC information. Compressed data and its ECC information can then frequently be read with a single access, even without redundant memory chips; insufficiently compressed blocks require additional storage and accesses. As examples, we present chipkill-correct ECCs on a non-ECC DIMM with x4 chips and the first true chipkill-correct ECC for x8 devices using an ECC DIMM. FECC relies on a new Coverage-oriented Compression that we developed specifically for the modest compression needs of ECC and for floating-point data.
"Frugal ECC: efficient and versatile memory error protection through fine-grained compression," by Jungrae Kim, Michael B. Sullivan, Seong-Lyong Gong, and M. Erez. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807659.
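The inline-versus-overflow decision that FECC makes per cache block can be illustrated with a deliberately simple zero-byte compressor. The block size, ECC byte budget, and compressor below are illustrative assumptions of ours, not FECC's actual Coverage-oriented Compression:

```python
BLOCK = 64   # cache-block size in bytes (assumption for illustration)
ECC = 8      # ECC bytes required per block (illustrative, not FECC's real code)

def compress_size(block):
    """Frequent-value (zero) compression sketch: a 1-bit-per-byte bitmap marks
    which bytes are zero, then the nonzero bytes follow verbatim.
    Returns the compressed size in bytes."""
    bitmap = BLOCK // 8                       # 8 bytes of bitmap for a 64B block
    nonzero = sum(1 for b in block if b != 0)
    return bitmap + nonzero

def fits_inline(block):
    """FECC-style decision: store the ECC inside the block if compression
    freed enough space; otherwise the block needs overflow storage and an
    extra access."""
    return compress_size(block) + ECC <= BLOCK

sparse_block = bytes(BLOCK - 10) + bytes(range(1, 11))  # mostly zero bytes
dense_block = bytes(range(1, BLOCK + 1))                # no zero bytes
# fits_inline(sparse_block) is True; fits_inline(dense_block) is False.
```

The key property this captures is that well-compressing blocks pay no extra storage or bandwidth for their ECC, while incompressible blocks fall back to additional accesses.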
This paper proposes and evaluates Sharing/Timing Adaptive Push (STAP), a dynamic scheme for preemptively sending data from producers to consumers to minimize critical-path communication latency. STAP uses small hardware buffers to dynamically detect sharing patterns and timing requirements. The scheme applies to both intra-node and inter-socket directory-based shared memory networks. We integrate STAP into a MOESI cache-coherence (prefetching-enabled) protocol using heuristics to detect different data sharing patterns, including broadcasts, producer/consumer sharing, and migratory-data sharing. Using 15 benchmarks from the PARSEC and SPLASH-2 suites, we show that our scheme significantly reduces communication latency in NUMA systems and achieves an average 9% performance improvement, with at most 3% on-chip storage overhead.
"Automatic sharing classification and timely push for cache-coherent systems," by Malek Musleh and Vijay S. Pai. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807649.
Lifeng Nai, Yinglong Xia, Ilie Gabriel Tanase, Hyesoon Kim, Ching-Yung Lin
With the emergence of data science, graph computing is becoming a crucial tool for processing big connected data. Although efficient implementations of specific graph applications exist, the behavior of full-spectrum graph computing remains unknown. To understand graph computing, we must consider multiple graph computation types, graph frameworks, data representations, and various data sources in a holistic way. In this paper, we present GraphBIG, a benchmark suite inspired by the IBM System G project. To cover major graph computation types and data sources, GraphBIG selects representative data structures, workloads, and datasets from 21 real-world use cases in multiple application domains. We characterized GraphBIG on real machines and observed extremely irregular memory access patterns and significantly diverse behavior across computations. GraphBIG helps users understand the impact of modern graph computing on hardware architecture and enables future architecture and system research.
"GraphBIG: understanding graph computing in the context of industrial solutions," by Lifeng Nai, Yinglong Xia, Ilie Gabriel Tanase, Hyesoon Kim, and Ching-Yung Lin. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807626.
R. Ashraf, R. Gioiosa, Gokcen Kestor, R. Demara, Chen-Yong Cher, P. Bose
Resiliency of exascale systems has quickly become an important concern for the scientific community. Despite its importance, much remains to be determined about how faults propagate and at what rate they impact HPC applications. Understanding where and how fast faults propagate could lead to more efficient implementations of application-driven error detection and recovery. In this work, we propose a fault propagation framework to analyze how faults propagate in MPI applications and to understand their vulnerability to faults. We employ a combination of compiler-level code transformation and instrumentation, along with a runtime checker. Using the information provided by our framework, we employ machine learning techniques to derive application fault propagation models that can be used to estimate the number of corrupted memory locations at runtime.
"Understanding the propagation of transient errors in HPC applications," by R. Ashraf, R. Gioiosa, Gokcen Kestor, R. Demara, Chen-Yong Cher, and P. Bose. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807670.
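A minimal sketch of the kind of experiment such a framework enables: inject a single transient bit flip into application state, then measure how many memory locations diverge from a fault-free (golden) run. The names and word granularity are our illustrative assumptions, not the paper's framework:

```python
import random

def inject_bitflip(words, seed=None):
    """Flip one random bit in a list of 64-bit integer words, mimicking a
    transient memory error. Returns (index, bit) so propagation can be
    tracked back to the injection site."""
    rng = random.Random(seed)
    idx = rng.randrange(len(words))
    bit = rng.randrange(64)
    words[idx] ^= 1 << bit
    return idx, bit

def count_corrupted(golden, faulty):
    """Crude propagation metric: the number of memory locations that differ
    from the fault-free (golden) run at the same point in execution."""
    return sum(1 for g, f in zip(golden, faulty) if g != f)

state = [0] * 8
idx, bit = inject_bitflip(state, seed=1)
# Immediately after injection exactly one location is corrupted; running the
# application further and re-measuring shows how the corruption spreads.
```

A fault propagation model, as in the paper, would be trained to predict how `count_corrupted` grows over time from features of the application and the injection site.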
Éric Gaussier, David Glesser, Valentin Reis, D. Trystram
The job management system is the HPC middleware responsible for distributing computing power to applications. While such systems generate an ever-increasing amount of data, they are characterized by uncertainty in parameters such as job running times. The question raised in this work is: to what extent is it possible and useful to take predictions of job running times into account to improve global scheduling? We present a comprehensive study answering this question under the popular EASY backfilling policy. More precisely, we rely on classical machine learning methods and propose new cost functions well adapted to the problem. We then assess the proposed solutions through intensive simulations using several production logs. Finally, we propose a new scheduling algorithm that outperforms the popular EASY backfilling algorithm by 28% on the average bounded slowdown objective.
"Improving backfilling by using machine learning to predict running times," by Éric Gaussier, David Glesser, Valentin Reis, and D. Trystram. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807646.
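A single EASY backfilling pass, using predicted rather than user-supplied runtimes, can be sketched as follows. The queue representation and time units are illustrative assumptions of ours:

```python
def easy_backfill(queue, free, reservation_start):
    """One EASY backfilling pass (sketch). The head-of-queue job is already
    reserved to start at `reservation_start`; a waiting job may jump ahead
    only if it fits in the currently free processors AND its *predicted*
    runtime finishes before the reservation, so the head job is never delayed.
    Predictions (e.g. learned from production logs, as in the paper) stand in
    for user-supplied walltimes here.
    queue: list of (procs, predicted_runtime), starting now at t = 0.
    Returns the indices of the jobs that get backfilled."""
    started = []
    now = 0.0
    for i, (procs, predicted) in enumerate(queue):
        if procs <= free and now + predicted <= reservation_start:
            started.append(i)   # job starts immediately on `procs` processors
            free -= procs
    return started

# 4 free processors, head job reserved at t = 10: jobs 0 and 3 fit both the
# processor hole and the time window; job 1 is too wide, job 2 too long.
started = easy_backfill([(2, 5.0), (4, 3.0), (1, 20.0), (2, 8.0)],
                        free=4, reservation_start=10.0)
```

Under-predictions are what make this risky in practice: a backfilled job that runs past its prediction can delay the reserved head job, which is why the paper's cost functions penalize prediction errors asymmetrically.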
Xinyu Que, Fabio Checconi, F. Petrini, Xing Liu, Daniele Buono
Graph analytics is arguably one of the most demanding workloads for high-performance systems and interconnection networks. Graph applications often display all-to-all, fine-grained, high-rate communication patterns that expose the limits of the network protocol stacks. Load and communication imbalance generate hard-to-predict network hot-spots, and may require computational steering due to unpredictable data distributions. In this paper we present a lightweight communication library, implemented "on the metal" of BlueGene/Q and POWER7 IH, that we have used to support large-scale graph algorithms on up to 96K processing nodes and 6 million threads. With this library we have explored several optimization techniques, including overlapped communication, non-blocking collectives, message aggregation, and in-network computation for special collective communication patterns, such as parallel prefix. The experimental results show significant performance improvements, ranging from 5X to 10X, compared to equally optimized MPI implementations.
"Exploring network optimizations for large-scale graph analytics," by Xinyu Que, Fabio Checconi, F. Petrini, Xing Liu, and Daniele Buono. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807661.
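Parallel prefix, one of the special collective patterns mentioned above, can be illustrated with the classic Hillis-Steele scan, which needs only O(log n) communication rounds across ranks. This pure-Python sketch simulates the rounds serially, with each list position standing in for a rank:

```python
def parallel_prefix(values):
    """Hillis-Steele inclusive prefix sum (sketch). In round `step`, rank i
    adds the value it receives from rank i - step; step doubles each round,
    so n ranks finish in ceil(log2(n)) rounds."""
    vals = list(values)
    step = 1
    while step < len(vals):
        prev = list(vals)                  # values held before this round
        for i in range(step, len(vals)):
            vals[i] = prev[i] + prev[i - step]
        step *= 2
    return vals

# Inclusive scan of [1, 2, 3, 4] is [1, 3, 6, 10], reached in 2 rounds.
result = parallel_prefix([1, 2, 3, 4])
```

Computing such a pattern "in the network," as the paper does, removes even these log-rounds of software messaging from the critical path.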
Advances in processing power and memory technology have made multicore computers an important platform for high-performance graph-search (or graph-traversal) algorithms. Since the introduction of multicore, much progress has been made on improving parallel breadth-first search. However, less attention has been given to algorithms for unordered or loosely ordered traversals. We present a parallel algorithm for unordered depth-first search on graphs. We prove that the algorithm is work efficient in a realistic algorithmic model that accounts for important scheduling costs. This work-efficiency result applies to all graphs, including those with high diameter and high out-degree vertices. The algorithmic techniques behind this result include a new data structure for representing the frontier of vertices in depth-first search, a new amortization technique for controlling excess parallelism, and an adaptation of the lazy-splitting technique to depth-first search. We validate the theoretical results with an implementation and experiments. The experiments show that the algorithm performs well on a range of graphs and that it can lead to significant improvements over comparable algorithms.
"A work-efficient algorithm for parallel unordered depth-first search," by Umut A. Acar, A. Charguéraud, and Mike Rainey. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807651.
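The core of an unordered DFS can be sketched with an explicit frontier: vertices are popped in LIFO order, but no visitation order is promised, which is what makes splitting the frontier between workers legal. This serial sketch (our own simplification, not the paper's data structure) marks where lazy splitting would hand work to an idle worker:

```python
def unordered_dfs(graph, root):
    """Unordered depth-first traversal sketch. `frontier` is a stack of
    discovered-but-unexplored vertices; because the traversal is unordered,
    a scheduler may split this frontier between workers at any time.
    graph: dict mapping vertex -> list of neighbors. Returns visited set."""
    visited = {root}
    frontier = [root]
    while frontier:
        # Lazy splitting would trigger here: when another worker is idle,
        # hand it frontier[:len(frontier) // 2] and keep the rest.
        v = frontier.pop()
        for w in graph[v]:
            if w not in visited:
                visited.add(w)
                frontier.append(w)
    return visited

g = {0: [1, 2], 1: [3], 2: [3], 3: []}
reached = unordered_dfs(g, 0)  # all four vertices are reachable from 0
```

The paper's contribution is precisely a frontier representation and amortization scheme that make such splits cheap enough to preserve work efficiency on all graphs.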
Luc Jaulmes, Marc Casas, Miquel Moretó, E. Ayguadé, Jesús Labarta, M. Valero
This paper presents a method to protect iterative solvers from Detected and Uncorrected Errors (DUEs), relying on error detection techniques already available in commodity hardware. Detection operates at the memory page level, which enables the use of simple algorithmic redundancies to correct errors. Such redundancies would be inapplicable under coarse-grain error detection, but become very powerful when the hardware can precisely detect errors. Relations straightforwardly extracted from the solver allow lost data to be recovered exactly. This method avoids the overheads of backward recoveries such as checkpointing, and does not compromise the mathematical convergence properties of the solver as restarting would. We apply this recovery to three widely used Krylov subspace methods, CG, GMRES, and BiCGStab, and their preconditioned versions. We implement our resilience techniques in CG, considering scenarios from small (8 cores) to large (1024 cores) scales, and demonstrate very low overheads compared to state-of-the-art solutions. We deploy our recovery techniques either by overlapping them with algorithmic computations or by placing them on the critical path of the application. A trade-off exists between the two approaches depending on the error rate the solver experiences. Under realistic error rates, overlapping decreases overheads from 5.37% down to 3.59% for a non-preconditioned CG on 8 cores.
"Exploiting asynchrony from exact forward recovery for DUE in iterative solvers," by Luc Jaulmes, Marc Casas, Miquel Moretó, E. Ayguadé, Jesús Labarta, and M. Valero. SC15: International Conference for High Performance Computing, Networking, Storage and Analysis, November 15, 2015. DOI: 10.1145/2807591.2807599.
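The exact forward recovery idea can be sketched for CG: the solver maintains the invariant r = b - Ax, so a lost entry of x can be recomputed from the surviving data. This sketch (our own simplification) recovers a single lost entry, i.e. a "page" of size one; recovering a larger lost block requires solving a correspondingly small linear system over the lost indices:

```python
def recover_lost_x(A, b, r, x, lost):
    """Exact forward recovery sketch for CG. From the invariant r = b - A x,
    row i gives: b[i] - r[i] = sum_j A[i][j] * x[j]. With a single lost
    index i (and lost entries not coupled to each other), solve for x[i].
    A: dense row-list matrix; b, r, x: vectors; lost: set of lost indices."""
    for i in lost:
        rhs = b[i] - r[i] - sum(A[i][j] * x[j]
                                for j in range(len(x)) if j not in lost)
        x[i] = rhs / A[i][i]
    return x

# Example: true solution state x = [1.0, 2.0] with consistent residual.
A = [[4.0, 1.0], [1.0, 3.0]]
b = [7.0, 8.0]
r = [1.0, 1.0]                 # r = b - A @ [1.0, 2.0]
damaged = [1.0, 0.0]           # entry x[1] lost to a detected memory error
recovered = recover_lost_x(A, b, r, damaged, {1})  # -> [1.0, 2.0]
```

Because the recovery rebuilds the exact lost values rather than rolling back, the iterate stays on the same convergence trajectory, which is what lets the paper overlap recovery with ongoing computation.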