Compression Speed Enhancements to LZO for Multi-core Systems
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.29
Jason Kane, Qing Yang
This paper examines several promising throughput enhancements to the Lempel-Ziv-Oberhumer (LZO) 1x-1-15 data compression algorithm. Of the many algorithm variants present in the current library version, 2.06, LZO 1x-1-15 is considered the fastest, geared toward speed rather than compression ratio. We present several algorithm modifications tailored to modern multi-core architectures, intended to increase compression speed while minimizing any loss in compression ratio. On average, the experimental results show that on a modern quad-core system, a 3.9x speedup in compression time is achieved over the baseline algorithm with no loss in compression ratio. Allowing for a 25% loss in compression ratio, up to a 5.4x speedup in compression time was observed.
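The abstract does not detail the specific modifications, but a common way to trade a small amount of compression ratio for multi-core throughput is to split the input into independent blocks and compress each block on its own thread. Below is a minimal C++ sketch along those lines against the stock liblzo2 API; the 1 MiB block size, thread-per-block pooling, and demo input are illustrative assumptions, not the paper's scheme.

```cpp
// Block-parallel LZO 1x-1-15 sketch (illustrative; not the paper's algorithm).
// Build: g++ -O2 -std=c++17 par_lzo.cpp -llzo2 -lpthread
#include <lzo/lzo1x.h>
#include <algorithm>
#include <cstdio>
#include <thread>
#include <vector>

struct Block {
    const unsigned char* src;
    lzo_uint src_len;
    std::vector<unsigned char> dst;
    lzo_uint dst_len;
};

static void compress_block(Block& b) {
    // Each thread needs its own LZO work memory.
    std::vector<unsigned char> wrk(LZO1X_1_15_MEM_COMPRESS);
    b.dst.resize(b.src_len + b.src_len / 16 + 64 + 3);  // documented worst case
    b.dst_len = b.dst.size();
    lzo1x_1_15_compress(b.src, b.src_len, b.dst.data(), &b.dst_len, wrk.data());
}

int main() {
    if (lzo_init() != LZO_E_OK) return 1;
    std::vector<unsigned char> input(16u << 20, 'A');  // demo data
    const lzo_uint kBlock = 1u << 20;                  // assumed 1 MiB blocks

    std::vector<Block> blocks;
    for (lzo_uint off = 0; off < input.size(); off += kBlock)
        blocks.push_back({input.data() + off,
                          std::min<lzo_uint>(kBlock, input.size() - off), {}, 0});

    // Independent blocks forfeit cross-block matches (a small ratio loss)
    // but let every core compress in parallel.
    std::vector<std::thread> pool;
    for (auto& b : blocks) pool.emplace_back(compress_block, std::ref(b));
    for (auto& t : pool) t.join();

    lzo_uint total = 0;
    for (auto& b : blocks) total += b.dst_len;
    std::printf("compressed %zu -> %zu bytes\n",
                (size_t)input.size(), (size_t)total);
}
```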
{"title":"Compression Speed Enhancements to LZO for Multi-core Systems","authors":"Jason Kane, Qing Yang","doi":"10.1109/SBAC-PAD.2012.29","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.29","url":null,"abstract":"This paper examines several promising throughput enhancements to the Lempel-Ziv-Oberhumer (LZO) 1x-1-15 data compression algorithm. Of many algorithm variants present in the current library version, 2.06, LZO 1x-1-15 is considered to be the fastest, geared toward speed rather than compression ratio. We present several algorithm modifications tailored to modern multi-core architectures in this paper that are intended to increase compression speed while minimizing any loss in compression ratio. On average, the experimental results show that on a modern quad core system, a 3.9x speedup in compression time is achieved over the baseline algorithm with no loss to compression ratio. Allowing for a 25% loss in compression ratio, up to a 5.4x speedup in compression time was observed.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"216 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121998658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An OS-Hypervisor Infrastructure for Automated OS Crash Diagnosis and Recovery in a Virtualized Environment
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.10
J. Jann, R. S. Burugula, Ching-Farn E. Wu, Kaoutar El Maghraoui
Recovering from OS crashes has traditionally been done using reboot or checkpoint-restart mechanisms. Such techniques either fail to preserve the state before the crash happens or require modifications to applications. To eliminate these problems, we present a novel OS-hypervisor infrastructure for automated OS crash diagnosis and recovery in virtual servers. Our approach uses a small hidden OS-repair-image that is dynamically created from the healthy running OS instance. Upon an OS crash, the hypervisor automatically loads this repair-image to perform diagnosis and repair. The offending process is then quarantined, and the fixed OS automatically resumes running without a reboot. Our experimental evaluations demonstrated that it takes less than 3 seconds to recover from an OS crash. This approach can significantly reduce downtime and maintenance costs in data centers. This is the first design and implementation of an OS-hypervisor combo capable of automatically resurrecting a crashed commercial server OS.
{"title":"An OS-Hypervisor Infrastructure for Automated OS Crash Diagnosis and Recovery in a Virtualized Environment","authors":"J. Jann, R. S. Burugula, Ching-Farn E. Wu, Kaoutar El Maghraoui","doi":"10.1109/SBAC-PAD.2012.10","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.10","url":null,"abstract":"Recovering from OS crashes has traditionally been done using reboot or checkpoint-restart mechanisms. Such techniques either fail to preserve the state before the crash happens or require modifications to applications. To eliminate these problems, we present a novel OS-hyper visor infrastructure for automated OS crash diagnosis and recovery in virtual servers. Our approach uses a small hidden OS-repair-image that is dynamically created from the healthy running OS instance. Upon an OS crash, the hyper visor automatically loads this repair-image to perform diagnosis and repair. The offending process is then quarantined, and the fixed OS automatically resumes running without a reboot. Our experimental evaluations demonstrated that it takes less than 3 seconds to recover from an OS crash. This approach can significantly reduce the downtime and maintenance costs in data centers. This is the first design and implementation of an OS-hyper visor combo capable of automatically resurrecting a crashed commercial server-OS.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126249316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy Savings via Dead Sub-Block Prediction
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.30
M. Alves, Khubaib, Eiman Ebrahimi, V. Narasiman, Carlos Villavieja, P. Navaux, Y. Patt
Cache memories have traditionally been designed to exploit spatial locality by fetching entire cache lines from memory upon a miss. However, recent studies have shown that often the number of sub-blocks within a line that are actually used is low. Furthermore, those sub-blocks that are used are accessed only a few times before becoming dead (i.e., never accessed again). This results in considerable energy waste since (1) data not needed by the processor is brought into the cache, and (2) data is kept alive in the cache longer than necessary. We propose the Dead Sub-Block Predictor (DSBP) to predict which sub-blocks of a cache line will actually be used and how many times each will be used, in order to bring into the cache only those sub-blocks that are necessary, and to power them off after they are touched the predicted number of times. We also use DSBP to identify dead lines (i.e., all sub-blocks off) and augment the existing replacement policy by prioritizing dead lines for eviction. Our results show a 24% energy reduction for the whole cache hierarchy when averaged over the SPEC2000, SPEC2006 and NAS-NPB benchmarks.
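The abstract does not specify DSBP's table organization, so the following C++ sketch is only illustrative of the mechanism: a predictor table, indexed here by the PC of the missing access (an assumption), remembers which sub-blocks of a line were touched and how often, so a later miss fetches only the predicted sub-blocks and powers each one off once its predicted use count is exhausted.

```cpp
// Illustrative dead sub-block predictor (not the paper's exact design).
#include <array>
#include <cstdint>
#include <unordered_map>

constexpr int kSubBlocks = 8;  // assumed: 64B line split into 8x8B sub-blocks

struct Prediction {
    std::array<uint8_t, kSubBlocks> use_count{};  // predicted touches per sub-block
};

struct LineState {
    std::array<uint8_t, kSubBlocks> remaining{};  // counts down toward "dead"
    std::array<bool, kSubBlocks> powered{};       // false once predicted dead
};

class DSBP {
    std::unordered_map<uint64_t, Prediction> table_;  // keyed by miss PC (assumption)
public:
    // On a miss, fetch (and power on) only the sub-blocks predicted useful.
    LineState on_miss(uint64_t pc) {
        LineState line;
        auto it = table_.find(pc);
        for (int s = 0; s < kSubBlocks; ++s) {
            uint8_t n = (it != table_.end()) ? it->second.use_count[s]
                                             : uint8_t{1};  // no history: fetch conservatively
            line.remaining[s] = n;
            line.powered[s] = (n > 0);
        }
        return line;
    }
    // On each sub-block access: train, count down, power off when exhausted.
    void on_access(LineState& line, int sub, uint64_t pc) {
        uint8_t& n = table_[pc].use_count[sub];
        if (n < 255) ++n;                       // saturating training counter
        if (line.remaining[sub] > 0 && --line.remaining[sub] == 0)
            line.powered[sub] = false;          // predicted dead: power off
        // A line with every sub-block off is a prime eviction victim.
    }
};
```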
{"title":"Energy Savings via Dead Sub-Block Prediction","authors":"M. Alves, Khubaib, Eiman Ebrahimi, V. Narasiman, Carlos Villavieja, P. Navaux, Y. Patt","doi":"10.1109/SBAC-PAD.2012.30","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.30","url":null,"abstract":"Cache memories have traditionally been designed to exploit spatial locality by fetching entire cache lines from memory upon a miss. However, recent studies have shown that often the number of sub-blocks within a line that are actually used is low. Furthermore, those sub-blocks that are used are accessed only a few times before becoming dead (i.e., never accessed again). This results in considerable energy waste since (1) data not needed by the processor is brought into the cache, and (2) data is kept alive in the cache longer than necessary. We propose the Dead Sub-Block Predictor (DSBP) to predict which sub-blocks of a cache line will be actually used and how many times it will be used in order to bring into the cache only those sub-blocks that are necessary, and power them off after they are touched the predicted number of times. We also use DSBP to identify dead lines (i.e., all sub-blocks off) and augment the existing replacement policy by prioritizing dead lines for eviction. Our results show a 24% energy reduction for the whole cache hierarchy when averaged over the SPEC2000, SPEC2006 and NAS-NPB benchmarks.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"46 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132264588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Thread Scheduling in Asymmetric Multicores for Power Efficiency
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.40
Rance Rodrigues, A. Annamalai, I. Koren, S. Kundu
The emergence of asymmetric multicore processors (AMPs) has elevated the problem of thread scheduling in such systems. The computing needs of a thread often vary during its execution (phases); hence, reassigning threads to cores (thread swapping) upon detection of such a change can significantly improve the AMP's power efficiency. Even though identifying a change in the resource requirements of a workload is straightforward, determining the thread reassignment is a challenge. Traditional online learning schemes rely on sampling to determine the best thread-to-core assignment in AMPs. However, as the number of cores in the multicore increases, the sampling overhead may be too large. In this paper, we propose a novel technique to dynamically assess the current thread-to-core assignment and determine whether swapping the threads between the cores will be beneficial and achieve a higher performance/Watt. This decision is based on estimating the expected performance and power of the current program phase on other cores. This estimation is done using the values of selected performance counters in the host core. By estimating the expected performance and power on each core type, informed thread scheduling decisions can be made while avoiding the overhead associated with sampling. We illustrate our approach using an 8-core high-performance/low-power AMP and show the performance/Watt benefits of the proposed dynamic thread scheduling technique. We compare our proposed scheme against previously published schemes based on online learning and two schemes based on the use of an oracle, one static and the other dynamic. Our results show that significant performance/Watt gains can be achieved through informed thread scheduling decisions in AMPs.
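As a concrete illustration of the decision step, the sketch below estimates each phase's performance and power on the other core type directly from host-core counters and swaps only when the projected aggregate performance/Watt improves. The counter set and the linear models with their coefficients are placeholder assumptions; the paper's actual estimators are not given in the abstract.

```cpp
// Sketch of the counter-driven swap decision for a big/little core pair.
// The estimators and all coefficients are placeholders, not the paper's.
#include <cstdio>

struct Counters {  // sampled on the host core during the current phase
    double ipc;
    double l2_mpki;      // L2 misses per kilo-instruction
    double branch_mpki;  // branch mispredictions per kilo-instruction
};

// Hypothetical cross-core models: predict how the same phase would run on
// the *other* core type from counters measured on the current one.
double ipc_on_big(const Counters& c)   { return 1.6 * c.ipc - 0.02 * c.l2_mpki; }
double ipc_on_small(const Counters& c) { return 0.7 * c.ipc - 0.01 * c.branch_mpki; }
double watts_big(double ipc)           { return 12.0 + 3.0 * ipc; }
double watts_small(double ipc)         { return 3.0 + 1.0 * ipc; }

// Swap the two threads when estimated aggregate performance/Watt improves;
// no sampling run on the other core is needed.
bool should_swap(const Counters& on_big, const Counters& on_small) {
    double now = on_big.ipc / watts_big(on_big.ipc) +
                 on_small.ipc / watts_small(on_small.ipc);
    double ipc_b2s = ipc_on_small(on_big);   // big-core thread moved to small core
    double ipc_s2b = ipc_on_big(on_small);   // small-core thread moved to big core
    double swapped = ipc_b2s / watts_small(ipc_b2s) + ipc_s2b / watts_big(ipc_s2b);
    return swapped > now;
}

int main() {
    Counters big{2.0, 5.0, 1.0}, small{0.4, 30.0, 4.0};
    std::printf("swap? %s\n", should_swap(big, small) ? "yes" : "no");
}
```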
{"title":"Scalable Thread Scheduling in Asymmetric Multicores for Power Efficiency","authors":"Rance Rodrigues, A. Annamalai, I. Koren, S. Kundu","doi":"10.1109/SBAC-PAD.2012.40","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.40","url":null,"abstract":"The emergence of asymmetric multicore processors(AMPs) has elevated the problem of thread scheduling in such systems. The computing needs of a thread often vary during its execution (phases) and hence, reassigning threads to cores(thread swapping) upon detection of such a change, can significantly improve the AMP's power efficiency. Even though identifying a change in the resource requirements of a workload is straightforward, determining the thread reassignment is a challenge. Traditional online learning schemes rely on sampling to determine the best thread to core in AMPs. However, as the number of cores in the multicore increases, the sampling overhead may be too large. In this paper, we propose a novel technique to dynamically assess the current thread to core assignment and determine whether swapping the threads between the cores will be beneficial and achieve a higher performance/Watt. This decision is based on estimating the expected performance and power of the current program phase on other cores. This estimation is done using the values of selected performance counters in the host core. By estimating the expected performance and power on each core type, informed thread scheduling decisions can be made while avoiding the overhead associated with sampling. We illustrate our approach using an 8-core high performance/low-power AMP and show the performance/Watt benefits of the proposed dynamic thread scheduling technique. We compare our proposed scheme against previously published schemes based on online learning and two schemes based on the use of an oracle, one static and the other dynamic. Our results show that significant performance/Watt gains can be achieved through informed thread scheduling decisions in AMPs.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129303976","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Parallel Exact Inference on Multicore Using MapReduce
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.43
N. Ma, Yinglong Xia, V. Prasanna
Inference is a key problem in exploring probabilistic graphical models for machine learning algorithms. Recently, many parallel techniques have been developed to accelerate inference. However, these techniques are not widely used due to their implementation complexity. MapReduce provides an appealing programming model that has been increasingly used to develop parallel solutions, though it has mainly been used for data-parallel applications. In this paper, we investigate the use of MapReduce for exact inference in Bayesian networks. MapReduce-based algorithms are proposed for evidence propagation in junction trees. We evaluate our methods on general-purpose multi-core machines using Phoenix as the underlying MapReduce runtime. The experimental results show that our methods achieve a 20x speedup on an Intel Westmere-EX based system.
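To make the formulation concrete, here is a deliberately tiny C++ stand-in for one bottom-up propagation step (Phoenix and real clique potentials omitted): the map phase has each leaf clique emit its separator marginal keyed by its parent, and the reduce phase multiplies all messages for a parent into that clique's potential. The binary separators and the three-leaf tree are illustrative assumptions.

```cpp
// Toy map/reduce view of one bottom-up evidence-propagation step in a
// junction tree (plain C++ stands in for the Phoenix runtime).
#include <array>
#include <cstdio>
#include <map>
#include <vector>

using Message = std::array<double, 2>;  // marginal over one binary separator

struct Clique {
    int id, parent;
    Message sep_marginal;  // clique potential already marginalized to the separator
};

int main() {
    // Three leaves, all children of the root clique (id 0).
    std::vector<Clique> leaves = {
        {1, 0, {0.3, 0.7}}, {2, 0, {0.6, 0.4}}, {3, 0, {0.5, 0.5}}};
    std::map<int, Message> potentials = {{0, {1.0, 1.0}}};  // root potential

    // Map: each leaf independently emits (parent_id, message).
    std::multimap<int, Message> emitted;
    for (const Clique& c : leaves)  // runs in parallel under a real runtime
        emitted.insert({c.parent, c.sep_marginal});

    // Reduce: multiply every message keyed to the same parent into its potential.
    for (const auto& [parent, msg] : emitted) {
        potentials[parent][0] *= msg[0];
        potentials[parent][1] *= msg[1];
    }
    std::printf("root potential: {%.3f, %.3f}\n",
                potentials[0][0], potentials[0][1]);
}
```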
{"title":"Parallel Exact Inference on Multicore Using MapReduce","authors":"N. Ma, Yinglong Xia, V. Prasanna","doi":"10.1109/SBAC-PAD.2012.43","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.43","url":null,"abstract":"Inference is a key problem in exploring probabilistic graphical models for machine learning algorithms. Recently, many parallel techniques have been developed to accelerate inference. However, these techniques are not widely used due to their implementation complexity. MapReduce provides an appealing programming model that has been increasingly used to develop parallel solutions. MapReduce though has been mainly used for data parallel applications. In this paper, we investigate the use of MapReduce for exact inference in Bayesian networks. MapReduce based algorithms are proposed for evidence propagation in junction trees. We evaluate our methods on general-purpose multi-core machines using Phoenix as the underlying MapReduce runtime. The experimental results show that our methods achieve 20x speedup on an Intel West mere-EX based system.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133628879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cloud Workload Analysis with SWAT
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.13
M. Breternitz, Keith Lowery, Anton Charnoff, Patryk Kamiński, Leonardo Piga
This note describes the Synthetic Workload Application Toolkit (SWAT) and presents the results from a set of experiments on some key cloud workloads. SWAT is a software platform that automates the creation, deployment, provisioning, execution, and (most importantly) data gathering of synthetic compute workloads on clusters of arbitrary size. SWAT collects and aggregates data from application execution logs, operating system call interfaces, and microarchitecture-specific performance counters. The data collected by SWAT are used to characterize the effects of network traffic, file I/O, and computation on program performance. The output is analyzed to provide insight into the design and deployment of cloud workloads and systems. Each workload is characterized according to its scalability with the number of server nodes and Hadoop server jobs, its sensitivity to network characteristics (bandwidth, latency, statistics on packet size), and its computation vs. I/O intensity, as these values are adjusted via workload-specific parameters. (In the future, we will use SWAT's benchmark synthesizer capability.) We also report microarchitectural characteristics that give insight into processor microarchitectures better suited to this class of workloads. We contrast our results with prior work on CloudSuite [5], validating some conclusions and providing further insight into others. This illustrates SWAT's data collection capabilities and its usefulness for gaining insight into cloud applications and systems.
{"title":"Cloud Workload Analysis with SWAT","authors":"M. Breternitz, Keith Lowery, Anton Charnoff, Patryk Kamiński, Leonardo Piga","doi":"10.1109/SBAC-PAD.2012.13","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.13","url":null,"abstract":"This note describes the Synthetic Workload Application Toolkit (SWAT) and presents the results from a set of experiments on some key cloud workloads. SWAT is a software platform that automates the creation, deployment, provisioning, execution, and (most importantly) data gathering of synthetic compute workloads on clusters of arbitrary size. SWAT collects and aggregates data from application execution logs, operating system call interfaces, and micro architecture-specific program counters. The data collected by SWAT are used to characterize the effects of network traffic, file I/O, and computation on program performance. The output is analyzed to provide insight into the design and deployment of cloud workloads and systems. Each workload is characterized according to its scalability with the number of server nodes and Hadoop server jobs, sensitivity to network characteristics (bandwidth, latency, statistics on packet size), and computation vs. I/O intensity as these values adjusted via workload-specific parameters. (In the future, we will use SWAT's benchmark synthesizer capability.) We also characterize micro-architectural characteristics that give insight on the micro architecture of processors better suited for this class of workloads. We contrast our results with prior work on Cloud Suite [5], validating some conclusions and providing further insight into others. This illustrates SWAT's data collection capabilities and usefulness to obtain insight on cloud applications and systems.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"85 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115431013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Scalable Algorithms for Distributed-Memory Adaptive Mesh Refinement
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.48
Akhil Langer, J. Lifflander, P. Miller, K. Pan, L. Kalé, P. Ricker
This paper presents scalable algorithms and data structures for adaptive mesh refinement computations. We describe a novel mesh restructuring algorithm for adaptive mesh refinement computations that uses a constant number of collectives regardless of the refinement depth. To further increase scalability, we describe a localized, hierarchical, coordinate-based block indexing scheme, in contrast to traditional linear numbering schemes, which incur unnecessary synchronization. In contrast to existing approaches, which take O(P) time and storage per process, our approach takes only constant time and has a very small memory footprint. With these optimizations as well as an efficient mapping scheme, our algorithm is scalable and suitable for large, highly refined meshes. We present strong-scaling experiments up to 2k ranks on Cray XK6 and 32k ranks on IBM Blue Gene/Q.
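For flavor, a hierarchical block index of the kind described can be built by appending child bits per refinement level, so a block's ID, parent, and depth are all computable locally with no collective renumbering. The quadtree encoding below is a sketch under that assumption (an octree would append 3 bits per level); it is not the paper's exact scheme.

```cpp
// Sketch of local, hierarchical block indexing for a quadtree AMR mesh.
#include <cstdint>
#include <cstdio>

// Encode depth in the ID by seeding with a leading 1 bit, then append one
// 2-bit child coordinate per refinement level.
constexpr uint64_t kRoot = 1;

uint64_t child_of(uint64_t block, int quadrant /*0..3*/) {
    return (block << 2) | static_cast<uint64_t>(quadrant);
}
uint64_t parent_of(uint64_t block) { return block >> 2; }
int depth_of(uint64_t block) {
    int d = -1;
    while (block) { block >>= 2; ++d; }
    return d;
}

int main() {
    // Refine twice, purely locally: no other rank needs to be consulted.
    uint64_t b = child_of(child_of(kRoot, 3), 1);
    std::printf("block %llx: depth %d, parent %llx\n",
                (unsigned long long)b, depth_of(b),
                (unsigned long long)parent_of(b));
}
```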
{"title":"Scalable Algorithms for Distributed-Memory Adaptive Mesh Refinement","authors":"Akhil Langer, J. Lifflander, P. Miller, K. Pan, L. Kalé, P. Ricker","doi":"10.1109/SBAC-PAD.2012.48","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.48","url":null,"abstract":"This paper presents scalable algorithms and data structures for adaptive mesh refinement computations. We describe a novel mesh restructuring algorithm for adaptive mesh refinement computations that uses a constant number of collectives regardless of the refinement depth. To further increase scalability, we describe a localized hierarchical coordinate-based block indexing scheme in contrast to traditional linear numbering schemes, which incur unnecessary synchronization. In contrast to the existing approaches which take O(P) time and storage per process, our approach takes only constant time and has very small memory footprint. With these optimizations as well as an efficient mapping scheme, our algorithm is scalable and suitable for large, highly-refined meshes. We present strong-scaling experiments up to 2k ranks on Cray XK6, and 32k ranks on IBM Blue Gene/Q.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"16 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123886781","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CSHARP: Coherence and SHaring Aware Cache Replacement Policies for Parallel Applications
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.27
Biswabandan Panda, S. Balachandran
Parallel applications are becoming mainstream, and architectural techniques for multicores that target these applications are the need of the hour. Sharing of data by multiple threads and issues due to data coherence are unique to parallel applications. We propose CSHARP, a hardware framework that brings coherence and sharing awareness to any shared last-level cache replacement policy. We use the degree of sharing of cache lines and the information present in coherence vectors to make replacement decisions. We apply CSHARP to a state-of-the-art cache replacement policy called TA-DRRIP to show its effectiveness. Our experiments on a simulated four-core system show that applying CSHARP to TA-DRRIP gives an extra 10% reduction in miss rate at the LLC. Compared to the LRU policy, CSHARP on TA-DRRIP shows an 18% miss-rate reduction and a 7% performance boost. We also show the scalability of our proposal by studying the hardware overhead and performance on an 8-core system.
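The sketch below shows one plausible way to layer sharing awareness onto an RRIP-style victim search: among the distant-reuse candidates, evict the line cached by the fewest cores, since evicting a widely shared line penalizes several threads at once. The structure sizes and the tie-breaking rule are assumptions; the real CSHARP/TA-DRRIP hardware is not reproduced here.

```cpp
// Sharing-aware victim selection layered on an RRIP-style baseline
// (illustrative; not the exact CSHARP/TA-DRRIP hardware).
#include <bitset>
#include <cstddef>
#include <cstdint>
#include <vector>

struct Line {
    bool valid = false;
    uint8_t rrpv = 3;        // 2-bit re-reference prediction value (3 = distant)
    std::bitset<8> sharers;  // coherence vector: which cores hold the line
};

int pick_victim(std::vector<Line>& set) {
    for (int w = 0; w < (int)set.size(); ++w)
        if (!set[w].valid) return w;  // free way first
    for (;;) {
        // Among distant-reuse lines (max RRPV), prefer the least-shared one:
        // evicting a widely shared line hurts several cores at once.
        int victim = -1;
        std::size_t fewest = 9;  // more than any possible sharers.count()
        for (int w = 0; w < (int)set.size(); ++w)
            if (set[w].rrpv == 3 && set[w].sharers.count() < fewest) {
                fewest = set[w].sharers.count();
                victim = w;
            }
        if (victim >= 0) return victim;
        for (auto& l : set) ++l.rrpv;  // nobody at max RRPV: age all, as in RRIP
    }
}
```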
{"title":"CSHARP: Coherence and SHaring Aware Cache Replacement Policies for Parallel Applications","authors":"Biswabandan Panda, S. Balachandran","doi":"10.1109/SBAC-PAD.2012.27","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.27","url":null,"abstract":"Parallel applications are becoming mainstream and architectural techniques for multicores that target these applications are the need of the hour. Sharing of data by multiple threads and issues due to data coherence are unique to parallel applications. We propose CSHARP, a hardware framework that brings coherence and sharing awareness to any shared last level cache replacement policy. We use the degree of sharing of cache lines and the information present in coherence vectors to make replacement decisions. We apply CSHARP to a state-of-the-art cache replacement policy called TA-DRRIP to show its effectiveness. Our experiments on four core simulated system show that applying CSHARP on TA-DRRIP gives an extra 10% reduction in miss-rate at the LLC. Compared to LRU policy, CSHARP on TA-DRRIP shows a 18% miss-rate reduction and a 7% performance boost. We also show the scalability of our proposal by studying the hardware overhead and performance on a 8-core system.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117184266","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating Dataflow Abstractions into the Shared Memory Model
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.24
Vladimir Gajinov, Srdjan Stipic, O. Unsal, T. Harris, E. Ayguadé, A. Cristal
In this paper we present the Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ that integrates dataflow abstractions into the shared memory programming model. The ADF model provides pragma directives that allow a programmer to organize a program into a set of tasks and to explicitly define input data for each task. The task dependency information is conveyed to the ADF runtime system, which constructs the dataflow task graph and builds the necessary infrastructure for dataflow execution. Additionally, the ADF model allows tasks to share data. The key idea is that computation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory, which guarantees atomicity of shared memory updates. We show examples that illustrate how the programmability of shared memory can be improved using the ADF model. Moreover, our evaluation shows that the ADF model performs well in comparison with programs parallelized using OpenMP and transactional memory.
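ADF's own pragma syntax is not shown in the abstract; as a rough analogy using standard OpenMP, `depend` clauses capture the dataflow-triggering half of the model, while an atomic region stands in for the transactional update to shared state:

```cpp
// OpenMP analogy for ADF (the real ADF pragmas differ): "depend" clauses
// express dataflow triggering; the atomic update approximates ADF's
// transactional access to shared mutable state.
// Build: g++ -fopenmp adf_analogy.cpp
#include <cstdio>

int main() {
    int x = 0, y = 0;
    long shared_total = 0;  // mutable state shared across tasks

    #pragma omp parallel
    #pragma omp single
    {
        #pragma omp task depend(out: x)  // producer task
        x = 40;
        #pragma omp task depend(out: y)  // producer task
        y = 2;
        #pragma omp task depend(in: x, y)  // fires once both inputs are ready
        {
            #pragma omp atomic  // stand-in for a transactional update
            shared_total += x + y;
        }
    }  // implicit barrier: all tasks complete here
    std::printf("shared_total = %ld\n", shared_total);
}
```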
{"title":"Integrating Dataflow Abstractions into the Shared Memory Model","authors":"Vladimir Gajinov, Srdjan Stipic, O. Unsal, T. Harris, E. Ayguadé, A. Cristal","doi":"10.1109/SBAC-PAD.2012.24","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.24","url":null,"abstract":"In this paper we present Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ which integrates dataflow abstractions into the shared memory programming model. The ADF model provides pragma directives that allow a programmer to organize a program into a set of tasks and to explicitly define input data for each task. The task dependency information is conveyed to the ADF runtime system which constructs the dataflow task graph and builds the necessary infrastructure for dataflow execution. Additionally, the ADF model allows tasks to share data. The key idea is that computation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory which guarantees atomicity of shared memory updates. We show examples that illustrate how the programmability of shared memory can be improved using the ADF model. Moreover, our evaluation shows that the ADF model performs well in comparison with programs parallelized using OpenMP and transactional memory.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"134066858","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems
Pub Date: 2012-10-24 | DOI: 10.1109/SBAC-PAD.2012.12
Esteban Meneses, O. Sarood, L. Kalé
An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large number of components it will encompass. Some form of fault tolerance has to be incorporated in the system to keep the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. Power has two dimensions. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power matters because of energy consumption: an exascale installation will have to pay a large energy bill. It is therefore fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message logging, and parallel recovery. Using programs from different programming models, we show that parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scale using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and to reduce energy consumption by 13% when compared to checkpoint/restart.
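As a hedged illustration of the kind of analytical comparison involved, the first-order model below charges both protocols the same checkpointing overhead but lets parallel recovery replay lost work across many cores; all constants (checkpoint period and cost, MTBF, machine power, replay speedup) are made-up inputs, not the paper's measurements or model.

```cpp
// First-order model (not the paper's) contrasting checkpoint/restart with
// parallel recovery. Energy is modeled as average power times runtime, so
// whichever protocol finishes sooner also burns less energy here.
#include <cstdio>

int main() {
    const double W       = 24 * 3600.0;  // failure-free work (s)
    const double tau     = 600.0;        // checkpoint period (s, assumed)
    const double delta   = 30.0;         // cost per checkpoint (s, assumed)
    const double mtbf    = 4 * 3600.0;   // system MTBF (s, assumed)
    const double P       = 1.0e6;        // average machine power (W, assumed)
    const double speedup = 8.0;          // parallel-recovery replay speedup (assumed)

    double failures  = W / mtbf;           // expected number of failures
    double ckpt_cost = (W / tau) * delta;  // total checkpointing overhead

    double rework_cr  = failures * (tau / 2.0);            // serial replay of lost work
    double rework_par = failures * (tau / 2.0) / speedup;  // replayed in parallel

    double t_cr  = W + ckpt_cost + rework_cr;
    double t_par = W + ckpt_cost + rework_par;
    std::printf("checkpoint/restart: %.0f s, %.3e J\n", t_cr, t_cr * P);
    std::printf("parallel recovery : %.0f s, %.3e J\n", t_par, t_par * P);
}
```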
{"title":"Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems","authors":"Esteban Meneses, O. Sarood, L. Kalé","doi":"10.1109/SBAC-PAD.2012.12","DOIUrl":"https://doi.org/10.1109/SBAC-PAD.2012.12","url":null,"abstract":"An exascale machine is expected to be delivered in the time frame 2018-2020. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large amount of components it will encompass. Some form of fault tolerance has to be incorporated in the system to maintain the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions of power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging and parallel recovery. Using programs from different programming models, we show parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and reduce the energy consumption by 13% when compared to checkpoint/restart.","PeriodicalId":232444,"journal":{"name":"2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2012-10-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131993281","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}