
2012 IEEE 24th International Symposium on Computer Architecture and High Performance Computing: Latest Publications

Compression Speed Enhancements to LZO for Multi-core Systems
Jason Kane, Qing Yang
This paper examines several promising throughput enhancements to the Lempel-Ziv-Oberhumer (LZO) 1x-1-15 data compression algorithm. Of the many algorithm variants present in the current library version, 2.06, LZO 1x-1-15 is considered the fastest, geared toward speed rather than compression ratio. In this paper, we present several algorithm modifications tailored to modern multi-core architectures that are intended to increase compression speed while minimizing any loss in compression ratio. On average, the experimental results show that on a modern quad-core system, a 3.9x speedup in compression time is achieved over the baseline algorithm with no loss in compression ratio. Allowing for a 25% loss in compression ratio, up to a 5.4x speedup in compression time was observed.
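As a concrete illustration of exploiting multiple cores for compression, the sketch below splits the input into per-thread chunks and compresses each chunk with the stock lzo1x_1_15_compress routine from liblzo2, giving each thread its own work memory. This is only the generic block-parallel pattern under assumed parameters (four threads, toy input), not the authors' modified algorithm; note also that independently compressed chunks would need framing before they could be decompressed as one stream.

```cpp
// Block-parallel compression with stock LZO 1x-1-15 (liblzo2): a sketch
// of the generic pattern, not the paper's modified algorithm.
#include <lzo/lzo1x.h>
#include <cstddef>
#include <cstdio>
#include <functional>
#include <thread>
#include <vector>

struct Chunk {
    const unsigned char* src;
    lzo_uint src_len;
    std::vector<unsigned char> dst;  // worst-case sized output buffer
    lzo_uint dst_len = 0;
};

static void compress_chunk(Chunk& c) {
    // Each thread needs private LZO work memory.
    std::vector<unsigned char> wrk(LZO1X_1_15_MEM_COMPRESS);
    c.dst.resize(c.src_len + c.src_len / 16 + 64 + 3);  // LZO worst case
    c.dst_len = c.dst.size();
    if (lzo1x_1_15_compress(c.src, c.src_len, c.dst.data(),
                            &c.dst_len, wrk.data()) != LZO_E_OK)
        c.dst_len = 0;
}

int main() {
    if (lzo_init() != LZO_E_OK) return 1;
    std::vector<unsigned char> input(1 << 24, 'a');  // 16 MiB toy input
    const std::size_t nthreads = 4;                  // assumed core count
    const std::size_t block = input.size() / nthreads;

    std::vector<Chunk> chunks(nthreads);
    std::vector<std::thread> pool;
    for (std::size_t i = 0; i < nthreads; ++i) {
        chunks[i].src = input.data() + i * block;
        chunks[i].src_len = block;
        pool.emplace_back(compress_chunk, std::ref(chunks[i]));
    }
    for (auto& t : pool) t.join();

    lzo_uint total = 0;
    for (auto& c : chunks) total += c.dst_len;
    std::printf("compressed %zu -> %lu bytes\n",
                input.size(), (unsigned long)total);
}
```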
Citations: 15
An OS-Hypervisor Infrastructure for Automated OS Crash Diagnosis and Recovery in a Virtualized Environment
J. Jann, R. S. Burugula, Ching-Farn E. Wu, Kaoutar El Maghraoui
Recovering from OS crashes has traditionally been done using reboot or checkpoint-restart mechanisms. Such techniques either fail to preserve the state before the crash happens or require modifications to applications. To eliminate these problems, we present a novel OS-hypervisor infrastructure for automated OS crash diagnosis and recovery in virtual servers. Our approach uses a small hidden OS-repair-image that is dynamically created from the healthy running OS instance. Upon an OS crash, the hypervisor automatically loads this repair-image to perform diagnosis and repair. The offending process is then quarantined, and the fixed OS automatically resumes running without a reboot. Our experimental evaluations demonstrated that it takes less than 3 seconds to recover from an OS crash. This approach can significantly reduce downtime and maintenance costs in data centers. This is the first design and implementation of an OS-hypervisor combination capable of automatically resurrecting a crashed commercial server OS.
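The recovery sequence the abstract describes can be summarized as a small state flow. The sketch below is illustrative only: load_repair_image, find_offending_pid, and quarantine are hypothetical stand-ins for hypervisor and repair-image internals that the paper does not expose here.

```cpp
// Illustrative recovery flow only; all functions are hypothetical
// stand-ins, not real hypervisor interfaces.
#include <cstdio>

enum class OsState { Running, Crashed, Repaired };

bool load_repair_image() { return true; }   // hypervisor loads hidden image
int  find_offending_pid() { return 4242; }  // diagnosis in the repair image
void quarantine(int pid) { std::printf("quarantined pid %d\n", pid); }

OsState recover(OsState s) {
    if (s != OsState::Crashed) return s;
    if (!load_repair_image()) return s;      // would fall back to a reboot
    quarantine(find_offending_pid());        // isolate the faulty process
    return OsState::Repaired;                // OS resumes without rebooting
}

int main() {
    OsState s = recover(OsState::Crashed);
    std::printf("recovered: %s\n", s == OsState::Repaired ? "yes" : "no");
}
```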
Citations: 2
Energy Savings via Dead Sub-Block Prediction
M. Alves, Khubaib, Eiman Ebrahimi, V. Narasiman, Carlos Villavieja, P. Navaux, Y. Patt
Cache memories have traditionally been designed to exploit spatial locality by fetching entire cache lines from memory upon a miss. However, recent studies have shown that the number of sub-blocks within a line that are actually used is often low. Furthermore, those sub-blocks that are used are accessed only a few times before becoming dead (i.e., never accessed again). This results in considerable energy waste since (1) data not needed by the processor is brought into the cache, and (2) data is kept alive in the cache longer than necessary. We propose the Dead Sub-Block Predictor (DSBP) to predict which sub-blocks of a cache line will actually be used and how many times each will be used, in order to bring into the cache only those sub-blocks that are necessary and power them off after they are touched the predicted number of times. We also use DSBP to identify dead lines (i.e., lines with all sub-blocks off) and augment the existing replacement policy by prioritizing dead lines for eviction. Our results show a 24% energy reduction for the whole cache hierarchy when averaged over the SPEC2000, SPEC2006 and NAS-NPB benchmarks.
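A minimal sketch of the per-line bookkeeping this implies is shown below: each line keeps per-sub-block alive bits and remaining-use counters, a (stubbed-out) predictor decides on a miss which sub-blocks to fetch and for how many uses, and a line with every sub-block off is a preferred eviction victim. The sub-block count and predicted-use values are assumptions for illustration.

```cpp
// Dead sub-block bookkeeping sketch; the real predictor table (indexed
// by the missing PC in the paper) is stubbed out here.
#include <array>
#include <cstdint>
#include <cstdio>

constexpr int kSubBlocks = 8;   // e.g., a 64B line split into 8B sub-blocks

struct LineState {
    std::array<bool,    kSubBlocks> alive{};      // sub-block powered on?
    std::array<uint8_t, kSubBlocks> uses_left{};  // predicted touches left
};

// Stub predictor: fetch sub-blocks 0..3 with 2 predicted uses each.
void predict_on_miss(LineState& l) {
    for (int i = 0; i < 4; ++i) { l.alive[i] = true; l.uses_left[i] = 2; }
}

void on_access(LineState& l, int sb) {
    if (!l.alive[sb]) { l.alive[sb] = true; l.uses_left[sb] = 1; }  // mispredict: refetch
    else if (--l.uses_left[sb] == 0) l.alive[sb] = false;           // predicted dead: power off
}

bool line_is_dead(const LineState& l) {
    for (bool a : l.alive) if (a) return false;
    return true;   // dead lines become preferred eviction victims
}

int main() {
    LineState l;
    predict_on_miss(l);
    on_access(l, 0); on_access(l, 0);   // sub-block 0 used its 2 touches
    std::printf("sub-block 0 alive: %d, line dead: %d\n",
                (int)l.alive[0], (int)line_is_dead(l));
}
```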
Citations: 19
Scalable Thread Scheduling in Asymmetric Multicores for Power Efficiency
Rance Rodrigues, A. Annamalai, I. Koren, S. Kundu
The emergence of asymmetric multicore processors (AMPs) has elevated the importance of thread scheduling in such systems. The computing needs of a thread often vary during its execution (phases); hence, reassigning threads to cores (thread swapping) upon detection of such a change can significantly improve the AMP's power efficiency. Even though identifying a change in the resource requirements of a workload is straightforward, determining the thread reassignment is a challenge. Traditional online learning schemes rely on sampling to determine the best thread-to-core assignment in AMPs. However, as the number of cores in the multicore increases, the sampling overhead may be too large. In this paper, we propose a novel technique to dynamically assess the current thread-to-core assignment and determine whether swapping the threads between the cores will be beneficial and achieve a higher performance/Watt. This decision is based on estimating the expected performance and power of the current program phase on other cores. This estimation is done using the values of selected performance counters in the host core. By estimating the expected performance and power on each core type, informed thread scheduling decisions can be made while avoiding the overhead associated with sampling. We illustrate our approach using an 8-core high-performance/low-power AMP and show the performance/Watt benefits of the proposed dynamic thread scheduling technique. We compare our proposed scheme against previously published schemes based on online learning, and against two schemes based on the use of an oracle, one static and the other dynamic. Our results show that significant performance/Watt gains can be achieved through informed thread scheduling decisions in AMPs.
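The sketch below illustrates the counter-driven estimation idea under invented assumptions: a simple linear model maps counters sampled on the host core to estimated IPC and power on each core type, and the thread is placed wherever the estimated IPC per Watt is higher. The model form and coefficients are made up for illustration and are not taken from the paper.

```cpp
// Cross-core performance/power estimation sketch; all coefficients are
// invented for illustration.
#include <cstdio>

struct Counters { double ipc, l2_mpki, branch_mpki; };  // host-core samples
struct Estimate { double ipc, watts; };

// Hypothetical models: a big core hides misses better but burns more power.
Estimate estimate_on_big(const Counters& c) {
    return { c.ipc * 1.6 - 0.02 * c.l2_mpki, 20.0 + 2.0 * c.ipc };
}
Estimate estimate_on_little(const Counters& c) {
    return { c.ipc * 0.7 - 0.04 * c.l2_mpki, 5.0 + 0.5 * c.ipc };
}

int main() {
    Counters phase{1.2, 10.0, 4.0};          // sampled on the host core
    Estimate big = estimate_on_big(phase);
    Estimate lit = estimate_on_little(phase);
    double big_ppw = big.ipc / big.watts;    // performance per Watt
    double lit_ppw = lit.ipc / lit.watts;
    std::printf("big %.3f IPC/W, little %.3f IPC/W -> run on %s core\n",
                big_ppw, lit_ppw, big_ppw > lit_ppw ? "big" : "little");
}
```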
Citations: 19
Parallel Exact Inference on Multicore Using MapReduce
N. Ma, Yinglong Xia, V. Prasanna
Inference is a key problem in exploring probabilistic graphical models for machine learning algorithms. Recently, many parallel techniques have been developed to accelerate inference. However, these techniques are not widely used due to their implementation complexity. MapReduce provides an appealing programming model that has been increasingly used to develop parallel solutions, though it has mainly been applied to data-parallel applications. In this paper, we investigate the use of MapReduce for exact inference in Bayesian networks. MapReduce-based algorithms are proposed for evidence propagation in junction trees. We evaluate our methods on general-purpose multi-core machines using Phoenix as the underlying MapReduce runtime. The experimental results show that our methods achieve 20x speedup on an Intel Westmere-EX based system.
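A toy rendering of the map/reduce framing of evidence propagation: cliques at the same junction-tree level are independent, so a "map" phase updates every clique at that level and a "reduce" phase combines their messages for the next level. Potentials are collapsed to scalars here for brevity, and the sequential loops stand in for the phases that the MapReduce runtime (Phoenix in the paper) would execute in parallel.

```cpp
// Level-by-level evidence propagation in map/reduce form; clique
// potentials are scalars here purely for brevity.
#include <functional>
#include <numeric>
#include <vector>
#include <cstdio>

int main() {
    // One vector of clique "potentials" per junction-tree level (toy data).
    std::vector<std::vector<double>> levels = {{0.9, 0.8}, {0.7, 0.6, 0.5}};
    double msg = 1.0;   // message entering the current level
    for (auto& level : levels) {
        // Map: each clique absorbs the incoming message independently;
        // these updates are what the MapReduce runtime runs in parallel.
        for (double& p : level) p *= msg;
        // Reduce: combine the cliques' outputs into the next message.
        msg = std::accumulate(level.begin(), level.end(), 1.0,
                              std::multiplies<double>());
    }
    std::printf("final message: %g\n", msg);
}
```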
Citations: 5
Cloud Workload Analysis with SWAT
M. Breternitz, Keith Lowery, Anton Charnoff, Patryk Kamiński, Leonardo Piga
This note describes the Synthetic Workload Application Toolkit (SWAT) and presents the results from a set of experiments on some key cloud workloads. SWAT is a software platform that automates the creation, deployment, provisioning, execution, and (most importantly) data gathering of synthetic compute workloads on clusters of arbitrary size. SWAT collects and aggregates data from application execution logs, operating system call interfaces, and microarchitecture-specific performance counters. The data collected by SWAT are used to characterize the effects of network traffic, file I/O, and computation on program performance. The output is analyzed to provide insight into the design and deployment of cloud workloads and systems. Each workload is characterized according to its scalability with the number of server nodes and Hadoop server jobs, its sensitivity to network characteristics (bandwidth, latency, statistics on packet size), and its computation vs. I/O intensity as these values are adjusted via workload-specific parameters. (In the future, we will use SWAT's benchmark synthesizer capability.) We also report microarchitectural characteristics that give insight into the microarchitecture of processors better suited for this class of workloads. We contrast our results with prior work on CloudSuite [5], validating some conclusions and providing further insight into others. This illustrates SWAT's data collection capabilities and its usefulness for obtaining insight into cloud applications and systems.
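As a toy illustration of the kind of parameterized synthetic workload such a toolkit sweeps, the sketch below exposes two knobs, compute iterations and file-I/O volume, on the command line. The knobs, defaults, and scratch-file name are all invented for illustration and are not SWAT's actual interface.

```cpp
// Toy synthetic workload: knob 1 controls compute intensity, knob 2
// controls file-I/O volume. Not SWAT's real workload format.
#include <cstdio>
#include <cstdlib>

int main(int argc, char** argv) {
    long compute_iters = argc > 1 ? std::atol(argv[1]) : 1000000;  // knob 1
    long io_bytes      = argc > 2 ? std::atol(argv[2]) : 1 << 20;  // knob 2

    volatile double acc = 0.0;                 // compute phase
    for (long i = 0; i < compute_iters; ++i) acc += i * 0.5;

    std::FILE* f = std::fopen("swat_scratch.bin", "wb");  // I/O phase
    if (!f) return 1;
    char buf[4096] = {0};
    for (long w = 0; w < io_bytes; w += (long)sizeof buf)
        std::fwrite(buf, 1, sizeof buf, f);
    std::fclose(f);

    std::printf("done: %ld iters, %ld bytes written (acc=%f)\n",
                compute_iters, io_bytes, acc);
}
```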
Citations: 8
Scalable Algorithms for Distributed-Memory Adaptive Mesh Refinement
Akhil Langer, J. Lifflander, P. Miller, K. Pan, L. Kalé, P. Ricker
This paper presents scalable algorithms and data structures for adaptive mesh refinement computations. We describe a novel mesh restructuring algorithm for adaptive mesh refinement that uses a constant number of collectives regardless of the refinement depth. To further increase scalability, we describe a localized, hierarchical, coordinate-based block indexing scheme, in contrast to traditional linear numbering schemes, which incur unnecessary synchronization. In contrast to existing approaches, which take O(P) time and storage per process, our approach takes only constant time and has a very small memory footprint. With these optimizations, as well as an efficient mapping scheme, our algorithm is scalable and suitable for large, highly refined meshes. We present strong-scaling experiments up to 2k ranks on Cray XK6 and 32k ranks on IBM Blue Gene/Q.
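One way to realize such a hierarchical coordinate-based index is sketched below for a 2-D quadtree (an assumption; the paper's exact scheme is not reproduced here): a block's identifier records the path of child quadrants from the root, so refining, coarsening, and locating a parent are constant-time local bit operations requiring no global renumbering or synchronization.

```cpp
// Quadtree path index sketch: 2 bits per refinement level, appended on
// refine and dropped on coarsen; all operations are local and O(1).
#include <cstdint>
#include <cstdio>

struct BlockId {
    uint64_t path;  // child quadrant (0..3) at each level, 2 bits per level
    int depth;
};

BlockId child_of(BlockId b, int quadrant) {  // refine: append 2 bits
    return { (b.path << 2) | static_cast<uint64_t>(quadrant & 3), b.depth + 1 };
}

BlockId parent_of(BlockId b) {               // coarsen: drop 2 bits
    return { b.path >> 2, b.depth - 1 };
}

int main() {
    BlockId root{0, 0};
    BlockId b = child_of(child_of(root, 3), 1);  // quadrant 3, then 1
    std::printf("block:  depth=%d path=%llx\n", b.depth,
                (unsigned long long)b.path);     // 0xD = binary 11|01
    BlockId p = parent_of(b);
    std::printf("parent: depth=%d path=%llx\n", p.depth,
                (unsigned long long)p.path);     // back to quadrant 3
}
```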
Citations: 29
CSHARP: Coherence and SHaring Aware Cache Replacement Policies for Parallel Applications
Biswabandan Panda, S. Balachandran
Parallel applications are becoming mainstream, and architectural techniques for multicores that target these applications are the need of the hour. Sharing of data by multiple threads, and issues due to data coherence, are unique to parallel applications. We propose CSHARP, a hardware framework that brings coherence and sharing awareness to any shared last-level cache replacement policy. We use the degree of sharing of cache lines and the information present in coherence vectors to make replacement decisions. We apply CSHARP to a state-of-the-art cache replacement policy called TA-DRRIP to show its effectiveness. Our experiments on a simulated four-core system show that applying CSHARP to TA-DRRIP gives an extra 10% reduction in miss rate at the LLC. Compared to the LRU policy, CSHARP on TA-DRRIP shows an 18% miss-rate reduction and a 7% performance boost. We also show the scalability of our proposal by studying the hardware overhead and performance on an 8-core system.
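A hedged sketch of what sharing-aware victim selection can look like (the sharer-vector width and rank values are illustrative, and the base policy here is a plain age rank rather than TA-DRRIP): a line whose coherence vector shows no remaining sharers is treated as dead and evicted first; otherwise the base policy's rank decides.

```cpp
// Sharing-aware victim selection sketch: dead lines (no sharers) beat
// live lines; ties fall back to the base policy's rank.
#include <bitset>
#include <vector>
#include <cstdio>

struct Line {
    std::bitset<8> sharers;  // coherence vector: one bit per core
    unsigned rank;           // base replacement rank (higher = older)
};

int pick_victim(const std::vector<Line>& set) {
    int victim = 0;
    bool victim_dead = set[0].sharers.none();
    for (int i = 1; i < (int)set.size(); ++i) {
        bool dead = set[i].sharers.none();
        if (dead != victim_dead ? dead                     // dead beats live
                                : set[i].rank > set[victim].rank) {
            victim = i;
            victim_dead = dead;
        }
    }
    return victim;
}

int main() {
    std::vector<Line> set = {{0b0011, 5}, {0b0000, 1}, {0b0100, 7}};
    std::printf("evict way %d\n", pick_victim(set));  // way 1: no sharers
}
```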
Citations: 9
Integrating Dataflow Abstractions into the Shared Memory Model
Vladimir Gajinov, Srdjan Stipic, O. Unsal, T. Harris, E. Ayguadé, A. Cristal
In this paper we present the Atomic Dataflow model (ADF), a new task-based parallel programming model for C/C++ which integrates dataflow abstractions into the shared memory programming model. The ADF model provides pragma directives that allow a programmer to organize a program into a set of tasks and to explicitly define input data for each task. The task dependency information is conveyed to the ADF runtime system, which constructs the dataflow task graph and builds the necessary infrastructure for dataflow execution. Additionally, the ADF model allows tasks to share data. The key idea is that computation is triggered by dataflow between tasks but that, within a task, execution occurs by making atomic updates to common mutable state. To that end, the ADF model employs transactional memory, which guarantees the atomicity of shared memory updates. We show examples that illustrate how the programmability of shared memory can be improved using the ADF model. Moreover, our evaluation shows that the ADF model performs well in comparison with programs parallelized using OpenMP and transactional memory.
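The sketch below illustrates this execution model rather than ADF's actual pragma syntax: each task carries a pending-input counter and fires once its declared inputs have been produced, and updates to shared state happen inside an atomic section (a mutex here stands in for the transactional memory ADF employs).

```cpp
// Execution-model sketch only: a pending-input counter triggers each
// task, and a mutex stands in for a memory transaction. ADF itself
// declares tasks and inputs with pragma directives, not this API.
#include <cstdio>
#include <functional>
#include <mutex>
#include <vector>

struct Task {
    int inputs_pending;              // dataflow trigger: missing inputs
    std::function<void()> body;
};

std::mutex tm_mutex;                 // stand-in for a transaction
int shared_total = 0;                // mutable state shared by tasks

int main() {
    std::vector<Task> tasks;
    for (int v : {3, 4, 5})
        tasks.push_back({1, [v] {
            std::lock_guard<std::mutex> tx(tm_mutex);  // "transaction"
            shared_total += v;                         // atomic shared update
        }});

    // A producer satisfies each task's single declared input; a task
    // fires as soon as its last input arrives.
    for (auto& t : tasks)
        if (--t.inputs_pending == 0) t.body();

    std::printf("shared_total = %d\n", shared_total);  // prints 12
}
```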
Citations: 15
Assessing Energy Efficiency of Fault Tolerance Protocols for HPC Systems
Esteban Meneses, O. Sarood, L. Kalé
An exascale machine is expected to be delivered in the 2018-2020 time frame. Such a machine will be able to tackle some of the hardest computational problems and to extend our understanding of Nature and the universe. However, to make that a reality, the HPC community has to solve a few important challenges. Resilience will become a prominent problem because an exascale machine will experience frequent failures due to the large number of components it will encompass. Some form of fault tolerance has to be incorporated in the system to keep the progress rate of applications as high as possible. In parallel, the system will have to be more careful about power management. There are two dimensions to power. First, in a power-limited environment, all the layers of the system have to adhere to that limitation (including the fault tolerance layer). Second, power will be relevant due to energy consumption: an exascale installation will have to pay a large energy bill. It is fundamental to increase our understanding of the energy profile of different fault tolerance schemes. This paper presents an evaluation of three different fault tolerance approaches: checkpoint/restart, message-logging, and parallel recovery. Using programs from different programming models, we show that parallel recovery is the most energy-efficient solution for an execution with failures. At the same time, parallel recovery is able to finish the execution faster than the other approaches. We explore the behavior of these approaches at extreme scales using an analytical model. At large scale, parallel recovery is predicted to reduce the total execution time of an application by 17% and the energy consumption by 13% when compared to checkpoint/restart.
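For intuition about the trade-off being evaluated, here is a first-order back-of-the-envelope model (our own illustration, not the paper's analytical model): total time is useful work plus checkpoint overhead plus expected rework after failures, parallel recovery shrinks only the rework term, and energy is average power times total time. Every constant below is an assumption.

```cpp
// Back-of-the-envelope model; all constants are assumed, and this is
// not the paper's analytical model.
#include <cstdio>

int main() {
    const double W       = 24 * 3600.0;  // useful work (s)
    const double tau     = 600.0;        // checkpoint interval (s)
    const double delta   = 30.0;         // cost of one checkpoint (s)
    const double mtbf    = 4 * 3600.0;   // system mean time between failures (s)
    const double power   = 2.0e5;        // average machine power (W)
    const double speedup = 8.0;          // assumed parallel-recovery replay speedup

    const double failures = W / mtbf;              // expected failure count
    const double rework   = failures * (tau / 2);  // ~half an interval lost each
    const double overhead = (W / tau) * delta;     // checkpointing cost

    const double t_cr  = W + overhead + rework;            // checkpoint/restart
    const double t_par = W + overhead + rework / speedup;  // parallel recovery

    std::printf("checkpoint/restart: %.2f h, %.2f MWh\n",
                t_cr / 3600, power * t_cr / 3.6e9);
    std::printf("parallel recovery:  %.2f h, %.2f MWh\n",
                t_par / 3600, power * t_par / 3.6e9);
}
```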
Citations: 47