
2013 SC - International Conference for High Performance Computing, Networking, Storage and Analysis (SC): Latest Publications

Using simulation to explore distributed key-value stores for extreme-scale system services
Ke Wang, Abhishek Kulkarni, M. Lang, D. Arnold, I. Raicu
Owing to the significantly high rate of component failures at extreme scales, system services will need to be failure-resistant, adaptive and self-healing. A majority of HPC services are still designed around a centralized paradigm and hence are susceptible to scaling issues. Peer-to-peer services have proved themselves at scale for wide-area internet workloads. Distributed key-value stores (KVS) are widely used as a building block for these services, but are not prevalent in HPC services. In this paper, we simulate KVS for various service architectures and examine the design trade-offs as applied to HPC service workloads to support extreme-scale systems. The simulator is validated against existing distributed KVS-based services. Via simulation, we demonstrate how failure, replication, and consistency models affect performance at scale. Finally, we emphasize the general applicability of KVS to HPC services by feeding real HPC service workloads into the simulator and presenting a KVS-based distributed job launch prototype.
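As a rough illustration of the kind of failure/replication trade-off such a simulator explores, the toy C model below varies the replication degree and reports a naive data-loss probability and a naive fully synchronous write cost. The failure probability, latency figure, and cost model are assumptions for illustration only, not the paper's simulator.

```c
/* Illustrative sketch only: a toy model of how replication degree and
 * per-node failure probability interact, in the spirit of the design
 * trade-offs the paper's simulator explores. All parameters and the
 * cost model are assumptions, not the authors' model. */
#include <math.h>
#include <stdio.h>

int main(void) {
    double p_fail = 0.01;            /* assumed per-node failure probability */
    double base_latency_us = 50.0;   /* assumed single-replica write latency */
    for (int replicas = 1; replicas <= 5; replicas++) {
        /* data is lost only if every replica fails */
        double p_loss = pow(p_fail, replicas);
        /* strongly consistent write: wait for all replicas (toy cost model) */
        double latency = base_latency_us * replicas;
        printf("replicas=%d  P(loss)=%.2e  write latency=%.0f us\n",
               replicas, p_loss, latency);
    }
    return 0;
}
```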
Citations: 70
General transformations for GPU execution of tree traversals
Michael Goldfarb, Youngjoon Jo, Milind Kulkarni
With the advent of programmer-friendly GPU computing environments, there has been much interest in offloading workloads that can exploit the high degree of parallelism available on modern GPUs. Exploiting this parallelism and optimizing for the GPU memory hierarchy is well-understood for regular applications that operate on dense data structures such as arrays and matrices. However, there has been significantly less work in the area of irregular algorithms and even less so when pointer-based dynamic data structures are involved. Recently, irregular algorithms such as Barnes-Hut and kd-tree traversals have been implemented on GPUs, yielding significant performance gains over CPU implementations. However, the implementations often rely on exploiting application-specific semantics to get acceptable performance. We argue that there are general-purpose techniques for implementing irregular algorithms on GPUs that exploit similarities in algorithmic structure rather than application-specific knowledge. We demonstrate these techniques on several tree traversal algorithms, achieving speedups of up to 38× over 32-thread CPU versions.
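For readers unfamiliar with the workload class, the sketch below shows the kind of recursive, pointer-chasing tree traversal (a toy 1-D range count with pruning) that such GPU transformations target. The data structure and function are illustrative assumptions, not code from the paper.

```c
/* Illustrative sketch only: a pointer-based tree traversal with
 * data-dependent pruning -- the irregular pattern that makes such
 * algorithms hard to map to GPUs without restructuring. */
#include <stddef.h>

struct node {
    double lo, hi;               /* bounding interval of this subtree */
    double value;                /* point stored at a leaf */
    struct node *left, *right;
};

/* Count points within [qlo, qhi]. Recursion is cut off whenever the
 * query does not overlap a subtree's bounds, so different queries take
 * different paths through the tree -- the source of irregularity. */
size_t range_count(const struct node *n, double qlo, double qhi) {
    if (n == NULL || qhi < n->lo || qlo > n->hi)
        return 0;                                  /* prune: no overlap */
    if (n->left == NULL && n->right == NULL)
        return (n->value >= qlo && n->value <= qhi) ? 1 : 0;
    return range_count(n->left, qlo, qhi) + range_count(n->right, qlo, qhi);
}
```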
Citations: 41
Taking a quantum leap in time to solution for simulations of high-Tc superconductors
Peter Staar, T. Maier, M. Summers, G. Fourestey, R. Solcà, T. Schulthess
We present a new quantum cluster algorithm to simulate models of high-Tc superconductors. This algorithm extends current methods with continuous lattice self-energies, thereby removing artificial long-range correlations. This cures the fermionic sign problem in the underlying quantum Monte Carlo solver for large clusters and realistic values of the Coulomb interaction in the entire temperature range of interest. We find that the new algorithm improves time-to-solution by nine orders of magnitude compared to current, state of the art quantum cluster simulations. An efficient implementation is given, which ports to multi-core as well as hybrid CPU-GPU systems. Running on 18,600 nodes on ORNL's Titan supercomputer enables us to compute a converged value of Tc/t = 0.053±0.0014 for a 28 site cluster in the 2D Hubbard model with U/t = 7 at 10% hole doping. Typical simulations on Titan sustain between 9.2 and 15.4 petaflops (double precision measured over full run), depending on configuration and parameters used.
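For context, the 2D Hubbard model referenced by the parameters U/t = 7 is conventionally written as (standard definition, not reproduced from the paper):

$$H = -t \sum_{\langle i,j \rangle, \sigma} \left( c^{\dagger}_{i\sigma} c_{j\sigma} + \mathrm{h.c.} \right) + U \sum_{i} n_{i\uparrow} n_{i\downarrow},$$

where t is the nearest-neighbor hopping amplitude and U the on-site Coulomb repulsion, so U/t = 7 places the reported simulations in the strongly correlated regime.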
Citations: 11
Cost-effective cloud HPC resource provisioning by building Semi-Elastic virtual clusters
Shuangcheng Niu, Jidong Zhai, Xiaosong Ma, Xiongchao Tang, Wenguang Chen
Recent studies have found cloud environments increasingly appealing for executing HPC applications, including tightly coupled parallel simulations. While public clouds offer elastic, on-demand resource provisioning and pay-as-you-go pricing, individual users setting up their on-demand virtual clusters may not be able to take full advantage of common cost-saving opportunities, such as reserved instances. In this paper, we propose a Semi-Elastic Cluster (SEC) computing model for organizations to reserve and dynamically resize a virtual cloud-based cluster. We present a set of integrated batch scheduling plus resource scaling strategies uniquely enabled by SEC, as well as an online reserved instance provisioning algorithm based on job history. Our trace-driven simulation results show that such a model yields a 61.0% cost saving compared to individual users acquiring and managing cloud resources, without increasing average job wait time. Meanwhile, the overhead of acquiring/maintaining shared cloud instances is shown to take only a few seconds.
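The core economic argument, that a shared reserved pool plus on-demand burst capacity beats purely on-demand provisioning, can be illustrated with the toy C calculation below. The prices, demand profile, and pool size are assumptions, not figures from the paper.

```c
/* Illustrative sketch only: why mixing reserved and on-demand instances
 * can cut cost for a shared, dynamically resized cluster. Prices, the
 * hourly demand profile, and the pool size are assumptions. */
#include <stdio.h>

int main(void) {
    double on_demand_rate = 0.10;   /* assumed $/instance-hour */
    double reserved_rate  = 0.04;   /* assumed effective $/instance-hour, paid 24/7 */
    int demand[24] = { 4, 4, 4, 6, 8, 12, 16, 20, 24, 24, 22, 20,
                       18, 18, 16, 14, 12, 10, 8, 8, 6, 6, 4, 4 };
    int reserved = 8;               /* assumed reserved-instance pool size */

    double cost_all_on_demand = 0.0, cost_mixed = 0.0;
    for (int h = 0; h < 24; h++) {
        cost_all_on_demand += demand[h] * on_demand_rate;
        int burst = demand[h] > reserved ? demand[h] - reserved : 0;
        cost_mixed += reserved * reserved_rate + burst * on_demand_rate;
    }
    printf("all on-demand: $%.2f/day   reserved + on-demand: $%.2f/day\n",
           cost_all_on_demand, cost_mixed);
    return 0;
}
```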
Citations: 44
Exploiting application dynamism and cloud elasticity for continuous dataflows
A. Kumbhare, Yogesh L. Simmhan, V. Prasanna
Contemporary continuous data flow systems use elastic scaling on distributed cloud resources to handle variable data rates and to meet applications' needs while attempting to maximize resource utilization. However, virtualized clouds present an added challenge due to the variability in resource performance - over time and space - thereby impacting the application's QoS. Elastic use of cloud resources and their allocation to continuous dataflow tasks need to adapt to such infrastructure dynamism. In this paper, we develop the concept of “dynamic dataflows” as an extension to continuous dataflows that utilizes alternate tasks and allows additional control over the dataflow's cost and QoS. We formalize an optimization problem to perform both deployment and runtime cloud resource management for such dataflows, and define an objective function that allows a trade-off between the application's value and resource cost. We present two novel heuristics, local and global, based on variable-sized bin packing heuristics to solve this NP-hard problem. We evaluate the heuristics against a static allocation policy for a dataflow with different data rate profiles that is simulated using VM performance traces from a private cloud data center. The results show that the heuristics are effective in intelligently utilizing cloud elasticity to mitigate the effect of both input data rate and cloud resource performance variabilities on QoS.
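As an illustration of the kind of objective such a formulation uses (an assumed generic form, not the paper's exact function), one can maximize accumulated application value minus weighted resource cost over the planning horizon:

$$\max \; \sum_{t=1}^{T} \Big( V(Q_t) \;-\; \lambda \sum_{v \in \mathcal{V}_t} c_v \Big),$$

where $Q_t$ is the QoS delivered in interval $t$, $V(\cdot)$ its value to the application, $\mathcal{V}_t$ the set of active VMs, $c_v$ their per-interval cost, and $\lambda$ the value-versus-cost trade-off weight.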
Citations: 22
11 PFLOP/s simulations of cloud cavitation collapse
D. Rossinelli, B. Hejazialhosseini, P. Hadjidoukas, C. Bekas, A. Curioni, A. Bertsch, S. Futral, S. Schmidt, N. Adams, P. Koumoutsakos
We present unprecedented, high throughput simulations of cloud cavitation collapse on 1.6 million cores of Sequoia reaching 55% of its nominal peak performance, corresponding to 11 PFLOP/s. The destructive power of cavitation reduces the lifetime of energy critical systems such as internal combustion engines and hydraulic turbines, yet it has been harnessed for water purification and kidney lithotripsy. The present two-phase flow simulations enable the quantitative prediction of cavitation using 13 trillion grid points to resolve the collapse of 15,000 bubbles. We advance by one order of magnitude the current state-of-the-art in terms of time to solution, and by two orders the geometrical complexity of the flow. The software successfully addresses the challenges that hinder the effective solution of complex flows on contemporary supercomputers, such as limited memory bandwidth, I/O bandwidth and storage capacity. The present work redefines the frontier of high performance computing for fluid dynamics simulations.
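A quick sanity check on the headline figure: 11 PFLOP/s sustained at 55% of nominal peak implies

$$\frac{11\ \text{PFLOP/s}}{0.55} = 20\ \text{PFLOP/s},$$

consistent with Sequoia's roughly 20 PFLOP/s theoretical peak across its 1.6 million cores.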
Citations: 79
Semi-automatic restructuring of offloadable tasks for many-core accelerators
N. Ravi, Yi Yang, Tao Bao, S. Chakradhar
Work division between the processor and accelerator is a common theme in modern heterogeneous computing. Recent efforts (such as LEO and OpenACC) provide directives that allow the developer to mark code regions in the original application from which offloadable tasks can be generated by the compiler. Auto-tuners and runtime schedulers work with the options (i.e., offloadable tasks) generated at compile time, which is limited by the directives specified by the developer. There is no provision for offload restructuring.
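As a minimal sketch of the directive-based starting point the paper builds on, the C fragment below marks a loop as an offloadable region in OpenACC style; the kernel and data clauses are illustrative assumptions, not an example from the paper.

```c
/* Illustrative sketch only: a developer-marked offloadable region in
 * OpenACC style. The compiler generates the offload task from the
 * annotated loop; the kernel itself is an assumed example. */
#include <stdlib.h>

void saxpy(int n, float a, const float *restrict x, float *restrict y) {
    /* the developer marks the region; the compiler derives the offload task */
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++)
        y[i] = a * x[i] + y[i];
}
```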
Citations: 11
Detection of false sharing using machine learning
Sanath Jayasena, Saman P. Amarasinghe, Asanka Abeyweera, Gayashan Amarasinghe, Himeshi De Silva, Sunimal Rathnayake, Xiaoqiao Meng, Yanbin Liu
False sharing is a major class of performance bugs in parallel applications. Detecting false sharing is difficult as it does not change the program semantics. We introduce an efficient and effective approach for detecting false sharing based on machine learning. We develop a set of mini-programs in which false sharing can be turned on and off. We then run the mini-programs both with and without false sharing, collect a set of hardware performance event counts and use the collected data to train a classifier. We can use the trained classifier to analyze data from arbitrary programs for detection of false sharing. Experiments with the PARSEC and Phoenix benchmarks show that our approach is indeed effective. We detect published false sharing regions in the benchmarks with zero false positives. Our performance penalty is less than 2%. Thus, we believe that this is an effective and practical method for detecting false sharing.
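A minimal sketch of a "mini-program" in the spirit described, where false sharing can be switched on and off at compile time (here via a padding macro; the layout, thread count, and iteration count are assumptions, not the authors' benchmark code):

```c
/* Illustrative sketch only: per-thread counters that falsely share a
 * cache line unless padding is enabled.
 * Build without false sharing: cc -O2 -pthread -DPAD fs.c
 * Build with false sharing:    cc -O2 -pthread fs.c                  */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define ITERS 100000000UL

struct counter {
    unsigned long value;
#ifdef PAD
    char pad[64 - sizeof(unsigned long)];  /* assumed 64-byte cache line */
#endif
};

static struct counter counters[NTHREADS];

static void *worker(void *arg) {
    struct counter *c = arg;
    for (unsigned long i = 0; i < ITERS; i++)
        c->value++;            /* unpadded: neighbors share one cache line */
    return NULL;
}

int main(void) {
    pthread_t tid[NTHREADS];
    for (int t = 0; t < NTHREADS; t++)
        pthread_create(&tid[t], NULL, worker, &counters[t]);
    for (int t = 0; t < NTHREADS; t++)
        pthread_join(tid[t], NULL);
    printf("done\n");
    return 0;
}
```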
Citations: 37
Optimization of cloud task processing with checkpoint-restart mechanism
S. Di, Y. Robert, F. Vivien, Derrick Kondo, Cho-Li Wang, F. Cappello
In this paper, we aim at optimizing fault-tolerance techniques based on a checkpointing/restart mechanism, in the context of cloud computing. Our contribution is three-fold. (1) We derive a fresh formula to compute the optimal number of checkpoints for cloud jobs with varied distributions of failure events. Our analysis is not only generic with no assumption on failure probability distribution, but also attractively simple to apply in practice. (2) We design an adaptive algorithm to optimize the impact of checkpointing regarding various costs like checkpointing/restart overhead. (3) We evaluate our optimized solution in a real cluster environment with hundreds of virtual machines and Berkeley Lab Checkpoint/Restart tool. Task failure events are emulated via a production trace produced on a large-scale Google data center. Experiments confirm that our solution is fairly suitable for Google systems. Our optimized formula outperforms Young's formula by 3-10 percent, reducing wall-clock lengths by 50-100 seconds per job on average.
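For reference, Young's formula, the baseline the optimized formula is compared against, gives the first-order optimal checkpoint interval as

$$\tau_{\text{opt}} \approx \sqrt{2\,C\,M},$$

where $C$ is the time to write one checkpoint and $M$ is the mean time between failures (a standard result; the paper's own formula refines this for varied failure-event distributions).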
Citations: 80
A computationally efficient algorithm for the 2D covariance method
Oded Green, Y. Birk
The estimated covariance matrix is a building block for many algorithms, including signal and image processing. The Covariance Method is an estimator for the covariance matrix, favored both as an estimator and in view of the convenient properties of the matrix that it produces. However, the considerable computational requirements limit its use. We present a novel computation algorithm for the covariance method, which dramatically reduces the computational complexity (both ALU operations and memory access) relative to previous algorithms. It has a small memory footprint, is highly parallelizable and requires no synchronization among compute threads. On a 40-core X86 system, we achieve 1200X speedup relative to a straightforward single-core implementation; even on a single core, 35X speedup is achieved.
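To see why the computational requirements are considerable, note that a straightforward covariance estimate from $N$ sample vectors of length $d$ accumulates $N$ outer products, costing $O(N d^2)$ operations (a generic sample-covariance formulation assumed here for illustration; the Covariance Method's exact windowing and the paper's 2D variant differ in detail):

$$\hat{R} = \frac{1}{N} \sum_{i=1}^{N} x_i x_i^{H}.$$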
Citations: 1