
Latest publications from the 2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)

Impacts of Three Soft-Fault Models on Hybrid Parallel Asynchronous Iterative Methods
Evan Coleman, Erik J. Jensen, M. Sosonkina
This study seeks to understand the soft-error vulnerability of asynchronous iterative methods, with a focus on stationary iterative solvers such as Jacobi. The implementations use hybrid parallelism: the computational work is distributed over multiple nodes using MPI and parallelized on each node using OpenMP. A series of experiments measures the impact of an undetected soft fault on an asynchronous iterative method, and compares and contrasts several techniques for simulating the occurrence of a fault and then recovering from its effects. The data show that the two numerical soft-fault models tested here produce sufficiently adverse behavior more consistently than a "bit-flip" model, making them better suited for exercising a variety of recovery strategies, such as those based on partial checkpointing.
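The self-correcting nature of stationary solvers that the study relies on can be illustrated with a minimal single-threaded sketch (not the authors' MPI/OpenMP code): a Jacobi solve on a small diagonally dominant system, with one large perturbation injected mid-run to stand in for an undetected numerical soft fault. The solver still converges to the correct answer.

```python
def jacobi_with_fault(A, b, iters=200, fault_iter=None, fault_value=1e6):
    """Synchronous Jacobi iteration; optionally corrupt x[0] at one iteration."""
    n = len(b)
    x = [0.0] * n
    for k in range(iters):
        x = [(b[i] - sum(A[i][j] * x[j] for j in range(n) if j != i)) / A[i][i]
             for i in range(n)]
        if k == fault_iter:
            x[0] = fault_value  # simulated undetected numerical soft fault
    return x

A = [[4.0, 1.0], [1.0, 3.0]]  # diagonally dominant, so Jacobi converges
b = [1.0, 2.0]                # exact solution: x = (1/11, 7/11)
clean = jacobi_with_fault(A, b)
faulty = jacobi_with_fault(A, b, fault_iter=50)
```

Because the iteration contracts toward the fixed point, the fault's effect decays geometrically over the remaining iterations, which is why such solvers tolerate transient faults without checkpoint/restart.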
DOI: https://doi.org/10.1109/CAHPC.2018.8645942 (published 2018-09-01)
Citations: 3
Mainstream vs. Emerging HPC: Metrics, Trade-Offs and Lessons Learned
M. Radulovic, Kazi Asifuzzaman, D. Zivanovic, Nikola Rajovic, G. C. D. Verdière, D. Pleiter, M. Marazakis, Nikolaos D. Kallimanis, P. Carpenter, Petar Radojkovic, E. Ayguadé
Various servers with different characteristics and architectures are hitting the market, and evaluating and comparing them in terms of HPC features is a complex, multidimensional task. In this paper, we share our experience of evaluating a diverse set of HPC systems, consisting of three mainstream and five emerging architectures. We evaluate performance and power efficiency using the prominent HPC benchmarks High-Performance Linpack (HPL) and High Performance Conjugate Gradients (HPCG), and expand our analysis using publicly available specialized kernel benchmarks that target specific system components. In addition to a large body of quantitative results, we emphasize six usually overlooked aspects of HPC platform evaluation, and share our conclusions and lessons learned. Overall, we believe that this paper will improve the evaluation and comparison of HPC platforms, taking a first step towards a more reliable and uniform methodology.
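The two headline metrics of such an evaluation can be computed from standard conventions. The sketch below (function names are ours, not the paper's) uses the customary HPL operation count of 2/3·n³ + 2·n² floating-point operations for an n×n solve, and energy efficiency as delivered GFLOPS per watt.

```python
def hpl_gflops(n, seconds):
    """Delivered GFLOPS for an HPL run of problem size n taking `seconds`,
    using the standard HPL operation count 2/3*n^3 + 2*n^2."""
    flops = 2.0 / 3.0 * n**3 + 2.0 * n**2
    return flops / seconds / 1e9

def gflops_per_watt(gflops, watts):
    """Energy efficiency: sustained GFLOPS per watt of average power draw."""
    return gflops / watts
```

For example, a size-1000 solve finishing in one second delivers roughly 0.67 GFLOPS; dividing that by measured average power gives the efficiency figure used to compare mainstream and emerging platforms.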
DOI: https://doi.org/10.1109/CAHPC.2018.8645891 (published 2018-09-01)
Citations: 3
Adaptive Scheduling of Collocated Applications Using a Task-Based Runtime System
J. Dokulil, S. Benkner
Task-based runtime systems are considered one promising option for dealing with the challenges of upcoming parallel architectures. The greater flexibility of these runtimes can also be used to dynamically adjust the resources allocated to applications, adapting to the current load of the system and the progress of the applications. In our work, we have extended our implementation of the Open Community Runtime to support dynamic adjustment of execution threads. The runtimes communicate with an agent process, which collects performance data, computes a thread allocation, and instructs the runtimes to make the required adjustments. We have tested our solution under different scenarios, focusing on producer-consumer applications, where dynamic resource management was used to keep the applications in sync, improving overall performance in some cases.
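The allocation step the agent performs can be sketched as follows. This is a hypothetical stand-in for the paper's policy, assuming the agent measures per-thread throughput for each collocated stage: to keep a producer and consumer in sync, each stage gets a thread share inversely proportional to its per-thread rate, so their aggregate rates match.

```python
def balance_threads(total, per_thread_rate):
    """Split `total` threads across stages so that stage throughputs
    (threads * per-thread rate) are approximately equal."""
    weights = [1.0 / r for r in per_thread_rate]   # slower stages weigh more
    s = sum(weights)
    alloc = [max(1, round(total * w / s)) for w in weights]
    # repair rounding so the allocation sums exactly to the thread budget
    while sum(alloc) > total:
        alloc[alloc.index(max(alloc))] -= 1
    while sum(alloc) < total:
        alloc[alloc.index(min(alloc))] += 1
    return alloc
```

With 8 threads and a consumer three times faster per thread than the producer, the producer receives 6 threads and the consumer 2, giving both stages a matched throughput of 6 items per tick.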
DOI: https://doi.org/10.1109/CAHPC.2018.8645869 (published 2018-09-01)
Citations: 1
From Java to FPGA: An Experience with the Intel HARP System
Pedro Caldeira, J. Penha, L. Bragança, Ricardo Ferreira, J. Nacif, R. Ferreira, Fernando Magno Quintão Pereira
Recent years have seen a surge in the popularity of Field-Programmable Gate Arrays (FPGAs). Programmers can use them to develop high-performance systems that are efficient not only in time but also in energy. Yet programming FPGAs remains a difficult task. Even though OpenCL interfaces for synthesizing such hardware exist today, higher-level programming languages such as Java, C#, or Python remain distant from them. In this paper, we describe a compiler, and its supporting runtime environment, that reduces this distance by translating functional code written in Java to the Intel HARP platform. We make two contributions. The first is the insight that a functional-style library is a good starting point for bridging the gap between high-level programming idioms and FPGAs. The second is the implementation of the system itself, including the compiler, its intermediate representation, and all the runtime support necessary to shield developers from the task of transferring data back and forth between the host CPU and the accelerator. To demonstrate the effectiveness of our system, we have used it to implement different benchmarks from image processing and data mining. For large inputs, we observe consistent 20x speedups over the Java Virtual Machine across all our benchmarks. Depending on the target function that we compile, this speedup can reach 280x.
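The functional-library-as-frontend idea can be sketched in a few lines (hypothetical, and in Python rather than Java for brevity; not the paper's compiler): pipeline operations are first recorded as a tiny intermediate representation, which a backend could lower to hardware; the host fallback below simply interprets the same IR.

```python
from functools import reduce

class Pipeline:
    """Records map/reduce stages as an IR instead of executing them eagerly."""
    def __init__(self):
        self.ops = []                      # the intermediate representation

    def map(self, f):
        self.ops.append(("map", f))
        return self

    def reduce(self, g, init):
        self.ops.append(("reduce", g, init))
        return self

    def run(self, data):
        """Host-side interpreter; an FPGA backend would consume self.ops instead."""
        for op in self.ops:
            if op[0] == "map":
                data = [op[1](x) for x in data]
            else:
                data = reduce(op[1], data, op[2])
        return data

# sum of squares expressed as a two-stage pipeline
result = Pipeline().map(lambda x: x * x).reduce(lambda a, b: a + b, 0).run([1, 2, 3])
```

Because the user only composes side-effect-free stages, the IR gives the compiler a clean dataflow graph to translate, while the runtime retains a functionally identical software fallback.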
DOI: https://doi.org/10.1109/CAHPC.2018.8645951 (published 2018-09-01)
Citations: 10
Enabling Efficient Job Dispatching in Accelerator-Extended Heterogeneous Systems with Unified Address Space
Georgios Kornaros, M. Coppola
In addition to GPUs, which see increasingly widespread use for general-purpose computing, special-purpose accelerators are widely used for their high efficiency and low power consumption; attached to general-purpose CPUs, they form Heterogeneous System Architectures (HSAs). This paper presents a new communication model for heterogeneous computing that uses a unified memory space for CPUs and accelerators and removes the requirement for virtual-to-physical address translation through an I/O Memory Management Unit (IOMMU), thus easing the adoption of Heterogeneous System Architectures in SoCs that do not include an IOMMU but still account for a large share of real products. By exploiting user-level queuing, dispatching workloads to specialized hardware accelerators removes the drawbacks of copying objects through operating-system calls. Additionally, dispatching is structured around fixed-size packet management that specialized hardware logic accelerates. To also eliminate IOMMU performance loss and IOMMU management complexity, we propose placing accelerator data directly in contiguous space in system memory, where the dispatcher provides transparent access to the accelerators and at the same time offers an easy abstraction in the programming layer for the application. We demonstrate dispatching rates that exceed ten thousand jobs per second with architectural support implemented on a low-cost embedded System-on-Chip, bounded only by the computing capacity of the hardware accelerators.
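The fixed-size packet idea can be sketched as follows (a hypothetical layout, not the paper's descriptor format): each job is packed into a 16-byte record of a job id, an accelerator id, and a 64-bit buffer address, so hardware can walk the user-level queue at a fixed stride without parsing variable-length messages.

```python
import struct

# 16-byte job descriptor: job id, accelerator id, 64-bit data-buffer address
DESC = struct.Struct("<IIQ")

def enqueue(ring, job_id, accel_id, addr):
    """Append one fixed-size descriptor to the user-level queue."""
    ring.extend(DESC.pack(job_id, accel_id, addr))

def dequeue(ring):
    """Pop and decode the oldest descriptor (what the dispatcher would do)."""
    rec = bytes(ring[:DESC.size])
    del ring[:DESC.size]
    return DESC.unpack(rec)

ring = bytearray()                 # stands in for a shared contiguous buffer
enqueue(ring, 1, 0, 0x1000)
enqueue(ring, 2, 1, 0x2000)
```

Since every record has the same size, no copies or kernel transitions are needed to hand a job to the dispatcher: producing a job is a single append into memory the accelerator side can already address.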
DOI: https://doi.org/10.1109/CAHPC.2018.8645945 (published 2018-09-01)
Citations: 3
Deep Learning on Large-Scale Muticore Clusters
Kazumasa Sakivama, S. Kato, Y. Ishikawa, A. Hori, Abraham Monrroy
Convolutional neural networks (CNNs) have achieved outstanding accuracy among conventional machine learning algorithms. Recent works have shown that large and complicated models, which are costly to train, are needed to obtain higher accuracy. To train these models efficiently on high-performance computers (HPCs), many parallelization techniques for CNNs have been developed. However, most techniques mainly target GPUs, and parallelization for CPUs has not been fully investigated. This paper explores CNN training performance on large-scale multicore clusters by optimizing intra-node processing and applying inter-node parallelization techniques developed for multiple GPUs. Detailed experiments conducted on state-of-the-art multi-core processors using the OpenMP API and the MPI framework demonstrate that Caffe-based CNNs can be accelerated by well-designed multithreaded programs. We achieve up to a 1.64x speedup in convolution operations with our devised lowering strategy compared to conventional lowering, and a 772x speedup on 864 nodes compared to one node.
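"Lowering" here refers to the standard im2col transformation that frameworks like Caffe use: input patches are unfolded into rows so the convolution becomes a single matrix product, which maps well onto multithreaded GEMM kernels. A minimal single-channel sketch (ours, not the paper's optimized variant):

```python
def im2col(x, kh, kw):
    """Unfold every kh*kw patch of a 2-D input into one flat row."""
    H, W = len(x), len(x[0])
    return [[x[i + di][j + dj] for di in range(kh) for dj in range(kw)]
            for i in range(H - kh + 1) for j in range(W - kw + 1)]

def conv_lowered(x, k):
    """Convolution as (im2col matrix) x (flattened kernel)."""
    kflat = [v for row in k for v in row]
    return [sum(a * b for a, b in zip(patch, kflat))
            for patch in im2col(x, len(k), len(k[0]))]

def conv_direct(x, k):
    """Reference direct convolution for comparison."""
    kh, kw = len(k), len(k[0])
    return [sum(x[i + di][j + dj] * k[di][dj]
                for di in range(kh) for dj in range(kw))
            for i in range(len(x) - kh + 1) for j in range(len(x[0]) - kw + 1)]

image = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
```

The two formulations compute identical outputs; the lowered form trades extra memory for a regular matrix multiply, which is where the paper's intra-node (OpenMP) optimizations apply.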
DOI: https://doi.org/10.1109/CAHPC.2018.8645860 (published 2018-09-01)
Citations: 3
A Case Study on Optimizing Accurate Half Precision Average
K. Peou, A. Kelly, J. Falcou, Cécile Germain
In this work, we study the numerical performance of various common algorithms used to calculate the average of an array of half precision (FP16) floating point values. While the current generation of CPUs does not support native FP16 arithmetic, it is a planned feature in a number of next-generation CPUs. FP16 arithmetic was emulated via the half software library. Due to the limitations of the FP16 data type, some algorithms proved insufficient for arrays as small as 100 elements. We propose an algorithm that allows numerically stable FP16 computation of the average and compare it to the naive floating point (FP32) algorithm in terms of both numerical precision and runtime performance. We find that our algorithm offers comparable robustness, numerical precision, and SIMD performance to the higher precision computation.
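The failure mode and one stable fix can be sketched in pure Python, emulating FP16 through the `struct` module's `'e'` half-precision format rather than the paper's `half` C++ library (the specific algorithms below are illustrative, not the authors'). Naively accumulating the sum in FP16 overflows its range (max finite value 65504), while an incremental running mean keeps every intermediate near the mean itself.

```python
import struct

def fp16(x):
    """Round x to the nearest IEEE-754 half-precision value (inf on overflow)."""
    try:
        return struct.unpack('e', struct.pack('e', x))[0]
    except OverflowError:
        return float('inf') if x > 0 else float('-inf')

def mean_naive(values):
    acc = 0.0
    for v in values:
        acc = fp16(acc + fp16(v))   # running sum can leave FP16's range
    return fp16(acc / len(values))

def mean_running(values):
    m = 0.0
    for n, v in enumerate(values, 1):
        m = fp16(m + fp16((fp16(v) - m) / n))  # incremental mean stays bounded
    return m

data = [65.0] * 2048   # true mean is 65.0, but the sum (133120) exceeds 65504
```

This illustrates why the paper reports algorithms failing "for arrays as small as 100 elements": the limitation is the FP16 dynamic range of the accumulator, not the input values.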
DOI: https://doi.org/10.1109/CAHPC.2018.8645923 (published 2018-09-01)
Citations: 1
Effect of Network Topology on the Performance of ADMM-Based SVMs
Shirin Tavara, Alexander Schliep
The Alternating Direction Method of Multipliers (ADMM) is one of the promising frameworks for training Support Vector Machines (SVMs) on large-scale data in a distributed manner. In a consensus-based ADMM, nodes may only communicate with one-hop neighbors, which can cause slow convergence. In this paper, we investigate the impact of network topology on the convergence speed of ADMM-based SVMs using expander graphs. In particular, we investigate how strongly the expansion property of the network influences convergence, and which topologies are preferable. In addition, we supply an implementation that makes these theoretical advances practically available. The experimental results show that graphs with large spectral gaps and higher degrees exhibit accelerated convergence.
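Why topology matters can be seen with a toy experiment (ours, not the paper's): one-hop gossip averaging, the communication pattern underlying consensus ADMM, mixes in a single round on a complete graph (large spectral gap) but only slowly on a ring (small spectral gap).

```python
def gossip_round(x, neighbors):
    """Each node replaces its value with the average over its closed neighborhood."""
    return [sum(x[j] for j in neighbors[i] + [i]) / (len(neighbors[i]) + 1)
            for i in range(len(x))]

def spread(x):
    """Disagreement across nodes; 0 means consensus has been reached."""
    return max(x) - min(x)

n = 8
ring = {i: [(i - 1) % n, (i + 1) % n] for i in range(n)}           # poor expander
complete = {i: [j for j in range(n) if j != i] for i in range(n)}  # best expander

x_ring = [1.0] + [0.0] * (n - 1)   # all information starts at one node
x_complete = x_ring[:]
for _ in range(5):
    x_ring = gossip_round(x_ring, ring)
    x_complete = gossip_round(x_complete, complete)
```

After five rounds the complete graph has reached exact consensus while the ring still disagrees markedly, mirroring the paper's finding that large spectral gaps accelerate consensus-based convergence.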
DOI: https://doi.org/10.1109/CAHPC.2018.8645857 (published 2018-09-01)
Citations: 3
ECHOFS: A Scheduler-Guided Temporary Filesystem to Leverage Node-Local NVMS
Alberto Miranda, Ramon Nou, Toni Cortes
The growth in data-intensive scientific applications poses strong demands on the HPC storage subsystem, as data needs to be copied from compute nodes to I/O nodes and vice versa for jobs to run. The emerging trend of adding denser, NVM-based burst buffers to compute nodes, however, offers the possibility of using these resources to build temporary file systems with specific I/O optimizations for a batch job. In this work, we present echofs, a temporary filesystem that coordinates with the job scheduler to preload a job's input files into node-local burst buffers. We present results measured with NVM emulation and with different filesystem backends (DAX/FUSE) on a local node to show the benefits of our proposal and of such coordination.
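The stage-in step that echofs coordinates with the scheduler can be sketched as follows (a hypothetical illustration, not the echofs code): before a job starts, its input files are copied from the parallel filesystem into a node-local directory standing in for the NVM burst buffer, and the job then reads the local paths.

```python
import shutil
import tempfile
from pathlib import Path

def stage_in(input_files, local_dir):
    """Preload a job's input files into the node-local tier; return the
    mapping from original paths to their staged local copies."""
    local_dir = Path(local_dir)
    local_dir.mkdir(parents=True, exist_ok=True)
    staged = {}
    for f in map(Path, input_files):
        target = local_dir / f.name
        shutil.copy2(f, target)      # copy happens before the job is launched
        staged[str(f)] = str(target)
    return staged

pfs = Path(tempfile.mkdtemp())       # stands in for the parallel filesystem
(pfs / "input.dat").write_text("payload")
staged = stage_in([pfs / "input.dat"], tempfile.mkdtemp())
```

Coordinating this copy with the scheduler means the transfer overlaps with the job's queue wait rather than its runtime, which is where the benefit over on-demand reads comes from.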
DOI: https://doi.org/10.1109/CAHPC.2018.8645894 (published 2018-09-01)
Citations: 1
Predicting the Reliability Behavior of HPC Applications
Daniel Oliveira, Francis B. Moreira, P. Rech, P. Navaux
The error rate of current High Performance Computing (HPC) systems is already on the order of one error per dozens of hours. Understanding the reliability behavior of HPC applications will be required for the next generation of supercomputers. Using this behavior, one can select efficient mitigation techniques for an application and fine-tune parameters such as checkpoint frequency. In this paper, we investigate the application of a machine learning model to predict the reliability behavior of HPC applications. We inject faults into more than 30 HPC applications executing on the Intel Xeon Phi Knights Landing (KNL) and use profiling information to build a predictive model with Support Vector Machines (SVMs). We show that the model can predict the Program Vulnerability Factor (PVF) with an average relative error of 7% for certain classes of algorithms, such as linear algebra and sorting; the average relative error across all algorithm classes is 22%. Such a fast and straightforward prediction model can serve as a filter to select the most unreliable applications for in-depth analysis.
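The quantity the SVM predicts, the PVF, is measured by fault-injection campaigns like the ones the authors ran on KNL. A minimal sketch of the measurement itself (not the paper's injector or its SVM model): exhaustively flip single bits in the inputs of a tiny "program" and count the fraction of injections whose output is silently corrupted.

```python
def pvf(program, data, bits=8):
    """Program Vulnerability Factor: fraction of single-bit flips in the
    input words that change the program's output (silent data corruption)."""
    golden = program(data)             # fault-free reference output
    corrupted = total = 0
    for i in range(len(data)):
        for bit in range(bits):
            faulty = data[:]
            faulty[i] ^= 1 << bit      # single-event upset in one word
            total += 1
            if program(faulty) != golden:
                corrupted += 1
    return corrupted / total

# flips in small elements of a max-reduction are mostly masked,
# while any flip in the maximum itself corrupts the output
vulnerability = pvf(max, [1, 2, 3, 100])
```

The interesting point, and the reason a predictor is useful, is that the PVF is well below 1: many faults are architecturally masked, and a cheap model that estimates this fraction avoids running a full injection campaign per application.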
{"title":"Predicting the Reliability Behavior of HPC Applications","authors":"Daniel Oliveira, Francis B. Moreira, P. Rech, P. Navaux","doi":"10.1109/CAHPC.2018.8645856","DOIUrl":"https://doi.org/10.1109/CAHPC.2018.8645856","url":null,"abstract":"The error rate of current High Performance Computing (HPC) systems is already in the order of one per dozens of hours. Understanding the reliability behavior of HPC applications will be required for the next generation of supercomputers. Using the reliability behavior one can select efficient mitigation techniques for the application and fine-tune parameters such as checkpoint frequency. In this paper, we investigate the application of a machine learning model to predict the reliability behavior of HPC applications. We inject faults in more than 30 HPC applications executing in the Intel Xeon Phi Knights Landing (KNL) and use profiling information to build a predictive model with Support Vector Machines (SVM). We show that the model can predict the Program Vulnerability Factor (PVF) with an average relative error of 7 % for certain classes of algorithm, such as linear algebra and sorting. The average relative error for all algorithm classes is 22 %. Such a fast and straightforward prediction model can be effective as a filter to select the most unreliable applications to perform an in-depth analysis.","PeriodicalId":307747,"journal":{"name":"2018 30th International Symposium on Computer Architecture and High Performance Computing (SBAC-PAD)","volume":"14 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2018-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131271067","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 3