Dawn: a High-level Domain-Specific Language Compiler Toolchain for Weather and Climate Applications
C. Osuna, Tobias Wicky, Fabian Thuering, T. Hoefler, O. Fuhrer
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200205

High-level programming languages that let scientists express numerical methods while generating efficient parallel implementations are of key importance for the productivity of domain scientists. The diversity and complexity of hardware architectures poses a major challenge for large and complex models, which must be ported to and maintained on multiple architectures combining various parallel programming models. Several domain-specific languages (DSLs) have been developed to address the portability problem, but they usually impose a parallel model tied to specific numerical methods and support optimizations only for operators of limited scope. Dawn provides a concise, high-level language for expressing numerical finite-difference/finite-volume methods in a sequential, descriptive style. The sequential statements are transformed into an efficient, target-dependent parallel implementation by the Dawn compiler toolchain. We demonstrate our approach on the dynamical solver of the COSMO model, achieving performance improvements of up to 2x and code size reductions of up to 5x.
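
As a flavor of what this looks like in practice, the sketch below states a horizontal Laplacian, a typical finite-difference stencil, as plain sequential NumPy. This is illustrative only: a toolchain like Dawn consumes such sequential, descriptive statements in its own frontend syntax (not the Python shown here) and compiles them into parallel, target-dependent code.

```python
# Illustrative sequential stencil, NOT Dawn syntax: the kind of descriptive
# finite-difference update a DSL compiler would parallelize per target.
import numpy as np

def horizontal_laplacian(phi: np.ndarray) -> np.ndarray:
    """4-point Laplacian on the interior of a 2D field (halo width 1)."""
    lap = np.zeros_like(phi)
    lap[1:-1, 1:-1] = (phi[2:, 1:-1] + phi[:-2, 1:-1]
                       + phi[1:-1, 2:] + phi[1:-1, :-2]
                       - 4.0 * phi[1:-1, 1:-1])
    return lap

phi = np.random.rand(64, 64)
print(horizontal_laplacian(phi).shape)  # (64, 64)
```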
{"title":"Dawn: a High-level Domain-Specific Language Compiler Toolchain for Weather and Climate Applications","authors":"C. Osuna, Tobias Wicky, Fabian Thuering, T. Hoefler, O. Fuhrer","doi":"10.14529/jsfi200205","DOIUrl":"https://doi.org/10.14529/jsfi200205","url":null,"abstract":"High-level programming languages that allow to express numerical methods and generate efficient parallel implementations are of key importance for the productivity of domain-scientists. The diversity and complexity of hardware architectures is imposing a huge challenge for large and complex models that must be ported and maintained for multiple architectures combining various parallel programming models. Several domain-specific languages (DSLs) have been developed to address the portability problem, but they usually impose a parallel model for specific numerical methods and support optimizations for limited scope operators. Dawn provides a high-level concise language for expressing numerical finite difference/volume methods using a sequential and descriptive language. The sequential statements are transformed into an efficient target-dependent parallel implementation by the Dawn compiler toolchain. We demonstrate our approach on the dynamical solver of the COSMO model, achieving performance improvements and code size reduction of up to 2x and 5x, respectively.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132662979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance Reduction For Automatic Development of Parallel Applications For Reconfigurable Computer Systems
A. Dordopulo, I. Levin
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200201

In this paper, we review a suboptimal methodology for mapping a task information graph onto the architecture of a reconfigurable computer system. Using performance reduction methods, we can solve computational problems whose hardware costs exceed the available hardware resources. We prove theorems concerning the properties of sequential reductions; in our case, the reduction types are reduction by the number of basic subgraphs, by the number of computing devices, and by data width. On the basis of the proved theorems and their corollaries, we develop a methodology of reduction transformations of a task information graph for its automatic adaptation to the architecture of a reconfigurable computer system. We estimate the maximum number of transformations that, according to the suggested methodology, are needed for a balanced reduction of the performance and hardware costs of applications for reconfigurable computer systems.
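
The trade-off being balanced can be made concrete with a toy model (an assumption of ours, not the authors' methodology): each reduction step exchanges a factor of two in performance for a factor of two in hardware cost, and steps are applied until the task information graph fits the available resource.

```python
def balanced_reduction(hw_cost: float, hw_available: float):
    """Apply successive 2x performance reductions until the design fits.

    Returns (reduction_steps, resulting_slowdown). The fixed factor-of-2
    cost/performance exchange is an illustrative assumption.
    """
    steps = 0
    while hw_cost > hw_available:
        hw_cost /= 2.0  # e.g., halve the number of basic subgraphs or devices
        steps += 1
    return steps, 2 ** steps

print(balanced_reduction(1000.0, 80.0))  # -> (4, 16): fits after 4 steps, 16x slower
```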
{"title":"Performance Reduction For Automatic Development of Parallel Applications For Reconfigurable Computer Systems","authors":"A. Dordopulo, I. Levin","doi":"10.14529/jsfi200201","DOIUrl":"https://doi.org/10.14529/jsfi200201","url":null,"abstract":"In the paper, we review a suboptimal methodology of mapping of a task information graph on the architecture of a reconfigurable computer system. Using performance reduction methods, we can solve computational problems which need hardware costs exceeding the available hardware resource. We proved theorems, concerning properties of sequential reductions. In our case, we have the following types of reduction such as the reduction by number of basic subgraphs, by number of computing devices, and by data width. On the base of the proved theorems and corollaries, we developed the methodology of reduction transformations of a task information graph for its automatic adaptation to the architecture of a reconfigurable computer system. We estimated the maximum number of transformations, which, according to the suggested methodology, are needed for balanced reduction of the performance and hardware costs of applications for reconfigurable computer systems.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122516805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Potential of I/O Aware Workflows in Climate and Weather
J. Kunkel, L. Pedro
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200203

The efficient, convenient, and robust execution of data-driven workflows and enhanced data management are essential for productivity in scientific computing. In HPC, the concerns of storage and computing are traditionally separated and optimised independently from each other and from the needs of the end-to-end user. In complex workflows, however, this is becoming problematic. These problems are particularly acute in climate and weather workflows, which, as well as becoming increasingly complex and exploiting deep storage hierarchies, can involve multiple data centres. The key contributions of this paper are: 1) a sketch of a vision for an integrated data-driven approach, with a discussion of the associated challenges and implications, and 2) an architecture and roadmap consistent with this vision that would allow seamless integration into current climate and weather workflows, as it utilises versions of existing tools (ESDM, Cylc, XIOS, and DDN's IME). The vision proposed here is built on the belief that workflows composed of data-, computing-, and communication-intensive tasks should drive interfaces and hardware configurations to better support the programming models. When delivered, this work will increase the opportunity for smarter scheduling of computing by considering storage in heterogeneous storage systems. We illustrate the performance impact on an example workload using a model built on performance data measured with ESDM at DKRZ.
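
The closing claim can be illustrated with a toy cost model in the spirit of the one the paper builds (the bandwidth and volume figures below are placeholders, not the DKRZ measurements): a task's wall time is its compute time plus its data volume divided by the bandwidth of the storage tier it is scheduled against, so tier-aware scheduling directly changes workflow runtime.

```python
# Toy I/O-aware cost model; all numbers are illustrative assumptions.
TIER_BANDWIDTH = {"parallel_fs": 100e9, "burst_buffer": 1000e9}  # bytes/s

def task_time(compute_s: float, data_bytes: float, tier: str) -> float:
    """Wall time = compute + I/O on the chosen storage tier."""
    return compute_s + data_bytes / TIER_BANDWIDTH[tier]

volume = 2e12  # 2 TB written per model step (assumed)
for tier in TIER_BANDWIDTH:
    print(f"{tier}: {task_time(600.0, volume, tier):.0f} s")
```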
{"title":"Potential of I/O Aware Workflows in Climate and Weather","authors":"J. Kunkel, L. Pedro","doi":"10.14529/jsfi200203","DOIUrl":"https://doi.org/10.14529/jsfi200203","url":null,"abstract":"The efficient, convenient, and robust execution of data-driven workflows and enhanced data management are essential for productivity in scientific computing. In HPC, the concerns of storage and computing are traditionally separated and optimised independently from each other and the needs of the end-to-end user. However, in complex workflows, this is becoming problematic. These problems are particularly acute in climate and weather workflows, which as well as becoming increasingly complex and exploiting deep storage hierarchies, can involve multiple data centres. The key contributions of this paper are: 1) A sketch of a vision for an integrated data-driven approach, with a discussion of the associated challenges and implications, and 2) An architecture and roadmap consistent with this vision that would allow a seamless integration into current climate and weather workflows as it utilises versions of existing tools (ESDM, Cylc, XIOS, and DDN’s IME). The vision proposed here is built on the belief that workflows composed of data, computing, and communication-intensive tasks should drive interfaces and hardware configurations to better support the programming models. When delivered, this work will increase the opportunity for smarter scheduling of computing by considering storage in heterogeneous storage systems. We illustrate the performance-impact on an example workload using a model built on measured performance data using ESDM at DKRZ.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125227594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Long Distance Geographically Distributed InfiniBand Based Computing
K. Niedzielewski, Marcin Semeniuk, Jaroslaw Skomial, J. Proficz, Piotr Sumioka, Bartosz Pliszka, M. Michalewicz
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200202

Collaboration between multiple computing centres, referred to as federated computing, is becoming an important pillar of High Performance Computing (HPC) and will be one of its key components in the future. To test the technical possibilities of future collaboration over a 100 Gb/s optical fiber link (the connection was 900 km long, with a 9 ms round-trip time), we prepared two scenarios of operation. In the first, the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) in Warsaw and the Centre of Informatics - Tricity Academic Supercomputer & networK (CI-TASK) in Gdansk built a long-distance, geographically distributed computing cluster. The system consisted of 14 nodes (10 at the ICM facility and 4 at the TASK facility) connected using InfiniBand. Our tests demonstrate that it is possible to perform computationally intensive data analysis on systems of this class without a substantial drop in performance for certain types of workloads. Additionally, we show that it is feasible to use High Performance ParalleX (HPX) [1], a high-level abstraction library for distributed computing, to develop software for such geographically distributed computing resources while maintaining the desired efficiency. In the second scenario, we prepared a distributed simulation-postprocessing-visualization workflow using ADIOS2 [2] and two programming languages (C++ and Python). In this test we demonstrate the capability of performing different parts of the analysis at separate sites.
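
A minimal sketch of the hand-off in the second scenario, written against the high-level Python API of ADIOS2 2.x (file name, variable name, and sizes are illustrative; in the experiment the producer and consumer ran at separate sites, and a streaming engine would replace the file shown here):

```python
# Producer (simulation side) and consumer (post-processing side) coupled
# through ADIOS2; high-level 2.x Python API, illustrative names and sizes.
import numpy as np
import adios2

with adios2.open("fields.bp", "w") as fw:          # simulation writes steps
    for step in range(5):
        t = np.full((8, 8), float(step), dtype=np.float64)
        fw.write("temperature", t, list(t.shape), [0, 0], list(t.shape),
                 end_step=True)

with adios2.open("fields.bp", "r") as fr:          # post-processing reads steps
    for fstep in fr:
        t = fstep.read("temperature")
        print(fstep.current_step(), float(t.mean()))
```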
{"title":"Long Distance Geographically Distributed InfiniBand Based Computing","authors":"K. Niedzielewski, Marcin Semeniuk, Jaroslaw Skomial, J. Proficz, Piotr Sumioka, Bartosz Pliszka, M. Michalewicz","doi":"10.14529/jsfi200202","DOIUrl":"https://doi.org/10.14529/jsfi200202","url":null,"abstract":"Collaboration between multiple computing centres, referred as federated computing is becoming important pillar of High Performance Computing (HPC) and will be one of its key components in the future. To test technical possibilities of future collaboration using 100Gb optic fiber link (Connection was 900 km in length with 9ms RTT time) we prepared two scenarios of operation. In the first one, Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) in Warsaw and Centre of Informatics - Tricity Academic Supercomputer & networK (CI-TASK) in Gdansk prepared a long distance geographically distributed computing cluster. System consisted of 14 nodes (10 nodes at ICM facility and 4 at TASK facility) connected using InfiniBand. Our tests demonstrate that it is possible to perform computationally intensive data analysis on systems of this class without substantial drop in performance for a certain type of workloads. Additionally, we show that it is feasible to use High Performance Parallex [1], high level abstraction libraries for distributed computing, to develop software for such geographically distributed computing resources and maintain desired efficiency. In the second scenario, we prepared distributed simulation-postprocessing-visualization workflow using ADIOS2 [2] and two programming languages (C++ and python). In this test we prove capabilities of performing different parts of analysis in seperate sites.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126457562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Building a Vision for Reproducibility in the Cyberinfrastructure Ecosystem: Leveraging Community Efforts
Dylan Chapp, V. Stodden, M. Taufer
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200106

The scientific computing community has long taken a leadership role in understanding and assessing the relationship of reproducibility to cyberinfrastructure, ensuring that computational results - such as those from simulations - are "reproducible", that is, that the same results are obtained when one reuses the same input data, methods, software, and analysis conditions. Starting almost a decade ago, the community has regularly published and advocated for advances in this area. In this article we trace this thinking and relate it to current national efforts, including the 2019 National Academies of Sciences, Engineering, and Medicine report on "Reproducibility and Replicability in Science". To this end, this work considers high performance computing workflows that combine traditional simulations (e.g., molecular dynamics simulations) with in situ analytics. We leverage an analysis of such workflows to (a) contextualize the report's recommendations in the HPC setting and (b) envision a path forward in the tradition of community-driven approaches to reproducibility and the acceleration of science and discovery. The work also articulates avenues for future research at the intersection of transparency, reproducibility, and computational infrastructure that supports scientific discovery.
{"title":"Building a Vision for Reproducibility in the Cyberinfrastructure Ecosystem: Leveraging Community Efforts","authors":"Dylan Chapp, V. Stodden, M. Taufer","doi":"10.14529/jsfi200106","DOIUrl":"https://doi.org/10.14529/jsfi200106","url":null,"abstract":"The scientific computing community has long taken a leadership role in understanding and assessing the relationship of reproducibility to cyberinfrastructure, ensuring that computational results - such as those from simulations - are \"reproducible\", that is, the same results are obtained when one re-uses the same input data, methods, software and analysis conditions. Starting almost a decade ago, the community has regularly published and advocated for advances in this area. In this article we trace this thinking and relate it to current national efforts, including the 2019 National Academies of Science, Engineering, and Medicine report on \"Reproducibility and Replication in Science\". To this end, this work considers high performance computing workflows that emphasize workflows combining traditional simulations (e.g. Molecular Dynamics simulations) with in situ analytics. We leverage an analysis of such workflows to (a) contextualize the 2019 National Academies of Science, Engineering, and Medicine report's recommendations in the HPC setting and (b) envision a path forward in the tradition of community driven approaches to reproducibility and the acceleration of science and discovery. The work also articulates avenues for future research at the intersection of transparency, reproducibility, and computational infrastructure that supports scientific discovery.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122854411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

State of the Art and Future Trends in Data Reduction for High-Performance Computing
Kira Duwe, Jakob Lüttgau, Georgiana Mania, Jannek Squar, A. Fuchs, Michael Kuhn, Eugen Betke, T. Ludwig
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200101

Research into data reduction techniques has gained popularity in recent years as storage capacity and performance become a growing concern. This survey paper provides an overview of the points of leverage found in high-performance computing (HPC) systems and of mechanisms suitable for reducing data volumes. We present the underlying theories and their application throughout the HPC stack, and we also discuss related hardware acceleration and reduction approaches. After introducing relevant use cases, we give an overview of modern lossless and lossy compression algorithms and their respective usage at the application and file system layers. In anticipation of their increasing relevance for adaptive and in situ approaches, dimensionality reduction techniques are summarized with a focus on non-linear feature extraction. Adaptive approaches and in situ compression algorithms and frameworks follow. The key stages of, and new opportunities for, deduplication are covered next. Finally, we propose recomputation as an unconventional but promising method. We conclude the survey with an outlook on future developments.
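
As a minimal illustration of the lossless/lossy distinction the survey covers, the sketch below compresses the same synthetic field with zlib directly (lossless) and after quantizing doubles to single precision (a crude lossy reduction); production HPC compressors are considerably more sophisticated.

```python
# Lossless vs. crude lossy data reduction on a smooth synthetic field.
import zlib
import numpy as np

rng = np.random.default_rng(0)
field = np.cumsum(rng.normal(size=100_000))        # smooth "simulation" data

raw = field.tobytes()                              # 800,000 bytes of float64
lossless = zlib.compress(raw, level=6)
lossy = zlib.compress(field.astype(np.float32).tobytes(), level=6)

print(f"raw {len(raw)} B, lossless {len(lossless)} B, "
      f"lossy+lossless {len(lossy)} B")
```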
{"title":"State of the Art and Future Trends in Data Reduction for High-Performance Computing","authors":"Kira Duwe, Jakob Lüttgau, Georgiana Mania, Jannek Squar, A. Fuchs, Michael Kuhn, Eugen Betke, T. Ludwig","doi":"10.14529/jsfi200101","DOIUrl":"https://doi.org/10.14529/jsfi200101","url":null,"abstract":"Research into data reduction techniques has gained popularity in recent years as storage capacity and performance become a growing concern. This survey paper provides an overview of leveraging points found in high-performance computing (HPC) systems and suitable mechanisms to reduce data volumes. We present the underlying theories and their application throughout the HPC stack and also discuss related hardware acceleration and reduction approaches. After introducing relevant use-cases, an overview of modern lossless and lossy compression algorithms and their respective usage at the application and file system layer is given. In anticipation of their increasing relevance for adaptive and in situ approaches, dimensionality reduction techniques are summarized with a focus on non-linear feature extraction. Adaptive approaches and in situ compression algorithms and frameworks follow. The key stages and new opportunities to deduplication are covered next. An unconventional but promising method is recomputation, which is proposed at last. We conclude the survey with an outlook on future developments.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127201028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Supercomputing Technologies as Drive for Development of Enterprise Information Systems and Digital Economy
O. V. Loginovsky, A. Shestakov, A. Shinkarev
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200103

The article presents an analysis of approaches to the development of enterprise information systems that are in use today. One of the major trends that predetermines the agenda of information technology is the focus on parallel processing of large volumes of data using supercomputing technologies. The article considers the resulting ubiquitous move toward distributed patterns of building enterprise information systems and away from monolithic architectures. The emphasis is placed on the importance of such fundamental characteristics of enterprise information systems as reliability, scalability, and maintainability. The article justifies the importance of machine learning in the context of effective big data analysis and competitive advantage for business, vital both for maintaining a leading position in the market and for surviving under conditions of global instability and the digitalization of the economy. The transition from storing the current state of an enterprise information system to storing a full log and history of all changes in the event stream is proposed as an instrument for achieving linearization of the data stream for subsequent parallel computing. Finally, a new profile of specialist is taking shape at the intersection of engineering and analytical disciplines: one able to effectively develop scalable systems and algorithms for data processing and to integrate the results into company business processes.
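
A minimal sketch of the event-stream idea referenced above (the domain and names are illustrative, not from the article): every change is appended to a log, the current state is derived from it, and any state can be reconstructed by replaying the linearized stream.

```python
# Event sourcing in miniature: append-only log plus state derived by replay.
from dataclasses import dataclass, field

@dataclass
class Account:
    balance: float = 0.0
    log: list = field(default_factory=list)

    def apply(self, event: dict) -> None:
        self.log.append(event)           # full history, never overwritten
        self.balance += event["delta"]   # current state as a derived view

    @classmethod
    def replay(cls, events):
        acc = cls()
        for e in events:
            acc.apply(e)
        return acc

a = Account()
for delta in (100.0, -30.0, 15.0):
    a.apply({"delta": delta})
b = Account.replay(a.log)                # state rebuilt from the log alone
assert a.balance == b.balance == 85.0
```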
{"title":"Supercomputing Technologies as Drive for Development of Enterprise Information Systems and Digital Economy","authors":"O. V. Loginovsky, A. Shestakov, A. Shinkarev","doi":"10.14529/jsfi200103","DOIUrl":"https://doi.org/10.14529/jsfi200103","url":null,"abstract":"The article presents an analysis of approaches to the development of enterprise information systems that are in use today. One of the major trends that predetermines the agenda of information technology is the focus on parallel computing of large volumes of data using supercomputing technologies. The article considers the resulting ubiquitous move to distributed patterns of building enterprise information systems and avoiding monolithic architectures. The emphasis is placed on the importance of such fundamental characteristics of enterprise information systems as reliability, scalability, and maintainability. The article justifies the importance of machine learning in the context of effective big data analysis and competitive gain for business, vital for both maintaining a leading position in the market and surviving in conditions of global instability and digitalization of economy. Transition from storing the current state of a enterprise information system to storing a full log and history of all changes in the event stream is proposed as an instrument of achieving linearization of the data stream for subsequent parallel computing. There is a new view that is being shaped of specialists at the intersection of engineering and analytical disciplines, who would be able to effectively develop scalable systems and algorithms for data processing and integration of its results into company business processes.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125557407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems
Mulya Agung, Muhammad Alfian Amrizal, Ryusuke Egawa, H. Takizawa
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200104

Mapping MPI processes to processor cores, called process mapping, is crucial to achieving scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve communication locality and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, memory congestion can degrade performance more severely than poor locality, because heavy congestion on shared caches and memory controllers can cause long latencies. Most existing work focuses only on improving locality or relies on offline profiling to analyze the communication behavior. We propose a process mapping method that performs the mapping dynamically, adapting to communication behavior while coordinating locality and memory congestion. Our method works online, during the execution of an MPI application. It requires no modifications to the application, no prior knowledge of the communication behavior, and no changes to the hardware or operating system. Experimental results show that our method can achieve performance and energy efficiency close to those of the best static mapping method, with low overhead on application execution. In experiments with the NAS Parallel Benchmarks on a NUMA system, the performance and total energy improvements are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.
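
One operating-system building block such an online mapper relies on is the ability to re-pin a running process to different cores without restarting the application. The Linux-only sketch below shows the mechanism; the two-core target set stands in for the outcome of the locality/congestion analysis the method actually performs.

```python
# Re-pinning the current process at runtime (Linux; requires >= 2 cores).
import os

pid = os.getpid()
print("current affinity:", os.sched_getaffinity(pid))
os.sched_setaffinity(pid, {0, 1})    # migrate this process to cores 0 and 1
print("new affinity:", os.sched_getaffinity(pid))
```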
{"title":"Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems","authors":"Mulya Agung, Muhammad Alfian Amrizal, Ryusuke Egawa, H. Takizawa","doi":"10.14529/jsfi200104","DOIUrl":"https://doi.org/10.14529/jsfi200104","url":null,"abstract":"Mapping MPI processes to processor cores, called process mapping, is crucial to achieving the scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve the communication locality, and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, the memory congestion problem could degrade performance more severely than the locality problem because heavy congestion on shared caches and memory controllers could cause long latencies. Most of the existing work focus only on improving the locality or rely on offline profiling to analyze the communication behavior. We propose a process mapping method that dynamically performs the process mapping for adapting to communication behaviors while coordinating the locality and memory congestion. Our method works online during the execution of an MPI application. It does not require modifications to the application, previous knowledge of the communication behavior, or changes to the hardware and operating system. Experimental results show that our method can achieve performance and energy efficiency close to the best static mapping method with low overhead to the application execution. In experiments with the NAS parallel benchmarks on a NUMA system, the performance and total energy improvements are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133371988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Tools for GPU Computing - Debugging and Performance Analysis of Heterogenous HPC Applications
Michael Knobloch, B. Mohr
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200105

General-purpose GPUs are now ubiquitous in high-end supercomputing. All of the announced (pre-)exascale systems but one (the Japanese Fugaku system, which is based on ARM processors) contain vast numbers of GPUs that deliver the majority of those systems' performance. Thus, GPU programming will be a necessity for application developers using high-end HPC systems. However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs, as orchestrating all the resources of such a system imposes a tremendous challenge on developers. Luckily, a rich ecosystem of tools exists to assist developers in every step of developing a GPU application at any scale. In this paper we present an overview of these tools and discuss their capabilities. We start with an overview of the GPU programming models, from low-level programming with CUDA, through pragma-based models like OpenACC, to high-level approaches like Kokkos, and we discuss their respective tool interfaces, the main mechanism by which tools obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools: debuggers and performance analysis tools. Debuggers help the developer identify problems on the CPU side, on the GPU side, and in the interplay of both. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help to increase the overall performance.
{"title":"Tools for GPU Computing - Debugging and Performance Analysis of Heterogenous HPC Applications","authors":"Michael Knobloch, B. Mohr","doi":"10.14529/jsfi200105","DOIUrl":"https://doi.org/10.14529/jsfi200105","url":null,"abstract":"General purpose GPUs are now ubiquitous in high-end supercomputing. All but one (the Japanese Fugaku system, which is based on ARM processors) of the announced (pre-)exascale systems contain vast amounts of GPUs that deliver the majority of the performance of these systems. Thus, GPU programming will be a necessity for application developers using high-end HPC systems.However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs. Orchestrating all the resources of such a system imposes a tremendous challenge to developers. Luckily a rich ecosystem of tools exist to assist developers in every development step of a GPU application at all scales. In this paper we present an overview of these tools and discuss their capabilities. We start with an overview of different GPU programming models, from low-level with CUDA over pragma-based models like OpenACC to high-level approaches like Kokkos. We discuss their respective tool interfaces as the main method for tools to obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools, debuggers and performance analysis tools. Debuggers help the developer to identify problems both on the CPU and GPU side as well as in the interplay of both. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help to increase the overall performance.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"94 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124162810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer
Kento Aoyama, Masanori Kakuta, Yuri Matsuzaki, T. Ishida, M. Ohue, Y. Akiyama
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200102

Pipeline software, comprising chains of tools and applications for specific data processing, has found extensive use in the analysis of several data types in bioinformatics research, such as genome data. Recent trends in genome analysis require pipeline software to make optimum use of computational resources, so as to handle efficiently the large-scale biological data accumulated on a daily basis. However, the use of pipeline software in bioinformatics tends to be problematic owing to large memory and storage capacity requirements, an increasing number of job submissions, and a wide range of software dependencies. This paper presents massively parallel genome/exome analysis pipeline software that addresses these difficulties and can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that performs effectively by taking the task-dependency graph of internal executions into account, via an extension of a dynamic task distribution framework. Performance results for the core pipeline functionality, obtained in evaluation experiments on an actual exome dataset, demonstrate good scalability when using over a thousand nodes. Additionally, this study proposes several approaches for resolving the performance bottlenecks of a pipeline by exploiting domain knowledge of the internal pipeline executions, a major challenge facing pipeline parallelization.
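
A schematic of dependency-aware dynamic task distribution of the kind described (the stages and executor are illustrative, not the authors' implementation): each task is submitted as soon as all of its predecessors in the dependency graph have finished.

```python
# Dynamic dispatch over a task-dependency graph with a worker pool.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

deps = {"align": set(), "sort": {"align"}, "call": {"sort"},
        "qc": {"align"}, "report": {"call", "qc"}}   # hypothetical stages

def run(task: str) -> str:
    print("running", task)   # stand-in for invoking the real tool chain
    return task

done, running = set(), {}
with ThreadPoolExecutor(max_workers=4) as pool:
    while len(done) < len(deps):
        for task, preds in deps.items():             # submit all ready tasks
            if task not in done and task not in running and preds <= done:
                running[task] = pool.submit(run, task)
        finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
        for task in [t for t, f in running.items() if f in finished]:
            done.add(task)
            del running[task]
```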
{"title":"Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer","authors":"Kento Aoyama, Masanori Kakuta, Yuri Matsuzaki, T. Ishida, M. Ohue, Y. Akiyama","doi":"10.14529/jsfi200102","DOIUrl":"https://doi.org/10.14529/jsfi200102","url":null,"abstract":"Pipeline software that comprise tool and application chains for specific data processing have found extensive utilization in the analysis of several data types, such as genome, in bioinformatics research. Recent trends in genome analysis require use of pipeline software for optimum utilization of computational resources, thereby facilitating efficient handling of large-scale biological data accumulated on a daily basis. However, use of pipeline software in bioinformatics tends to be problematic owing to their large memory and storage capacity requirements, increasing number of job submissions, and a wide range of software dependencies. This paper presents a massive parallel genome/exome analysis pipeline software that addresses these difficulties. Additionally, it can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that performs effectively when considering the task-dependency graph of internal executions via extension of the dynamic task distribution framework. Performance results pertaining to the core pipeline functionality, obtained via evaluation experiments performed using an actual exome dataset, demonstrate good scalability when using over a thousand nodes. Additionally, this study proposes several approaches to resolve performance bottlenecks of a pipeline by considering the domain knowledge pertaining to internal pipeline executions as a major challenge facing pipeline parallelization.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123652541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}