Dawn: a High-level Domain-Specific Language Compiler Toolchain for Weather and Climate Applications
C. Osuna, Tobias Wicky, Fabian Thuering, T. Hoefler, O. Fuhrer
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200205

High-level programming languages that let scientists express numerical methods while generating efficient parallel implementations are of key importance for the productivity of domain scientists. The diversity and complexity of hardware architectures poses a major challenge for large and complex models, which must be ported to and maintained on multiple architectures combining various parallel programming models. Several domain-specific languages (DSLs) have been developed to address the portability problem, but they usually impose a parallel model tied to specific numerical methods and support optimizations only for operators of limited scope. Dawn provides a concise, high-level language for expressing numerical finite-difference/finite-volume methods in a sequential, descriptive style. The sequential statements are transformed into an efficient, target-dependent parallel implementation by the Dawn compiler toolchain. We demonstrate our approach on the dynamical solver of the COSMO model, achieving performance improvements of up to 2x and code size reductions of up to 5x.
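
As a flavor of what this looks like in practice, the sketch below states a horizontal Laplacian, a typical finite-difference stencil, as plain sequential NumPy. This is illustrative only: a toolchain like Dawn consumes such sequential, descriptive statements in its own frontend syntax (not the Python shown here) and compiles them into parallel, target-dependent code.

```python
# Illustrative sequential stencil, NOT Dawn syntax: the kind of descriptive
# finite-difference update a DSL compiler would parallelize per target.
import numpy as np

def horizontal_laplacian(phi: np.ndarray) -> np.ndarray:
    """4-point Laplacian on the interior of a 2D field (halo width 1)."""
    lap = np.zeros_like(phi)
    lap[1:-1, 1:-1] = (phi[2:, 1:-1] + phi[:-2, 1:-1]
                       + phi[1:-1, 2:] + phi[1:-1, :-2]
                       - 4.0 * phi[1:-1, 1:-1])
    return lap

phi = np.random.rand(64, 64)
print(horizontal_laplacian(phi).shape)  # (64, 64)
```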
{"title":"Dawn: a High-level Domain-Specific Language Compiler Toolchain for Weather and Climate Applications","authors":"C. Osuna, Tobias Wicky, Fabian Thuering, T. Hoefler, O. Fuhrer","doi":"10.14529/jsfi200205","DOIUrl":"https://doi.org/10.14529/jsfi200205","url":null,"abstract":"High-level programming languages that allow to express numerical methods and generate efficient parallel implementations are of key importance for the productivity of domain-scientists. The diversity and complexity of hardware architectures is imposing a huge challenge for large and complex models that must be ported and maintained for multiple architectures combining various parallel programming models. Several domain-specific languages (DSLs) have been developed to address the portability problem, but they usually impose a parallel model for specific numerical methods and support optimizations for limited scope operators. Dawn provides a high-level concise language for expressing numerical finite difference/volume methods using a sequential and descriptive language. The sequential statements are transformed into an efficient target-dependent parallel implementation by the Dawn compiler toolchain. We demonstrate our approach on the dynamical solver of the COSMO model, achieving performance improvements and code size reduction of up to 2x and 5x, respectively.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"117 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-07-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132662979","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Performance Reduction For Automatic Development of Parallel Applications For Reconfigurable Computer Systems
A. Dordopulo, I. Levin
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200201

In this paper, we review a suboptimal methodology for mapping a task information graph onto the architecture of a reconfigurable computer system. Using performance reduction methods, we can solve computational problems whose hardware costs exceed the available hardware resources. We prove theorems concerning the properties of sequential reductions; in our case, the reduction types are reduction by the number of basic subgraphs, by the number of computing devices, and by data width. On the basis of the proved theorems and their corollaries, we develop a methodology of reduction transformations of a task information graph for its automatic adaptation to the architecture of a reconfigurable computer system. We estimate the maximum number of transformations that, according to the suggested methodology, are needed for a balanced reduction of the performance and hardware costs of applications for reconfigurable computer systems.
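
The trade-off being balanced can be made concrete with a toy model (an assumption of ours, not the authors' methodology): each reduction step exchanges a factor of two in performance for a factor of two in hardware cost, and steps are applied until the task information graph fits the available resource.

```python
def balanced_reduction(hw_cost: float, hw_available: float):
    """Apply successive 2x performance reductions until the design fits.

    Returns (reduction_steps, resulting_slowdown). The fixed factor-of-2
    cost/performance exchange is an illustrative assumption.
    """
    steps = 0
    while hw_cost > hw_available:
        hw_cost /= 2.0  # e.g., halve the number of basic subgraphs or devices
        steps += 1
    return steps, 2 ** steps

print(balanced_reduction(1000.0, 80.0))  # -> (4, 16): fits after 4 steps, 16x slower
```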
{"title":"Performance Reduction For Automatic Development of Parallel Applications For Reconfigurable Computer Systems","authors":"A. Dordopulo, I. Levin","doi":"10.14529/jsfi200201","DOIUrl":"https://doi.org/10.14529/jsfi200201","url":null,"abstract":"In the paper, we review a suboptimal methodology of mapping of a task information graph on the architecture of a reconfigurable computer system. Using performance reduction methods, we can solve computational problems which need hardware costs exceeding the available hardware resource. We proved theorems, concerning properties of sequential reductions. In our case, we have the following types of reduction such as the reduction by number of basic subgraphs, by number of computing devices, and by data width. On the base of the proved theorems and corollaries, we developed the methodology of reduction transformations of a task information graph for its automatic adaptation to the architecture of a reconfigurable computer system. We estimated the maximum number of transformations, which, according to the suggested methodology, are needed for balanced reduction of the performance and hardware costs of applications for reconfigurable computer systems.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122516805","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Potential of I/O Aware Workflows in Climate and Weather
J. Kunkel, L. Pedro
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200203

The efficient, convenient, and robust execution of data-driven workflows and enhanced data management are essential for productivity in scientific computing. In HPC, the concerns of storage and computing are traditionally separated and optimised independently from each other and from the needs of the end-to-end user. In complex workflows, however, this is becoming problematic. These problems are particularly acute in climate and weather workflows, which, as well as becoming increasingly complex and exploiting deep storage hierarchies, can involve multiple data centres. The key contributions of this paper are: 1) a sketch of a vision for an integrated data-driven approach, with a discussion of the associated challenges and implications, and 2) an architecture and roadmap consistent with this vision that would allow seamless integration into current climate and weather workflows, as it utilises versions of existing tools (ESDM, Cylc, XIOS, and DDN's IME). The vision proposed here is built on the belief that workflows composed of data-, computing-, and communication-intensive tasks should drive interfaces and hardware configurations to better support the programming models. When delivered, this work will increase the opportunity for smarter scheduling of computing by considering storage in heterogeneous storage systems. We illustrate the performance impact on an example workload using a model built on performance data measured with ESDM at DKRZ.
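
The closing claim can be illustrated with a toy cost model in the spirit of the one the paper builds (the bandwidth and volume figures below are placeholders, not the DKRZ measurements): a task's wall time is its compute time plus its data volume divided by the bandwidth of the storage tier it is scheduled against, so tier-aware scheduling directly changes workflow runtime.

```python
# Toy I/O-aware cost model; all numbers are illustrative assumptions.
TIER_BANDWIDTH = {"parallel_fs": 100e9, "burst_buffer": 1000e9}  # bytes/s

def task_time(compute_s: float, data_bytes: float, tier: str) -> float:
    """Wall time = compute + I/O on the chosen storage tier."""
    return compute_s + data_bytes / TIER_BANDWIDTH[tier]

volume = 2e12  # 2 TB written per model step (assumed)
for tier in TIER_BANDWIDTH:
    print(f"{tier}: {task_time(600.0, volume, tier):.0f} s")
```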
{"title":"Potential of I/O Aware Workflows in Climate and Weather","authors":"J. Kunkel, L. Pedro","doi":"10.14529/jsfi200203","DOIUrl":"https://doi.org/10.14529/jsfi200203","url":null,"abstract":"The efficient, convenient, and robust execution of data-driven workflows and enhanced data management are essential for productivity in scientific computing. In HPC, the concerns of storage and computing are traditionally separated and optimised independently from each other and the needs of the end-to-end user. However, in complex workflows, this is becoming problematic. These problems are particularly acute in climate and weather workflows, which as well as becoming increasingly complex and exploiting deep storage hierarchies, can involve multiple data centres. The key contributions of this paper are: 1) A sketch of a vision for an integrated data-driven approach, with a discussion of the associated challenges and implications, and 2) An architecture and roadmap consistent with this vision that would allow a seamless integration into current climate and weather workflows as it utilises versions of existing tools (ESDM, Cylc, XIOS, and DDN’s IME). The vision proposed here is built on the belief that workflows composed of data, computing, and communication-intensive tasks should drive interfaces and hardware configurations to better support the programming models. When delivered, this work will increase the opportunity for smarter scheduling of computing by considering storage in heterogeneous storage systems. We illustrate the performance-impact on an example workload using a model built on measured performance data using ESDM at DKRZ.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"30 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125227594","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Long Distance Geographically Distributed InfiniBand Based Computing
K. Niedzielewski, Marcin Semeniuk, Jaroslaw Skomial, J. Proficz, Piotr Sumioka, Bartosz Pliszka, M. Michalewicz
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200202

Collaboration between multiple computing centres, referred to as federated computing, is becoming an important pillar of High Performance Computing (HPC) and will be one of its key components in the future. To test the technical possibilities of future collaboration over a 100 Gb/s optical fiber link (the connection was 900 km long, with a 9 ms round-trip time), we prepared two scenarios of operation. In the first, the Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) in Warsaw and the Centre of Informatics - Tricity Academic Supercomputer & networK (CI-TASK) in Gdansk built a long-distance, geographically distributed computing cluster. The system consisted of 14 nodes (10 at the ICM facility and 4 at the TASK facility) connected using InfiniBand. Our tests demonstrate that it is possible to perform computationally intensive data analysis on systems of this class without a substantial drop in performance for certain types of workloads. Additionally, we show that it is feasible to use High Performance ParalleX (HPX) [1], a high-level abstraction library for distributed computing, to develop software for such geographically distributed computing resources while maintaining the desired efficiency. In the second scenario, we prepared a distributed simulation-postprocessing-visualization workflow using ADIOS2 [2] and two programming languages (C++ and Python). In this test we demonstrate the capability of performing different parts of the analysis at separate sites.
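
A minimal sketch of the hand-off in the second scenario, written against the high-level Python API of ADIOS2 2.x (file name, variable name, and sizes are illustrative; in the experiment the producer and consumer ran at separate sites, and a streaming engine would replace the file shown here):

```python
# Producer (simulation side) and consumer (post-processing side) coupled
# through ADIOS2; high-level 2.x Python API, illustrative names and sizes.
import numpy as np
import adios2

with adios2.open("fields.bp", "w") as fw:          # simulation writes steps
    for step in range(5):
        t = np.full((8, 8), float(step), dtype=np.float64)
        fw.write("temperature", t, list(t.shape), [0, 0], list(t.shape),
                 end_step=True)

with adios2.open("fields.bp", "r") as fr:          # post-processing reads steps
    for fstep in fr:
        t = fstep.read("temperature")
        print(fstep.current_step(), float(t.mean()))
```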
{"title":"Long Distance Geographically Distributed InfiniBand Based Computing","authors":"K. Niedzielewski, Marcin Semeniuk, Jaroslaw Skomial, J. Proficz, Piotr Sumioka, Bartosz Pliszka, M. Michalewicz","doi":"10.14529/jsfi200202","DOIUrl":"https://doi.org/10.14529/jsfi200202","url":null,"abstract":"Collaboration between multiple computing centres, referred as federated computing is becoming important pillar of High Performance Computing (HPC) and will be one of its key components in the future. To test technical possibilities of future collaboration using 100Gb optic fiber link (Connection was 900 km in length with 9ms RTT time) we prepared two scenarios of operation. In the first one, Interdisciplinary Centre for Mathematical and Computational Modelling (ICM) in Warsaw and Centre of Informatics - Tricity Academic Supercomputer & networK (CI-TASK) in Gdansk prepared a long distance geographically distributed computing cluster. System consisted of 14 nodes (10 nodes at ICM facility and 4 at TASK facility) connected using InfiniBand. Our tests demonstrate that it is possible to perform computationally intensive data analysis on systems of this class without substantial drop in performance for a certain type of workloads. Additionally, we show that it is feasible to use High Performance Parallex [1], high level abstraction libraries for distributed computing, to develop software for such geographically distributed computing resources and maintain desired efficiency. In the second scenario, we prepared distributed simulation-postprocessing-visualization workflow using ADIOS2 [2] and two programming languages (C++ and python). In this test we prove capabilities of performing different parts of analysis in seperate sites.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"39 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-06-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126457562","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Building a Vision for Reproducibility in the Cyberinfrastructure Ecosystem: Leveraging Community Efforts
Dylan Chapp, V. Stodden, M. Taufer
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200106

The scientific computing community has long taken a leadership role in understanding and assessing the relationship of reproducibility to cyberinfrastructure, ensuring that computational results - such as those from simulations - are "reproducible", that is, that the same results are obtained when one reuses the same input data, methods, software, and analysis conditions. Starting almost a decade ago, the community has regularly published and advocated for advances in this area. In this article we trace this thinking and relate it to current national efforts, including the 2019 National Academies of Sciences, Engineering, and Medicine report on "Reproducibility and Replicability in Science". To this end, this work considers high performance computing workflows that combine traditional simulations (e.g., molecular dynamics simulations) with in situ analytics. We leverage an analysis of such workflows to (a) contextualize the report's recommendations in the HPC setting and (b) envision a path forward in the tradition of community-driven approaches to reproducibility and the acceleration of science and discovery. The work also articulates avenues for future research at the intersection of transparency, reproducibility, and computational infrastructure that supports scientific discovery.
{"title":"Building a Vision for Reproducibility in the Cyberinfrastructure Ecosystem: Leveraging Community Efforts","authors":"Dylan Chapp, V. Stodden, M. Taufer","doi":"10.14529/jsfi200106","DOIUrl":"https://doi.org/10.14529/jsfi200106","url":null,"abstract":"The scientific computing community has long taken a leadership role in understanding and assessing the relationship of reproducibility to cyberinfrastructure, ensuring that computational results - such as those from simulations - are \"reproducible\", that is, the same results are obtained when one re-uses the same input data, methods, software and analysis conditions. Starting almost a decade ago, the community has regularly published and advocated for advances in this area. In this article we trace this thinking and relate it to current national efforts, including the 2019 National Academies of Science, Engineering, and Medicine report on \"Reproducibility and Replication in Science\". To this end, this work considers high performance computing workflows that emphasize workflows combining traditional simulations (e.g. Molecular Dynamics simulations) with in situ analytics. We leverage an analysis of such workflows to (a) contextualize the 2019 National Academies of Science, Engineering, and Medicine report's recommendations in the HPC setting and (b) envision a path forward in the tradition of community driven approaches to reproducibility and the acceleration of science and discovery. The work also articulates avenues for future research at the intersection of transparency, reproducibility, and computational infrastructure that supports scientific discovery.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"63 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122854411","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

State of the Art and Future Trends in Data Reduction for High-Performance Computing
Kira Duwe, Jakob Lüttgau, Georgiana Mania, Jannek Squar, A. Fuchs, Michael Kuhn, Eugen Betke, T. Ludwig
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200101

Research into data reduction techniques has gained popularity in recent years as storage capacity and performance become a growing concern. This survey paper provides an overview of the points of leverage found in high-performance computing (HPC) systems and of mechanisms suitable for reducing data volumes. We present the underlying theories and their application throughout the HPC stack, and we also discuss related hardware acceleration and reduction approaches. After introducing relevant use cases, we give an overview of modern lossless and lossy compression algorithms and their respective usage at the application and file system layers. In anticipation of their increasing relevance for adaptive and in situ approaches, dimensionality reduction techniques are summarized with a focus on non-linear feature extraction. Adaptive approaches and in situ compression algorithms and frameworks follow. The key stages of, and new opportunities for, deduplication are covered next. Finally, we propose recomputation as an unconventional but promising method. We conclude the survey with an outlook on future developments.
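
As a minimal illustration of the lossless/lossy distinction the survey covers, the sketch below compresses the same synthetic field with zlib directly (lossless) and after quantizing doubles to single precision (a crude lossy reduction); production HPC compressors are considerably more sophisticated.

```python
# Lossless vs. crude lossy data reduction on a smooth synthetic field.
import zlib
import numpy as np

rng = np.random.default_rng(0)
field = np.cumsum(rng.normal(size=100_000))        # smooth "simulation" data

raw = field.tobytes()                              # 800,000 bytes of float64
lossless = zlib.compress(raw, level=6)
lossy = zlib.compress(field.astype(np.float32).tobytes(), level=6)

print(f"raw {len(raw)} B, lossless {len(lossless)} B, "
      f"lossy+lossless {len(lossy)} B")
```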
{"title":"State of the Art and Future Trends in Data Reduction for High-Performance Computing","authors":"Kira Duwe, Jakob Lüttgau, Georgiana Mania, Jannek Squar, A. Fuchs, Michael Kuhn, Eugen Betke, T. Ludwig","doi":"10.14529/jsfi200101","DOIUrl":"https://doi.org/10.14529/jsfi200101","url":null,"abstract":"Research into data reduction techniques has gained popularity in recent years as storage capacity and performance become a growing concern. This survey paper provides an overview of leveraging points found in high-performance computing (HPC) systems and suitable mechanisms to reduce data volumes. We present the underlying theories and their application throughout the HPC stack and also discuss related hardware acceleration and reduction approaches. After introducing relevant use-cases, an overview of modern lossless and lossy compression algorithms and their respective usage at the application and file system layer is given. In anticipation of their increasing relevance for adaptive and in situ approaches, dimensionality reduction techniques are summarized with a focus on non-linear feature extraction. Adaptive approaches and in situ compression algorithms and frameworks follow. The key stages and new opportunities to deduplication are covered next. An unconventional but promising method is recomputation, which is proposed at last. We conclude the survey with an outlook on future developments.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"10 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127201028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Supercomputing Technologies as Drive for Development of Enterprise Information Systems and Digital Economy
O. V. Loginovsky, A. Shestakov, A. Shinkarev
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200103

The article presents an analysis of approaches to the development of enterprise information systems that are in use today. One of the major trends that predetermines the agenda of information technology is the focus on parallel processing of large volumes of data using supercomputing technologies. The article considers the resulting ubiquitous move toward distributed patterns of building enterprise information systems and away from monolithic architectures. The emphasis is placed on the importance of such fundamental characteristics of enterprise information systems as reliability, scalability, and maintainability. The article justifies the importance of machine learning in the context of effective big data analysis and competitive advantage for business, vital both for maintaining a leading position in the market and for surviving under conditions of global instability and the digitalization of the economy. The transition from storing the current state of an enterprise information system to storing a full log and history of all changes in the event stream is proposed as an instrument for achieving linearization of the data stream for subsequent parallel computing. Finally, a new profile of specialist is taking shape at the intersection of engineering and analytical disciplines: one able to effectively develop scalable systems and algorithms for data processing and to integrate the results into company business processes.
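
A minimal sketch of the event-stream idea referenced above (the domain and names are illustrative, not from the article): every change is appended to a log, the current state is derived from it, and any state can be reconstructed by replaying the linearized stream.

```python
# Event sourcing in miniature: append-only log plus state derived by replay.
from dataclasses import dataclass, field

@dataclass
class Account:
    balance: float = 0.0
    log: list = field(default_factory=list)

    def apply(self, event: dict) -> None:
        self.log.append(event)           # full history, never overwritten
        self.balance += event["delta"]   # current state as a derived view

    @classmethod
    def replay(cls, events):
        acc = cls()
        for e in events:
            acc.apply(e)
        return acc

a = Account()
for delta in (100.0, -30.0, 15.0):
    a.apply({"delta": delta})
b = Account.replay(a.log)                # state rebuilt from the log alone
assert a.balance == b.balance == 85.0
```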
{"title":"Supercomputing Technologies as Drive for Development of Enterprise Information Systems and Digital Economy","authors":"O. V. Loginovsky, A. Shestakov, A. Shinkarev","doi":"10.14529/jsfi200103","DOIUrl":"https://doi.org/10.14529/jsfi200103","url":null,"abstract":"The article presents an analysis of approaches to the development of enterprise information systems that are in use today. One of the major trends that predetermines the agenda of information technology is the focus on parallel computing of large volumes of data using supercomputing technologies. The article considers the resulting ubiquitous move to distributed patterns of building enterprise information systems and avoiding monolithic architectures. The emphasis is placed on the importance of such fundamental characteristics of enterprise information systems as reliability, scalability, and maintainability. The article justifies the importance of machine learning in the context of effective big data analysis and competitive gain for business, vital for both maintaining a leading position in the market and surviving in conditions of global instability and digitalization of economy. Transition from storing the current state of a enterprise information system to storing a full log and history of all changes in the event stream is proposed as an instrument of achieving linearization of the data stream for subsequent parallel computing. There is a new view that is being shaped of specialists at the intersection of engineering and analytical disciplines, who would be able to effectively develop scalable systems and algorithms for data processing and integration of its results into company business processes.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125557407","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems
Mulya Agung, Muhammad Alfian Amrizal, Ryusuke Egawa, H. Takizawa
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200104

Mapping MPI processes to processor cores, called process mapping, is crucial to achieving scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve communication locality and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, memory congestion can degrade performance more severely than poor locality, because heavy congestion on shared caches and memory controllers can cause long latencies. Most existing work focuses only on improving locality or relies on offline profiling to analyze the communication behavior. We propose a process mapping method that performs the mapping dynamically, adapting to communication behavior while coordinating locality and memory congestion. Our method works online, during the execution of an MPI application. It requires no modifications to the application, no prior knowledge of the communication behavior, and no changes to the hardware or operating system. Experimental results show that our method can achieve performance and energy efficiency close to those of the best static mapping method, with low overhead on application execution. In experiments with the NAS Parallel Benchmarks on a NUMA system, the performance and total energy improvements are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.
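
One operating-system building block such an online mapper relies on is the ability to re-pin a running process to different cores without restarting the application. The Linux-only sketch below shows the mechanism; the two-core target set stands in for the outcome of the locality/congestion analysis the method actually performs.

```python
# Re-pinning the current process at runtime (Linux; requires >= 2 cores).
import os

pid = os.getpid()
print("current affinity:", os.sched_getaffinity(pid))
os.sched_setaffinity(pid, {0, 1})    # migrate this process to cores 0 and 1
print("new affinity:", os.sched_getaffinity(pid))
```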
{"title":"Online MPI Process Mapping for Coordinating Locality and Memory Congestion on NUMA Systems","authors":"Mulya Agung, Muhammad Alfian Amrizal, Ryusuke Egawa, H. Takizawa","doi":"10.14529/jsfi200104","DOIUrl":"https://doi.org/10.14529/jsfi200104","url":null,"abstract":"Mapping MPI processes to processor cores, called process mapping, is crucial to achieving the scalable performance on multi-core processors. By analyzing the communication behavior among MPI processes, process mapping can improve the communication locality, and thus reduce the overall communication cost. However, on modern non-uniform memory access (NUMA) systems, the memory congestion problem could degrade performance more severely than the locality problem because heavy congestion on shared caches and memory controllers could cause long latencies. Most of the existing work focus only on improving the locality or rely on offline profiling to analyze the communication behavior. We propose a process mapping method that dynamically performs the process mapping for adapting to communication behaviors while coordinating the locality and memory congestion. Our method works online during the execution of an MPI application. It does not require modifications to the application, previous knowledge of the communication behavior, or changes to the hardware and operating system. Experimental results show that our method can achieve performance and energy efficiency close to the best static mapping method with low overhead to the application execution. In experiments with the NAS parallel benchmarks on a NUMA system, the performance and total energy improvements are up to 34% (18.5% on average) and 28.9% (13.6% on average), respectively. In experiments with two GROMACS applications on a larger NUMA system, the average improvements in performance and total energy consumption are 21.6% and 12.6%, respectively.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-04-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133371988","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Tools for GPU Computing - Debugging and Performance Analysis of Heterogenous HPC Applications
Michael Knobloch, B. Mohr
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200105

General-purpose GPUs are now ubiquitous in high-end supercomputing. All of the announced (pre-)exascale systems but one (the Japanese Fugaku system, which is based on ARM processors) contain vast numbers of GPUs that deliver the majority of those systems' performance. Thus, GPU programming will be a necessity for application developers using high-end HPC systems. However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs, as orchestrating all the resources of such a system imposes a tremendous challenge on developers. Luckily, a rich ecosystem of tools exists to assist developers in every step of developing a GPU application at any scale. In this paper we present an overview of these tools and discuss their capabilities. We start with an overview of the GPU programming models, from low-level programming with CUDA, through pragma-based models like OpenACC, to high-level approaches like Kokkos, and we discuss their respective tool interfaces, the main mechanism by which tools obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools: debuggers and performance analysis tools. Debuggers help the developer identify problems on the CPU side, on the GPU side, and in the interplay of both. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help to increase the overall performance.
{"title":"Tools for GPU Computing - Debugging and Performance Analysis of Heterogenous HPC Applications","authors":"Michael Knobloch, B. Mohr","doi":"10.14529/jsfi200105","DOIUrl":"https://doi.org/10.14529/jsfi200105","url":null,"abstract":"General purpose GPUs are now ubiquitous in high-end supercomputing. All but one (the Japanese Fugaku system, which is based on ARM processors) of the announced (pre-)exascale systems contain vast amounts of GPUs that deliver the majority of the performance of these systems. Thus, GPU programming will be a necessity for application developers using high-end HPC systems.However, programming GPUs efficiently is an even more daunting task than traditional HPC application development. This becomes even more apparent for large-scale systems containing thousands of GPUs. Orchestrating all the resources of such a system imposes a tremendous challenge to developers. Luckily a rich ecosystem of tools exist to assist developers in every development step of a GPU application at all scales. In this paper we present an overview of these tools and discuss their capabilities. We start with an overview of different GPU programming models, from low-level with CUDA over pragma-based models like OpenACC to high-level approaches like Kokkos. We discuss their respective tool interfaces as the main method for tools to obtain information on the execution of a kernel on the GPU. The main focus of this paper is on two classes of tools, debuggers and performance analysis tools. Debuggers help the developer to identify problems both on the CPU and GPU side as well as in the interplay of both. Once the application runs correctly, performance analysis tools can be used to pinpoint bottlenecks in the execution of the code and help to increase the overall performance.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"94 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124162810","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}

Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer
Kento Aoyama, Masanori Kakuta, Yuri Matsuzaki, T. Ishida, M. Ohue, Y. Akiyama
Supercomput. Front. Innov., 2020. DOI: 10.14529/jsfi200102

Pipeline software, comprising chains of tools and applications for specific data processing, has found extensive use in the analysis of several data types in bioinformatics research, such as genome data. Recent trends in genome analysis require pipeline software to make optimum use of computational resources, so as to handle efficiently the large-scale biological data accumulated on a daily basis. However, the use of pipeline software in bioinformatics tends to be problematic owing to large memory and storage capacity requirements, an increasing number of job submissions, and a wide range of software dependencies. This paper presents massively parallel genome/exome analysis pipeline software that addresses these difficulties and can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that performs effectively by taking the task-dependency graph of internal executions into account, via an extension of a dynamic task distribution framework. Performance results for the core pipeline functionality, obtained in evaluation experiments on an actual exome dataset, demonstrate good scalability when using over a thousand nodes. Additionally, this study proposes several approaches for resolving the performance bottlenecks of a pipeline by exploiting domain knowledge of the internal pipeline executions, a major challenge facing pipeline parallelization.
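
A schematic of dependency-aware dynamic task distribution of the kind described (the stages and executor are illustrative, not the authors' implementation): each task is submitted as soon as all of its predecessors in the dependency graph have finished.

```python
# Dynamic dispatch over a task-dependency graph with a worker pool.
from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

deps = {"align": set(), "sort": {"align"}, "call": {"sort"},
        "qc": {"align"}, "report": {"call", "qc"}}   # hypothetical stages

def run(task: str) -> str:
    print("running", task)   # stand-in for invoking the real tool chain
    return task

done, running = set(), {}
with ThreadPoolExecutor(max_workers=4) as pool:
    while len(done) < len(deps):
        for task, preds in deps.items():             # submit all ready tasks
            if task not in done and task not in running and preds <= done:
                running[task] = pool.submit(run, task)
        finished, _ = wait(running.values(), return_when=FIRST_COMPLETED)
        for task in [t for t, f in running.items() if f in finished]:
            done.add(task)
            del running[task]
```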
{"title":"Development of Computational Pipeline Software for Genome/Exome Analysis on the K Computer","authors":"Kento Aoyama, Masanori Kakuta, Yuri Matsuzaki, T. Ishida, M. Ohue, Y. Akiyama","doi":"10.14529/jsfi200102","DOIUrl":"https://doi.org/10.14529/jsfi200102","url":null,"abstract":"Pipeline software that comprise tool and application chains for specific data processing have found extensive utilization in the analysis of several data types, such as genome, in bioinformatics research. Recent trends in genome analysis require use of pipeline software for optimum utilization of computational resources, thereby facilitating efficient handling of large-scale biological data accumulated on a daily basis. However, use of pipeline software in bioinformatics tends to be problematic owing to their large memory and storage capacity requirements, increasing number of job submissions, and a wide range of software dependencies. This paper presents a massive parallel genome/exome analysis pipeline software that addresses these difficulties. Additionally, it can be executed on a large number of K computer nodes. The proposed pipeline incorporates workflow management functionality that performs effectively when considering the task-dependency graph of internal executions via extension of the dynamic task distribution framework. Performance results pertaining to the core pipeline functionality, obtained via evaluation experiments performed using an actual exome dataset, demonstrate good scalability when using over a thousand nodes. Additionally, this study proposes several approaches to resolve performance bottlenecks of a pipeline by considering the domain knowledge pertaining to internal pipeline executions as a major challenge facing pipeline parallelization.","PeriodicalId":338883,"journal":{"name":"Supercomput. Front. Innov.","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2020-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123652541","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}