Pub Date: 2019-12-25 | DOI: 10.1147/JRD.2019.2962452
R. Islam;G. Shah
A high-performance, scalable, and resilient storage subsystem is essential for delivering and maintaining the consistent performance and high utilization expected of a modern supercomputer. IBM delivered two systems under the CORAL program, both of which used IBM Spectrum Scale and IBM Elastic Storage Server (ESS) as the storage solution. The larger of the two CORAL clusters is composed of 77 ESS building blocks, each of which consists of a pair of high-performance I/O server nodes connected to four high-density storage enclosures. These ESS building blocks are interconnected via a redundant InfiniBand EDR network to form a storage cluster that provides a global namespace aggregating the performance of over 32,000 commodity disks. IBM Spectrum Scale for ESS runs high-performance erasure coding on each building block and provides a single global namespace across all the building blocks, and its features, including recent improvements in buffer management and low-latency communication, deliver a highly resilient, high-performance storage subsystem on ESS. CORAL I/O performance results include large-block streaming throughput of over 2.4 TB/s, the ability to create over 1 M 32-KB files per second, and an aggregate rate of 30 K zero-length file creates per second in a shared directory from multiple nodes. This article describes the design and implementation of the ESS storage cluster; the innovations required to meet the performance, scale, manageability, and reliability goals; and the challenges we had to overcome as we deployed a system of such unprecedented I/O capabilities.
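As a rough, illustrative consistency check on these figures (this calculation is ours, not the article's), the aggregate streaming number implies the sustained rate each of the 77 building blocks contributes on average:

```latex
\frac{2.4~\text{TB/s}}{77~\text{building blocks}} \approx 31~\text{GB/s per building block}
```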
{"title":"Building a high-performance resilient scalable storage cluster for CORAL using IBM ESS","authors":"R. Islam;G. Shah","doi":"10.1147/JRD.2019.2962452","DOIUrl":"https://doi.org/10.1147/JRD.2019.2962452","url":null,"abstract":"A high-performance, scalable, and resilient storage subsystem is essential for delivering and maintaining consistent performance and high utilization expected from a modern supercomputer. IBM delivered two systems under the CORAL program, both of which used IBM Spectrum Scale and IBM Elastic Storage Server (ESS) as the storage solution. The larger of the two CORAL clusters is composed of 77 building blocks of ESS, each of which consists of a pair of high-performance I/O Server nodes connected to four high-density storage enclosures. These ESS building blocks are interconnected via a redundant InfiniBand EDR network to form a storage cluster that provides a global namespace aggregating performance over 32,000 commodity disks. The IBM Spectrum Scale for ESS runs high-performance erasure coding on each building block and provides a single global name space across all the building blocks. The IBM Spectrum Scale features deliver a highly resilient, high-performance storage subsystem using ESS. These features include recent improvements for efficient buffer management and fast efficient low-latency communication. CORAL I/O performance results include large-block streaming throughput of over 2.4 TB/s, ability to create over 1 M 32-KB files per second, and enabling an aggregate rate of 30 K zero-length file creates per second in a shared directory from multiple nodes. This article describes the design and implementation of the ESS storage cluster; the innovations required to meet the performance, scale, manageability, and reliability goals; and challenges we had to overcome as we deployed a system of such unprecedented I/O capabilities.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"4:1-4:9"},"PeriodicalIF":1.3,"publicationDate":"2019-12-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-20 | DOI: 10.1147/JRD.2019.2961069
J. P. Dahm;D. F. Richards;A. Black;A. D. Bertsch;L. Grinberg;I. Karlin;S. Kokkila-Schumacher;E. A. León;J. R. Neely;R. Pankajakshan;O. Pearce
The introduction of heterogeneous computing via GPUs in the Sierra architecture represented a significant shift in direction for computational science at Lawrence Livermore National Laboratory (LLNL) and therefore required significant preparation. Over the last five years, the Sierra Center of Excellence (CoE) has brought employees with specific expertise from IBM and NVIDIA together with LLNL in a concentrated effort to prepare applications, system software, and tools for the Sierra supercomputer. This article shares the process we applied for the CoE and documents lessons learned during the collaboration, with the hope that others will be able to learn from both our successes and our intermediate setbacks. We describe what we have found to work for the management of such a collaboration, as well as best practices for algorithms and source code, system configuration and software stack, tools, and application performance.
{"title":"Sierra Center of Excellence: Lessons learned","authors":"J. P. Dahm;D. F. Richards;A. Black;A. D. Bertsch;L. Grinberg;I. Karlin;S. Kokkila-Schumacher;E. A. León;J. R. Neely;R. Pankajakshan;O. Pearce","doi":"10.1147/JRD.2019.2961069","DOIUrl":"https://doi.org/10.1147/JRD.2019.2961069","url":null,"abstract":"The introduction of heterogeneous computing via GPUs from the Sierra architecture represented a significant shift in direction for computational science at Lawrence Livermore National Laboratory (LLNL), and therefore required significant preparation. Over the last five years, the Sierra Center of Excellence (CoE) has brought employees with specific expertise from IBM and NVIDIA together with LLNL in a concentrated effort to prepare applications, system software, and tools for the Sierra supercomputer. This article shares the process we applied for the CoE and documents lessons learned during the collaboration, with the hope that others will be able to learn from both our success and intermediate setbacks. We describe what we have found to work for the management of such a collaboration and best practices for algorithms and source code, system configuration and software stack, tools, and application performance.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"2:1-2:14"},"PeriodicalIF":1.3,"publicationDate":"2019-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2961069","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948809","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960246
S. Maerean;E. K. Lee;H.-F. Wen;I-H. Chung
The CORAL project represents an important shift in the computational paradigm from homogeneous to heterogeneous computing, where applications run on both the CPU and the accelerator (e.g., the GPU). Existing applications optimized to run only on the CPU have to be rewritten to adopt accelerators and retuned to achieve optimal performance. This shift in the computational paradigm requires that application development tools (e.g., compilers, performance profilers and tracers, and debuggers) change to better assist users. The CORAL project places a strong emphasis on open-source tools to create a collaborative environment in the tools community. In this article, we discuss the collaboration efforts and corresponding challenges in meeting the CORAL requirements on tools and detail three of the challenges that required the most involvement. A usage scenario is provided to show how the tools may help users adopt the new computing environment and understand their application execution and data flow at scale.
{"title":"Transformation of application enablement tools on CORAL systems","authors":"S. Maerean;E. K. Lee;H.-F. Wen;I-H. Chung","doi":"10.1147/JRD.2019.2960246","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960246","url":null,"abstract":"The CORAL project exhibits an important shift in the computational paradigm from homogeneous to heterogeneous computing, where applications run on both the CPU and the accelerator (e.g., GPU). Existing applications optimized to run only on the CPU have to be rewritten to adopt accelerators and retuned to achieve optimal performance. The shift in the computational paradigm requires application development tools (e.g., compilers, performance profilers and tracers, and debuggers) change to better assist users. The CORAL project places a strong emphasis on open-source tools to create a collaborative environment in the tools community. In this article, we discuss the collaboration efforts and corresponding challenges to meet the CORAL requirements on tools and detail three of the challenges that required the most involvement. A usage scenario is provided to show how the tools may help users adopt the new computation environment and understand their application execution and the data flow at scale.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"16:1-16:12"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960246","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948703","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960241
D. Krook;S. Malaika
Natural disasters are increasing, as highlighted in many reports, including those of the Borgen Project. In 2018, David Clark Cause, as creator, and IBM, as founding partner, in partnership with the United Nations Human Rights Office, the American Red Cross International Team, and The Linux Foundation, issued a “Call for Code” to developers to create robust projects that prepare communities for natural disasters and help them respond more quickly in their aftermath. This article covers the steps and tools used to engage with developers, the results from the first of five competitions to be run by the Call for Code Global Initiative over five years, and how the winners were selected. Insights from the mobilization of 100,000 developers toward this cause are described, as well as lessons learned from running large-scale hackathons.
{"title":"Call for Code: Developers tackle natural disasters with software","authors":"D. Krook;S. Malaika","doi":"10.1147/JRD.2019.2960241","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960241","url":null,"abstract":"Natural disasters are increasing as highlighted in many reports including the Borgen Project. In 2018, David Clark Cause as creator and IBM as founding partner, in partnership with the United Nations Human Rights Office, the American Red Cross International Team, and The Linux Foundation, issued a “Call for Code” to developers to create robust projects that prepare communities for natural disasters and help them respond more quickly in their aftermath. This article covers the steps and tools used to engage with developers, the results from the first of five competitions to be run by the Call for Code Global Initiative over five years, and how the winners were selected. Insights from the mobilization of 100,000 developers toward this cause are described, as well as the lessons learned from running large-scale hackathons.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 1/2","pages":"4:1-4:8"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960241","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49980047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960244
R. E. Curzon;P. Curotto;M. Evason;A. Failla;P. Kusterer;A. Ogawa;J. Paraszczak;S. Raghavan
The role of corporations and their corporate social responsibility (CSR)-related response to disasters in support of their communities has not been extensively documented; thus, this article attempts to explain the role that one corporation, IBM, has played in disaster response and how it has used IBM and open-source technologies to deal with a broad range of disasters. These technologies range from advanced seismic monitoring and flood management to predicting and improving refugee flows. The article outlines various principles that have guided IBM in shaping its disaster response and provides some insights into various sources of useful data and applications that can be used in these critical situations. It also details one example of an emerging technology that is being used in these efforts.
{"title":"A unique approach to corporate disaster philanthropy focused on delivering technology and expertise","authors":"R. E. Curzon;P. Curotto;M. Evason;A. Failla;P. Kusterer;A. Ogawa;J. Paraszczak;S. Raghavan","doi":"10.1147/JRD.2019.2960244","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960244","url":null,"abstract":"The role of corporations and their corporate social responsibility (CSR)-related response to disasters in support of their communities has not been extensively documented; thus, this article attempts to explain the role that one corporation, IBM, has played in disaster response and how it has used IBM and open-source technologies to deal with a broad range of disasters. These technologies range from advanced seismic monitoring and flood management to predicting and improving refugee flows. The article outlines various principles that have guided IBM in shaping its disaster response and provides some insights into various sources of useful data and applications that can be used in these critical situations. It also details one example of an emerging technology that is being used in these efforts.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 1/2","pages":"2:1-2:14"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960244","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49986743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960220
W. A. Hanson
In 2014, the U.S. Department of Energy (DoE) initiated a multiyear collaboration among Oak Ridge National Laboratory (ORNL), Argonne National Laboratory, and Lawrence Livermore National Laboratory (LLNL), known as “CORAL,” the next major phase in the DoE's scientific computing roadmap. The IBM CORAL systems are based on a fundamentally new data-centric architecture, in which compute power is embedded everywhere data resides, combining powerful central processing units (CPUs) with graphics processing units (GPUs) optimized for scientific computing and artificial intelligence workloads. The IBM CORAL systems were built on a combination of mature technologies: the 9th-generation POWER CPU, 6th-generation NVIDIA GPU, and 5th-generation Mellanox InfiniBand. These systems are providing scientists with computing power to solve challenges in many research areas beyond what was previously possible. This article provides an overview of the system solutions deployed at ORNL and LLNL.
{"title":"The CORAL supercomputer systems","authors":"W. A. Hanson","doi":"10.1147/JRD.2019.2960220","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960220","url":null,"abstract":"In 2014, the U.S. Department of Energy (DoE) initiated a multiyear collaboration between Oak Ridge National Laboratory (ORNL), Argonne National Laboratory, and Lawrence Livermore National Laboratory (LLNL), known as “CORAL,” the next major phase in the DoE's scientific computing roadmap. The IBM CORAL systems are based on a fundamentally new data-centric architecture, where compute power is embedded everywhere data resides, combining powerful central processing units (CPUs) with graphics processing units (GPUs) optimized for scientific computing and artificial intelligence workloads. The IBM CORAL systems were built on the combination of mature technologies: 9th-generation POWER CPU, 6th-generation NVIDIA GPU, and 5th-generation Mellanox InfiniBand. These systems are providing scientists with computing power to solve challenges in many research areas beyond previously possible. This article provides an overview of the system solutions deployed at ORNL and LLNL.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"1:1-1:10"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960220","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49978542","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960218
R. Pankajakshan;P.-H. Lin;B. Sjögreen
Seismic Waves, fourth order (SW4) solves the seismic wave equations on Cartesian and curvilinear grids using large compute clusters with O(100,000) cores. This article discusses the porting of SW4 to the CORAL architecture using the RAJA performance portability abstraction layer. The performance of key kernels using RAJA and CUDA is compared to estimate the performance penalty of using the portability abstraction layer. Code changes required for efficiency on GPUs and for minimizing time spent in the Message Passing Interface (MPI) are discussed. This article describes a path for efficiently porting large code bases to GPU-based machines while avoiding the pitfalls of a new architecture in the early stages of its deployment. Current bottlenecks in the code are discussed, along with possible architectural or software mitigations. SW4 runs 28× faster on one 4-GPU CORAL node than on a CTS-1 node (dual Intel Xeon E5-2695 v4). SW4 is now in routine use on problems of unprecedented resolution (203 billion grid points) and scale, running on 1,200 nodes of Summit.
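The abstract describes writing kernels once against the RAJA abstraction and retargeting them at CUDA or CPU back ends. The sketch below illustrates that pattern with a generic axpy-style loop; the loop body, the 256-thread block size, and the function name are our assumptions, not SW4's actual kernel code.

```cpp
// Minimal sketch of the RAJA portability pattern discussed above.
#include <RAJA/RAJA.hpp>

// Pick the back end at compile time; the same loop body runs on GPU or CPU.
#if defined(RAJA_ENABLE_CUDA)
using exec_policy = RAJA::cuda_exec<256>;   // 256 threads per block (an assumed tuning choice)
#else
using exec_policy = RAJA::seq_exec;         // CPU fallback; an OpenMP policy could be used instead
#endif

// Hypothetical update kernel standing in for an SW4 stencil sweep.
void scale_and_add(double* u, const double* v, double a, int n)
{
  RAJA::forall<exec_policy>(RAJA::TypedRangeSegment<int>(0, n),
    [=] RAJA_HOST_DEVICE (int i) {
      u[i] += a * v[i];
    });
}
```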
{"title":"Porting a 3D seismic modeling code (SW4) to CORAL machines","authors":"R. Pankajakshan;P.-H. Lin;B. Sjögreen","doi":"10.1147/JRD.2019.2960218","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960218","url":null,"abstract":"Seismic waves fourth order (SW4) solves the seismic wave equations on Cartesian and curvilinear grids using large compute clusters with O (100,000) cores. This article discusses the porting of SW4 to run on the CORAL architecture using the RAJA performance portability abstraction layer. The performances of key kernels using RAJA and CUDA are compared to estimate the performance penalty of using the portability abstraction layer. Code changes required for efficiency on GPUs and minimizing time spent in Message Passing Interface (MPI) are discussed. This article describes a path for efficiently porting large code bases to GPU-based machines while avoiding the pitfalls of a new architecture in the early stages of its deployment. Current bottlenecks in the code are discussed along with possible architectural or software mitigations. SW4 runs 28× faster on one 4-GPU CORAL node than on a CTS-1 node (Dual Intel Xeon E5-2695 v4). SW4 is now in routine use on problems of unprecedented resolution (203 billion grid points) and scale on 1,200 nodes of Summit.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"17:1-17:11"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960218","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948704","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960245
A. E. Eichenberger;G.-T. Bercea;A. Bataev;L. Grinberg;J. K. O'Brien
Sierra and Summit supercomputers exhibit a significant amount of intranode parallelism between the host POWER9 CPUs and their attached GPU devices. In this article, we show that exploiting device-level parallelism is key to achieving high performance by reducing overheads typically associated with CPU and GPU task execution. Moreover, manually exploiting this type of parallelism in large-scale applications is nontrivial and error-prone. We hide the complexity of exploiting this hybrid intranode parallelism using the OpenMP programming model abstraction. The implementation leverages the semantics of OpenMP tasks to express asynchronous task computations and their associated dependences. Launching tasks on the CPU threads requires a careful design of work-stealing algorithms to provide efficient load balancing among CPU threads. We propose a novel algorithm that removes locks from all task queueing operations that are on the critical path. Tasks assigned to GPU devices require additional steps such as copying input data to GPU devices, launching the computation kernels, and copying data back to the host CPU memory. We perform key optimizations to reduce the cost of these additional steps by tightly integrating data transfers and GPU computations into streams of asynchronous GPU operations. We further map high-level dependences between GPU tasks to the same asynchronous GPU streams to further avoid unnecessary synchronization. Results validate our approach.
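The abstract describes ordering CPU tasks and asynchronous GPU offloads through OpenMP task dependences. The sketch below shows that idea using standard OpenMP 4.5 task and target constructs; it is a generic illustration under our own assumptions (array names, sizes, and trivial loop bodies) and does not reflect IBM's runtime internals or the paper's lock-free queueing algorithm.

```cpp
// Illustrative host/device tasking sketch using standard OpenMP constructs.
#include <vector>

void hybrid_step(std::vector<double>& a, std::vector<double>& b)
{
  double* pa = a.data();
  double* pb = b.data();
  const int n = static_cast<int>(a.size());

  #pragma omp parallel
  #pragma omp single
  {
    // CPU task that produces pa.
    #pragma omp task depend(out: pa[0:n])
    for (int i = 0; i < n; ++i) pa[i] = 1.0;

    // GPU task that consumes pa and produces pb; 'nowait' makes the offload
    // asynchronous, and the depend clauses order it after the CPU task above.
    #pragma omp target teams distribute parallel for \
        map(to: pa[0:n]) map(from: pb[0:n]) nowait \
        depend(in: pa[0:n]) depend(out: pb[0:n])
    for (int i = 0; i < n; ++i) pb[i] = 2.0 * pa[i];

    // CPU task that waits for the GPU result through the same dependence chain.
    #pragma omp task depend(in: pb[0:n])
    for (int i = 0; i < n; ++i) pb[i] += 1.0;

    #pragma omp taskwait
  }
}
```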
{"title":"Hybrid CPU/GPU tasks optimized for concurrency in OpenMP","authors":"A. E. Eichenberger;G.-T. Bercea;A. Bataev;L. Grinberg;J. K. O'Brien","doi":"10.1147/JRD.2019.2960245","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960245","url":null,"abstract":"Sierra and Summit supercomputers exhibit a significant amount of intranode parallelism between the host POWER9 CPUs and their attached GPU devices. In this article, we show that exploiting device-level parallelism is key to achieving high performance by reducing overheads typically associated with CPU and GPU task execution. Moreover, manually exploiting this type of parallelism in large-scale applications is nontrivial and error-prone. We hide the complexity of exploiting this hybrid intranode parallelism using the OpenMP programming model abstraction. The implementation leverages the semantics of OpenMP tasks to express asynchronous task computations and their associated dependences. Launching tasks on the CPU threads requires a careful design of work-stealing algorithms to provide efficient load balancing among CPU threads. We propose a novel algorithm that removes locks from all task queueing operations that are on the critical path. Tasks assigned to GPU devices require additional steps such as copying input data to GPU devices, launching the computation kernels, and copying data back to the host CPU memory. We perform key optimizations to reduce the cost of these additional steps by tightly integrating data transfers and GPU computations into streams of asynchronous GPU operations. We further map high-level dependences between GPU tasks to the same asynchronous GPU streams to further avoid unnecessary synchronization. Results validate our approach.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"13:1-13:14"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960245","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948700","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960356
A. E. Baxter;H. E. Wilborn Lagerman;P. Keskinocak
The number, magnitude, complexity, and impact of natural disasters have been steadily increasing in various parts of the world. When preparing for, responding to, and recovering from a disaster, multiple organizations make decisions and take actions considering the needs, available resources, and priorities of the affected communities, emergency supply chains, and infrastructures. Most of the prior research focuses on decision-making for independent systems (e.g., single critical infrastructure networks or distinct relief resources). An emerging research area extends the focus to interdependent systems (i.e., multiple dependent networks or resources). In this article, we survey the literature on modeling approaches for disaster management problems on independent systems, discuss some recent work on problems involving demand, resource, and/or network interdependencies, and offer future research directions to add to this growing research area.
{"title":"Quantitative modeling in disaster management: A literature review","authors":"A. E. Baxter;H. E. Wilborn Lagerman;P. Keskinocak","doi":"10.1147/JRD.2019.2960356","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960356","url":null,"abstract":"The number, magnitude, complexity, and impact of natural disasters have been steadily increasing in various parts of the world. When preparing for, responding to, and recovering from a disaster, multiple organizations make decisions and take actions considering the needs, available resources, and priorities of the affected communities, emergency supply chains, and infrastructures. Most of the prior research focuses on decision-making for independent systems (e.g., single critical infrastructure networks or distinct relief resources). An emerging research area extends the focus to interdependent systems (i.e., multiple dependent networks or resources). In this article, we survey the literature on modeling approaches for disaster management problems on independent systems, discuss some recent work on problems involving demand, resource, and/or network interdependencies, and offer future research directions to add to this growing research area.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 1/2","pages":"3:1-3:13"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960356","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49980046","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2019-12-17 | DOI: 10.1147/JRD.2019.2960225
M. Coletti;A. Fafard;D. Page
Architectural and hyperparameter design choices can influence deep-learner (DL) model fidelity, but fidelity can also be affected by malformed training and validation data. However, practitioners may spend significant time refining layers and hyperparameters before discovering that distorted training data were impeding the training progress. We found that an evolutionary algorithm (EA) can be used to troubleshoot this kind of DL problem. An EA evaluated thousands of DL configurations on Summit that yielded no overall improvement in DL performance, which suggested problems with the training and validation data. We suspected that the contrast-limited adaptive histogram equalization (CLAHE) enhancement applied to previously generated digital surface models, in which we were training DLs to find errors, had damaged the training data. Subsequent runs with an alternative global normalization yielded significantly improved DL performance. However, the DL intersection-over-union scores still exhibited consistently subpar performance, which suggested further problems with the training data and the DL approach. Nonetheless, we were able to diagnose this problem within a 12-hour span via Summit runs, which prevented several weeks of unproductive trial-and-error DL configuration refinement and allowed a more timely convergence on an ultimately viable solution.
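To make the two preprocessing paths mentioned above concrete, the sketch below contrasts CLAHE (local, tile-based equalization, the suspected culprit) with a simple global min-max normalization. It uses OpenCV for brevity and is our own illustration under assumed parameters; the authors' actual pipeline and choice of global normalization are not specified in the abstract.

```cpp
// Illustrative comparison of CLAHE versus a global normalization for a
// single-channel digital surface model (DSM) raster; not the authors' code.
#include <opencv2/opencv.hpp>

cv::Mat preprocess(const cv::Mat& dsm, bool use_clahe)
{
  cv::Mat out;
  if (use_clahe) {
    // Contrast-limited adaptive histogram equalization (local, tile-based).
    // Clip limit and tile size are assumed values.
    auto clahe = cv::createCLAHE(/*clipLimit=*/2.0, /*tileGridSize=*/cv::Size(8, 8));
    clahe->apply(dsm, out);   // expects an 8- or 16-bit single-channel image
  } else {
    // Global min-max normalization to [0, 1], one plausible "global normalization".
    cv::normalize(dsm, out, 0.0, 1.0, cv::NORM_MINMAX, CV_32F);
  }
  return out;
}
```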
{"title":"Troubleshooting deep-learner training data problems using an evolutionary algorithm on Summit","authors":"M. Coletti;A. Fafard;D. Page","doi":"10.1147/JRD.2019.2960225","DOIUrl":"https://doi.org/10.1147/JRD.2019.2960225","url":null,"abstract":"Architectural and hyperparameter design choices can influence deep-learner (DL) model fidelity but can also be affected by malformed training and validation data. However, practitioners may spend significant time refining layers and hyperparameters before discovering that distorted training data were impeding the training progress. We found that an evolutionary algorithm (EA) can be used to troubleshoot this kind of DL problem. An EA evaluated thousands of DL configurations on Summit that yielded no overall improvement in DL performance, which suggested problems with the training and validation data. We suspected that contrast limited adaptive histogram equalization enhancement that was applied to previously generated digital surface models, for which we were training DLs to find errors, had damaged the training data. Subsequent runs with an alternative global normalization yielded significantly improved DL performance. However, the DL intersection over unions still exhibited consistent subpar performance, which suggested further problems with the training data and DL approach. Nonetheless, we were able to diagnose this problem within a 12-hour span via Summit runs, which prevented several weeks of unproductive trial-and-error DL configuration refinement and allowed for a more timely convergence on an ultimately viable solution.","PeriodicalId":55034,"journal":{"name":"IBM Journal of Research and Development","volume":"64 3/4","pages":"1-12"},"PeriodicalIF":1.3,"publicationDate":"2019-12-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1147/JRD.2019.2960225","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"49948705","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}