
IBM Journal of Research and Development: Latest Publications

Sierra Center of Excellence: Lessons learned
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-20 · DOI: 10.1147/JRD.2019.2961069
J. P. Dahm;D. F. Richards;A. Black;A. D. Bertsch;L. Grinberg;I. Karlin;S. Kokkila-Schumacher;E. A. León;J. R. Neely;R. Pankajakshan;O. Pearce
The introduction of heterogeneous computing via GPUs from the Sierra architecture represented a significant shift in direction for computational science at Lawrence Livermore National Laboratory (LLNL), and therefore required significant preparation. Over the last five years, the Sierra Center of Excellence (CoE) has brought employees with specific expertise from IBM and NVIDIA together with LLNL in a concentrated effort to prepare applications, system software, and tools for the Sierra supercomputer. This article shares the process we applied for the CoE and documents lessons learned during the collaboration, with the hope that others will be able to learn from both our success and intermediate setbacks. We describe what we have found to work for the management of such a collaboration and best practices for algorithms and source code, system configuration and software stack, tools, and application performance.
Citations: 2
Transformation of application enablement tools on CORAL systems
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960246
S. Maerean;E. K. Lee;H.-F. Wen;I-H. Chung
The CORAL project exhibits an important shift in the computational paradigm from homogeneous to heterogeneous computing, where applications run on both the CPU and the accelerator (e.g., GPU). Existing applications optimized to run only on the CPU have to be rewritten to adopt accelerators and retuned to achieve optimal performance. The shift in the computational paradigm requires application development tools (e.g., compilers, performance profilers and tracers, and debuggers) to change to better assist users. The CORAL project places a strong emphasis on open-source tools to create a collaborative environment in the tools community. In this article, we discuss the collaboration efforts and corresponding challenges to meet the CORAL requirements on tools and detail three of the challenges that required the most involvement. A usage scenario is provided to show how the tools may help users adopt the new computation environment and understand their application execution and the data flow at scale.
Citations: 0
Call for Code: Developers tackle natural disasters with software
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960241
D. Krook;S. Malaika
Natural disasters are increasing as highlighted in many reports including the Borgen Project. In 2018, David Clark Cause as creator and IBM as founding partner, in partnership with the United Nations Human Rights Office, the American Red Cross International Team, and The Linux Foundation, issued a “Call for Code” to developers to create robust projects that prepare communities for natural disasters and help them respond more quickly in their aftermath. This article covers the steps and tools used to engage with developers, the results from the first of five competitions to be run by the Call for Code Global Initiative over five years, and how the winners were selected. Insights from the mobilization of 100,000 developers toward this cause are described, as well as the lessons learned from running large-scale hackathons.
Citations: 1
A unique approach to corporate disaster philanthropy focused on delivering technology and expertise
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960244
R. E. Curzon;P. Curotto;M. Evason;A. Failla;P. Kusterer;A. Ogawa;J. Paraszczak;S. Raghavan
The role of corporations and their corporate social responsibility (CSR)-related response to disasters in support of their communities has not been extensively documented; thus, this article attempts to explain the role that one corporation, IBM, has played in disaster response and how it has used IBM and open-source technologies to deal with a broad range of disasters. These technologies range from advanced seismic monitoring and flood management to predicting and improving refugee flows. The article outlines various principles that have guided IBM in shaping its disaster response and provides some insights into various sources of useful data and applications that can be used in these critical situations. It also details one example of an emerging technology that is being used in these efforts.
Citations: 1
The CORAL supercomputer systems
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960220
W. A. Hanson
In 2014, the U.S. Department of Energy (DoE) initiated a multiyear collaboration between Oak Ridge National Laboratory (ORNL), Argonne National Laboratory, and Lawrence Livermore National Laboratory (LLNL), known as “CORAL,” the next major phase in the DoE's scientific computing roadmap. The IBM CORAL systems are based on a fundamentally new data-centric architecture, where compute power is embedded everywhere data resides, combining powerful central processing units (CPUs) with graphics processing units (GPUs) optimized for scientific computing and artificial intelligence workloads. The IBM CORAL systems were built on the combination of mature technologies: 9th-generation POWER CPU, 6th-generation NVIDIA GPU, and 5th-generation Mellanox InfiniBand. These systems are providing scientists with computing power to solve challenges in many research areas beyond previously possible. This article provides an overview of the system solutions deployed at ORNL and LLNL.
Citations: 11
Porting a 3D seismic modeling code (SW4) to CORAL machines
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960218
R. Pankajakshan;P.-H. Lin;B. Sjögreen
Seismic waves fourth order (SW4) solves the seismic wave equations on Cartesian and curvilinear grids using large compute clusters with O (100,000) cores. This article discusses the porting of SW4 to run on the CORAL architecture using the RAJA performance portability abstraction layer. The performances of key kernels using RAJA and CUDA are compared to estimate the performance penalty of using the portability abstraction layer. Code changes required for efficiency on GPUs and minimizing time spent in Message Passing Interface (MPI) are discussed. This article describes a path for efficiently porting large code bases to GPU-based machines while avoiding the pitfalls of a new architecture in the early stages of its deployment. Current bottlenecks in the code are discussed along with possible architectural or software mitigations. SW4 runs 28× faster on one 4-GPU CORAL node than on a CTS-1 node (Dual Intel Xeon E5-2695 v4). SW4 is now in routine use on problems of unprecedented resolution (203 billion grid points) and scale on 1,200 nodes of Summit.
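As a rough illustration of the portability layer discussed above (a minimal sketch, not code from SW4), the kernel below is written once with RAJA::forall and dispatched to either a CUDA or an OpenMP execution policy at compile time. The function name, arrays, and policy parameters are hypothetical; the guard macro and policies follow the RAJA API as commonly documented, and the data are assumed to live in GPU-accessible (e.g., unified) memory when the CUDA policy is selected.

```cpp
// Minimal, hypothetical sketch of a RAJA-portable kernel (not from SW4).
// The same loop body targets the GPU or the CPU depending on the policy.
#include <RAJA/RAJA.hpp>

void scale_add(double* y, const double* x, double a, int n)
{
#if defined(RAJA_ENABLE_CUDA)
  using exec_policy = RAJA::cuda_exec<256>;          // 256-thread CUDA blocks
#else
  using exec_policy = RAJA::omp_parallel_for_exec;   // CPU fallback (needs a RAJA build with OpenMP)
#endif

  // y and x are assumed to be in memory the chosen back end can reach
  // (e.g., CUDA unified memory when the CUDA policy is active).
  RAJA::forall<exec_policy>(RAJA::RangeSegment(0, n),
      [=] RAJA_HOST_DEVICE (int i) {
        y[i] = a * x[i] + y[i];
      });
}
```

The design point this sketch mirrors is the one the article measures: the kernel body is written once, and the cost of the abstraction is assessed by comparing against a hand-written CUDA version of the same loop.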
Citations: 2
Hybrid CPU/GPU tasks optimized for concurrency in OpenMP
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960245
A. E. Eichenberger;G.-T. Bercea;A. Bataev;L. Grinberg;J. K. O'Brien
Sierra and Summit supercomputers exhibit a significant amount of intranode parallelism between the host POWER9 CPUs and their attached GPU devices. In this article, we show that exploiting device-level parallelism is key to achieving high performance by reducing overheads typically associated with CPU and GPU task execution. Moreover, manually exploiting this type of parallelism in large-scale applications is nontrivial and error-prone. We hide the complexity of exploiting this hybrid intranode parallelism using the OpenMP programming model abstraction. The implementation leverages the semantics of OpenMP tasks to express asynchronous task computations and their associated dependences. Launching tasks on the CPU threads requires a careful design of work-stealing algorithms to provide efficient load balancing among CPU threads. We propose a novel algorithm that removes locks from all task queueing operations that are on the critical path. Tasks assigned to GPU devices require additional steps such as copying input data to GPU devices, launching the computation kernels, and copying data back to the host CPU memory. We perform key optimizations to reduce the cost of these additional steps by tightly integrating data transfers and GPU computations into streams of asynchronous GPU operations. We further map high-level dependences between GPU tasks to the same asynchronous GPU streams to further avoid unnecessary synchronization. Results validate our approach.
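As a minimal, hypothetical sketch of the task-and-dependence style the abstract describes (not the authors' compiler or runtime work), the code below chains an asynchronous `target ... nowait` GPU region and a dependent CPU task through OpenMP `depend` clauses. All variable names are invented, and a compiler with OpenMP 4.5+ target offloading is assumed.

```cpp
// Hypothetical sketch of hybrid CPU/GPU tasking with OpenMP dependences.
#include <omp.h>
#include <cstdio>
#include <vector>

int main()
{
  const int n = 1 << 20;
  std::vector<double> a(n, 1.0), b(n, 2.0);
  double* pa = a.data();
  double* pb = b.data();
  double sum = 0.0;

  #pragma omp parallel
  #pragma omp single
  {
    // GPU task: 'nowait' turns the target region into a deferrable task;
    // the depend clause publishes pa as its output.
    #pragma omp target teams distribute parallel for \
            map(tofrom: pa[0:n]) map(to: pb[0:n]) nowait depend(out: pa[0])
    for (int i = 0; i < n; ++i)
      pa[i] += pb[i];

    // CPU task: scheduled only after the GPU task producing pa completes.
    #pragma omp task depend(in: pa[0]) shared(sum)
    {
      double s = 0.0;
      for (int i = 0; i < n; ++i) s += pa[i];
      sum = s;
    }

    #pragma omp taskwait   // wait for both sibling tasks
  }

  std::printf("sum = %f\n", sum);
  return 0;
}
```

The point of expressing both pieces as tasks, as the article argues, is that the runtime can overlap transfers, GPU kernels, and CPU work while the dependences keep the ordering correct without explicit synchronization in user code.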
Citations: 1
Quantitative modeling in disaster management: A literature review
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960356
A. E. Baxter;H. E. Wilborn Lagerman;P. Keskinocak
The number, magnitude, complexity, and impact of natural disasters have been steadily increasing in various parts of the world. When preparing for, responding to, and recovering from a disaster, multiple organizations make decisions and take actions considering the needs, available resources, and priorities of the affected communities, emergency supply chains, and infrastructures. Most of the prior research focuses on decision-making for independent systems (e.g., single critical infrastructure networks or distinct relief resources). An emerging research area extends the focus to interdependent systems (i.e., multiple dependent networks or resources). In this article, we survey the literature on modeling approaches for disaster management problems on independent systems, discuss some recent work on problems involving demand, resource, and/or network interdependencies, and offer future research directions to add to this growing research area.
Citations: 12
Troubleshooting deep-learner training data problems using an evolutionary algorithm on Summit
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-17 · DOI: 10.1147/JRD.2019.2960225
M. Coletti;A. Fafard;D. Page
Architectural and hyperparameter design choices can influence deep-learner (DL) model fidelity but can also be affected by malformed training and validation data. However, practitioners may spend significant time refining layers and hyperparameters before discovering that distorted training data were impeding the training progress. We found that an evolutionary algorithm (EA) can be used to troubleshoot this kind of DL problem. An EA evaluated thousands of DL configurations on Summit that yielded no overall improvement in DL performance, which suggested problems with the training and validation data. We suspected that contrast limited adaptive histogram equalization enhancement that was applied to previously generated digital surface models, for which we were training DLs to find errors, had damaged the training data. Subsequent runs with an alternative global normalization yielded significantly improved DL performance. However, the DL intersection over unions still exhibited consistent subpar performance, which suggested further problems with the training data and DL approach. Nonetheless, we were able to diagnose this problem within a 12-hour span via Summit runs, which prevented several weeks of unproductive trial-and-error DL configuration refinement and allowed for a more timely convergence on an ultimately viable solution.
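To make the search procedure concrete, here is a deliberately tiny, hypothetical mutation-plus-truncation-selection loop of the kind the abstract calls an evolutionary algorithm. It is not the authors' code: `evaluate` is a stand-in for training and scoring a real deep learner, and every name, range, and population size is invented. The diagnostic idea is visible in the loop: if the best fitness stays flat generation after generation, the data or the objective, not the hyperparameters, is the likely culprit.

```cpp
// Hypothetical sketch of an evolutionary search over deep-learner
// hyperparameters (not the authors' implementation).
#include <algorithm>
#include <cstddef>
#include <cstdio>
#include <random>
#include <vector>

struct Config {        // one candidate DL configuration
  int    layers;       // e.g., number of convolutional layers
  double learning_rate;
  double fitness;      // validation score filled in by evaluate()
};

// Stand-in fitness: in practice this would train the model and return a
// validation metric such as intersection over union.
double evaluate(const Config& c, std::mt19937& rng) {
  std::normal_distribution<double> noise(0.0, 0.05);
  return 1.0 / (1.0 + c.learning_rate * c.layers) + noise(rng);
}

int main() {
  std::mt19937 rng(42);
  std::uniform_int_distribution<int>     layer_dist(2, 12);
  std::uniform_real_distribution<double> lr_dist(1e-4, 1e-1);
  std::uniform_int_distribution<int>     step(-1, 1);
  std::normal_distribution<double>       scale(0.0, 0.2);

  std::vector<Config> pop(32);                        // random initial population
  for (auto& c : pop) c = {layer_dist(rng), lr_dist(rng), 0.0};

  for (int gen = 0; gen < 20; ++gen) {
    for (auto& c : pop) c.fitness = evaluate(c, rng);
    std::sort(pop.begin(), pop.end(),
              [](const Config& a, const Config& b) { return a.fitness > b.fitness; });

    // A best fitness that never improves across generations points at the
    // training data or the objective rather than the hyperparameters.
    std::printf("gen %2d  best fitness %.4f\n", gen, pop.front().fitness);

    // Keep the top half; refill the bottom half with mutated survivors.
    for (std::size_t i = pop.size() / 2; i < pop.size(); ++i) {
      Config child = pop[i - pop.size() / 2];
      child.layers        = std::max(1, child.layers + step(rng));
      child.learning_rate = std::max(1e-5, child.learning_rate * (1.0 + scale(rng)));
      pop[i] = child;
    }
  }
  return 0;
}
```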
Citations: 6
Summit and Sierra supercomputer cooling solutions
IF 1.3 · CAS Zone 4, Computer Science · Q1 Computer Science · Pub Date: 2019-12-10 · DOI: 10.1147/JRD.2019.2958902
S. Tian;T. Takken;V. Mahaney;C. Marroquin;M. Schultz;M. Hoffmeyer;Y. Yao;K. O'Connell;A. Yuksel;P. Coteus
Achieving optimal data center cooling efficiency requires effective water cooling of high-heat-density components, coupled with optimal warmer water temperatures and the correct order of water preheating from any air-cooled components. The Summit and Sierra supercomputers implemented efficient cooling by using high-performance cold plates to directly water-cool all central processing units (CPUs) and graphics processing units (GPUs) processors with warm inlet water. Cost performance was maximized by directly air-cooling the 10% to 15% of the compute drawer heat load generated by the lowest heat density components. For the Summit system, a rear-door heat exchanger allowed zero net heat load to air; the overall system efficiency was optimized by using the preheated water from the heat exchanger as an input to cool the higher power CPUs and GPUs.
Citations: 2