Practical formal correctness checking of million-core problem solving environments for HPC

D. C. B. D. Oliveira, Zvonimir Rakamaric, G. Gopalakrishnan, A. Humphrey, Qingyu Meng, M. Berzins
{"title":"用于高性能计算的百万核问题解决环境的实用形式正确性检查","authors":"D. C. B. D. Oliveira, Zvonimir Rakamaric, G. Gopalakrishnan, A. Humphrey, Qingyu Meng, M. Berzins","doi":"10.1109/SECSE.2013.6615102","DOIUrl":null,"url":null,"abstract":"While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.","PeriodicalId":133144,"journal":{"name":"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Practical formal correctness checking of million-core problem solving environments for HPC\",\"authors\":\"D. C. B. D. Oliveira, Zvonimir Rakamaric, G. Gopalakrishnan, A. Humphrey, Qingyu Meng, M. Berzins\",\"doi\":\"10.1109/SECSE.2013.6615102\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. 
Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.\",\"PeriodicalId\":133144,\"journal\":{\"name\":\"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SECSE.2013.6615102\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SECSE.2013.6615102","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 4

Abstract

While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.
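
The abstract's notion of "schedule perturbation" can be made concrete with a small sketch (this is not Uintah code; the task graph, task names, and delay scheme below are illustrative assumptions). The idea is to run the same task graph repeatedly while shuffling the order of ready tasks and injecting small random delays, so that rare orderings, and any ordering-dependent bugs they hide, are exercised with higher probability.

// Hypothetical sketch of schedule perturbation over a toy task graph.
// Not Uintah code: the graph, task names, and delays are made up for illustration.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <string>
#include <thread>
#include <vector>

struct Task {
    std::string name;
    std::vector<std::string> deps;  // tasks that must finish before this one
};

int main() {
    // Toy dependency graph: B and C depend on A; D depends on B and C.
    std::vector<Task> graph = {
        {"A", {}}, {"B", {"A"}}, {"C", {"A"}}, {"D", {"B", "C"}}};

    std::mt19937 rng(std::random_device{}());
    std::map<std::string, bool> done;

    while (done.size() < graph.size()) {
        // Collect tasks whose dependencies have all completed.
        std::vector<Task*> ready;
        for (auto& t : graph) {
            if (done.count(t.name)) continue;
            bool ok = true;
            for (auto& d : t.deps) ok = ok && done.count(d) > 0;
            if (ok) ready.push_back(&t);
        }
        // Perturbation: shuffle the ready set and add a small random delay,
        // so each run of the program exercises a different schedule.
        std::shuffle(ready.begin(), ready.end(), rng);
        for (Task* t : ready) {
            std::this_thread::sleep_for(std::chrono::milliseconds(rng() % 5));
            std::printf("executing task %s\n", t->name.c_str());
            done[t->name] = true;
        }
    }
    return 0;
}

In a production runtime the perturbation would be applied inside the scheduler, across MPI ranks and worker threads, rather than in a sequential loop, but the principle of biasing execution toward otherwise rare orderings is the same.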