Practical formal correctness checking of million-core problem solving environments for HPC

D. C. B. D. Oliveira, Zvonimir Rakamaric, G. Gopalakrishnan, A. Humphrey, Qingyu Meng, M. Berzins
{"title":"用于高性能计算的百万核问题解决环境的实用形式正确性检查","authors":"D. C. B. D. Oliveira, Zvonimir Rakamaric, G. Gopalakrishnan, A. Humphrey, Qingyu Meng, M. Berzins","doi":"10.1109/SECSE.2013.6615102","DOIUrl":null,"url":null,"abstract":"While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.","PeriodicalId":133144,"journal":{"name":"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2013-05-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"4","resultStr":"{\"title\":\"Practical formal correctness checking of million-core problem solving environments for HPC\",\"authors\":\"D. C. B. D. Oliveira, Zvonimir Rakamaric, G. Gopalakrishnan, A. Humphrey, Qingyu Meng, M. Berzins\",\"doi\":\"10.1109/SECSE.2013.6615102\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. 
Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.\",\"PeriodicalId\":133144,\"journal\":{\"name\":\"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)\",\"volume\":\"31 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2013-05-18\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"4\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/SECSE.2013.6615102\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2013 5th International Workshop on Software Engineering for Computational Science and Engineering (SE-CSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/SECSE.2013.6615102","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Cited by: 4

Abstract

While formal correctness checking methods have been deployed at scale in a number of important practical domains, we believe that such an experiment has yet to occur in the domain of high performance computing at the scale of a million CPU cores. This paper presents preliminary results from the Uintah Runtime Verification (URV) project that has been launched with this objective. Uintah is an asynchronous task-graph based problem-solving environment that has shown promising results on problems as diverse as fluid-structure interaction and turbulent combustion at well over 200K cores to date. Uintah has been tested on leading platforms such as Kraken, Keenland, and Titan consisting of multicore CPUs and GPUs, incorporates several innovative design features, and is following a roadmap for development well into the million core regime. The main results from the URV project to date are crystallized in two observations: (1) A diverse array of well-known ideas from lightweight formal methods and testing/observing HPC systems at scale have an excellent chance of succeeding. The real challenges are in finding out exactly which combinations of ideas to deploy, and where. (2) Large-scale problem solving environments for HPC must be designed such that they can be “crashed early” (at smaller scales of deployment) and “crashed often” (have effective ways of input generation and schedule perturbation that cause vulnerabilities to be attacked with higher probability). Furthermore, following each crash, one must “explain well” (given the extremely obscure ways in which an error finally manifests itself, we must develop ways to record information leading up to the crash in informative ways, to minimize offsite debugging burden). Our plans to achieve these goals and to measure our success are described. We also highlight some of the broadly applicable concepts and approaches.
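
The abstract's notion of "schedule perturbation" can be made concrete with a small sketch (this is not Uintah code; the task graph, task names, and delay scheme below are illustrative assumptions). The idea is to run the same task graph repeatedly while shuffling the order of ready tasks and injecting small random delays, so that rare orderings, and any ordering-dependent bugs they hide, are exercised with higher probability.

// Hypothetical sketch of schedule perturbation over a toy task graph.
// Not Uintah code: the graph, task names, and delays are made up for illustration.
#include <algorithm>
#include <chrono>
#include <cstdio>
#include <map>
#include <random>
#include <string>
#include <thread>
#include <vector>

struct Task {
    std::string name;
    std::vector<std::string> deps;  // tasks that must finish before this one
};

int main() {
    // Toy dependency graph: B and C depend on A; D depends on B and C.
    std::vector<Task> graph = {
        {"A", {}}, {"B", {"A"}}, {"C", {"A"}}, {"D", {"B", "C"}}};

    std::mt19937 rng(std::random_device{}());
    std::map<std::string, bool> done;

    while (done.size() < graph.size()) {
        // Collect tasks whose dependencies have all completed.
        std::vector<Task*> ready;
        for (auto& t : graph) {
            if (done.count(t.name)) continue;
            bool ok = true;
            for (auto& d : t.deps) ok = ok && done.count(d) > 0;
            if (ok) ready.push_back(&t);
        }
        // Perturbation: shuffle the ready set and add a small random delay,
        // so each run of the program exercises a different schedule.
        std::shuffle(ready.begin(), ready.end(), rng);
        for (Task* t : ready) {
            std::this_thread::sleep_for(std::chrono::milliseconds(rng() % 5));
            std::printf("executing task %s\n", t->name.c_str());
            done[t->name] = true;
        }
    }
    return 0;
}

In a production runtime the perturbation would be applied inside the scheduler, across MPI ranks and worker threads, rather than in a sequential loop, but the principle of biasing execution toward otherwise rare orderings is the same.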