{"title":"1st workshop on fault-tolerance for HPC at extreme scale FTXS 2010","authors":"J. Daly, Nathan Debardeleben","doi":"10.1109/DSN.2010.5544426","DOIUrl":null,"url":null,"abstract":"With the emergence of many-core processors, accelerators, and alternative/heterogeneous architectures, the HPC community faces a new challenge: a scaling in number of processing elements that supersedes the historical trend of scaling in processor frequencies. The attendant increase in system complexity has first-order implications for fault tolerance. Mounting evidence invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multiple-point instead of single-point and interdependent instead of independent; silent failures and silent data corruption are no longer rare enough to discount; stabilization time consumes a larger fraction of useful system lifetime, with failure rates projected to exceed one per hour on the largest systems; and application interrupt rates are apparently diverging from system failure rates.","PeriodicalId":90852,"journal":{"name":"International Conference on Dependable Systems and Networks workshops : [proceedings]. International Conference on Dependable Systems and Networks","volume":"1 1","pages":"615"},"PeriodicalIF":0.0000,"publicationDate":"2010-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"International Conference on Dependable Systems and Networks workshops : [proceedings]. International Conference on Dependable Systems and Networks","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/DSN.2010.5544426","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
With the emergence of many-core processors, accelerators, and alternative/heterogeneous architectures, the HPC community faces a new challenge: scaling in the number of processing elements has supplanted the historical trend of scaling in processor frequency. The attendant increase in system complexity has first-order implications for fault tolerance. Mounting evidence invalidates traditional assumptions of HPC fault tolerance: faults are increasingly multi-point rather than single-point, and interdependent rather than independent; silent failures and silent data corruption are no longer rare enough to discount; stabilization time consumes a growing fraction of useful system lifetime, with failure rates projected to exceed one per hour on the largest systems; and application interrupt rates are apparently diverging from system failure rates.
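The projected one-failure-per-hour rate has direct consequences for checkpoint/restart, the dominant HPC fault-tolerance strategy. As a rough illustration (not taken from the workshop itself), Young's first-order approximation tau_opt = sqrt(2 * delta * M) gives the optimal compute interval between checkpoints for a checkpoint cost delta and a mean time between failures M; the sketch below uses an assumed five-minute checkpoint cost together with the abstract's one-hour MTBF projection.

```python
import math

# Young's first-order approximation for the optimal checkpoint interval:
#   tau_opt = sqrt(2 * delta * M)
# where delta is the time to write one checkpoint and M is the system's
# mean time between failures (MTBF). The numeric values below are
# illustrative assumptions, not figures reported by the workshop.

def optimal_checkpoint_interval(checkpoint_cost_s: float, mtbf_s: float) -> float:
    """Return the optimal compute time between checkpoints, in seconds."""
    return math.sqrt(2.0 * checkpoint_cost_s * mtbf_s)

if __name__ == "__main__":
    delta = 300.0   # assumed: 5 minutes to write one checkpoint
    mtbf = 3600.0   # the abstract's projection: roughly one failure per hour
    tau = optimal_checkpoint_interval(delta, mtbf)
    # First-order fraction of wall-clock time spent writing checkpoints:
    overhead = delta / (tau + delta)
    print(f"optimal interval: {tau / 60:.1f} min, "
          f"checkpoint overhead: {overhead:.1%}")
```

Under these assumed numbers the optimal interval is about 25 minutes, and checkpoint I/O alone claims roughly 17% of wall-clock time before counting lost work recomputed after each failure, which illustrates why hourly failure rates strain traditional checkpoint/restart.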