LO-FA-MO: Fault Detection and Systemic Awareness for the QUonG Computing System

R. Ammendola, A. Biagioni, O. Frezza, F. L. Cicero, A. Lonardo, P. Paolucci, D. Rossetti, F. Simula, L. Tosoratto, P. Vicini
Published in: 2014 IEEE 33rd International Symposium on Reliable Distributed Systems (2014-10-06)
DOI: 10.1109/SRDS.2014.33 (https://doi.org/10.1109/SRDS.2014.33)
Citation count: 3

Abstract

QUonG is a parallel computing platform developed at INFN and equipped with commodity multi-core CPUs coupled with latest-generation NVIDIA GPUs. Computing nodes communicate through a point-to-point, high-performance, low-latency 3D torus network implemented by the APEnet+ FPGA-based interconnect. Scaling this cluster towards peta- and possibly exascale is a prominent line of investigation, and in this context fault tolerance issues are structural. Typical fault tolerance solutions for HPC systems (e.g. checkpoint/restart) must be triggered in order to be applied in an automated and transparent way, or at least knowledge of occurring faults must be propagated to prompt a readjustment; either way, an effective tool to detect faults and make the system aware of them is required. Thus, as a first step towards a fault-tolerant QUonG, we designed the Local Fault Monitor (LO|FA|MO), a HW/SW solution aimed at providing systemic fault awareness. LO|FA|MO detects node faults through a mutual watchdog mechanism between the host and the APEnet+ NIC; moreover, diagnostic messages can be delivered to neighbour nodes through both the 3D network and a secondary connection for service communication. The double path ensures that no fault remains unknown at the global level, guaranteeing systemic fault awareness with no single point of failure. In this paper we describe our LO|FA|MO implementation and report preliminary measurements that show its scalability and its next-to-nil impact on system performance.
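The mutual watchdog idea in the abstract can be illustrated with a small tick-based simulation. This is only a sketch under stated assumptions, not the LO|FA|MO implementation: the `Watchdog` class, the `simulate` function, the timeout value, and the rule that the NIC reports a host fault over the torus while the host reports a NIC fault over the service network are all hypothetical choices made for illustration.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Watchdog:
    """One side of a mutual watchdog (hypothetical model, not the real registers)."""
    timeout: int        # ticks of silence tolerated before the peer is declared faulty
    last_seen: int = 0  # tick of the peer's most recent heartbeat

    def heartbeat(self, now: int) -> None:
        # Record that the peer showed signs of life at tick `now`.
        self.last_seen = now

    def peer_faulty(self, now: int) -> bool:
        # The peer is considered faulty once it has been silent for too long.
        return now - self.last_seen > self.timeout


def simulate(host_dies_at: Optional[int] = None,
             nic_dies_at: Optional[int] = None,
             ticks: int = 10,
             timeout: int = 2) -> List[Tuple[int, str, str]]:
    """Return the (tick, path, message) diagnostics emitted during the run."""
    host_watches_nic = Watchdog(timeout)
    nic_watches_host = Watchdog(timeout)
    diagnostics: List[Tuple[int, str, str]] = []
    for now in range(1, ticks + 1):
        host_alive = host_dies_at is None or now < host_dies_at
        nic_alive = nic_dies_at is None or now < nic_dies_at
        # Each live side refreshes the heartbeat that its peer is watching.
        if host_alive:
            nic_watches_host.heartbeat(now)
        if nic_alive:
            host_watches_nic.heartbeat(now)
        # Assumption: a surviving NIC reports over the 3D torus, while a
        # surviving host reports over the secondary service network.
        if nic_alive and nic_watches_host.peer_faulty(now):
            diagnostics.append((now, "torus", "host fault"))
            break
        if host_alive and host_watches_nic.peer_faulty(now):
            diagnostics.append((now, "service", "nic fault"))
            break
    return diagnostics


if __name__ == "__main__":
    print(simulate(host_dies_at=4))
    print(simulate(nic_dies_at=2))
```

In this toy model a fault on either side is noticed by the other side after the timeout expires, and the survivor still has a path on which to report it. If both sides fail in the same tick, nothing is reported here; detecting that case would presumably rely on neighbour nodes, which this sketch does not model.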