DeLIA: A Dependability Library for Iterative Applications applied to parallel geophysical problems

Computers & Geosciences (IF 4.2; Zone 2, Earth Sciences; JCR Q1, Computer Science, Interdisciplinary Applications) Pub Date: 2024-06-25 DOI: 10.1016/j.cageo.2024.105662
Carla Santana, Ramon C.F. Araújo, Idalmis Milian Sardina, Ítalo A.S. Assis, Tiago Barros, Calebe P. Bianchini, Antonio D. de S. Oliveira, João M. de Araújo, Hervé Chauris, Claude Tadonki, Samuel Xavier-de-Souza
Computers & Geosciences, Volume 191, Article 105662. Full text: https://www.sciencedirect.com/science/article/pii/S0098300424001456

Abstract

Many geophysical imaging applications, such as full-waveform inversion, rely on high-performance computing to meet their demanding computational requirements. The failure of a subset of compute nodes during the execution of such applications can have a significant impact, as it may take several days or even weeks to recover the lost computation. Mitigating the consequences of these failures requires effective fault tolerance techniques that neither introduce substantial overhead nor hinder code optimization efforts. This paper addresses the research challenge of developing fault tolerance techniques with minimal impact on execution and optimization. To this end, we propose DeLIA, a Dependability Library for Iterative Applications, designed for parallel programs that require data synchronization among all processes to maintain a globally consistent state after each iteration. DeLIA performs checkpointing and rollback of both the application's global state and each process's local state, and it incorporates interruption detection mechanisms. A key advantage of DeLIA is its flexibility: users can configure parameters such as the checkpointing frequency, the selection of data to be saved, and the specific fault tolerance techniques to be applied. To validate DeLIA, we applied it to a 3D full-waveform inversion code and measured its overhead under different configurations using two workload schedulers. We also analyzed its behavior under preemption. Our experiments revealed a maximum overhead of 8.8%, and DeLIA demonstrated its capability to detect termination signals and save the state of nodes in preemptive scenarios. Overall, the results demonstrate the suitability of DeLIA for providing fault tolerance to iterative parallel applications.
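
The checkpoint, rollback, and interruption-detection cycle described in the abstract can be illustrated with a minimal single-process sketch in Python. This is not DeLIA's API: the `Checkpointer` class, the `run` function, and the `every` parameter are hypothetical names chosen for illustration, and the sketch assumes the whole state fits in one picklable object. In a parallel program, each process would save its local state at the synchronization point that ends an iteration.

```python
import os
import pickle
import signal
import tempfile


class Checkpointer:
    """Hypothetical helper for illustration only; not DeLIA's interface."""

    def __init__(self, path, every=1):
        self.path = path          # checkpoint file location
        self.every = every        # checkpoint every `every` iterations
        self.interrupted = False  # set when a termination signal arrives

    def install_signal_handler(self, signum=signal.SIGTERM):
        # Interruption detection: only record the signal here and let the
        # main loop react at the next iteration boundary.
        signal.signal(signum, self._on_signal)

    def _on_signal(self, signum, frame):
        self.interrupted = True

    def save(self, iteration, state):
        # Write to a temporary file first, then rename atomically, so a
        # crash during the write never corrupts the previous checkpoint.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory)
        with os.fdopen(fd, "wb") as f:
            pickle.dump({"iteration": iteration, "state": state}, f)
        os.replace(tmp, self.path)

    def load(self):
        # Rollback: return the last saved record, or None on a fresh start.
        if not os.path.exists(self.path):
            return None
        with open(self.path, "rb") as f:
            return pickle.load(f)


def run(ckpt, n_iters, step, initial_state):
    """Run `step` for n_iters iterations, resuming from a checkpoint if any.

    Returns (state, iterations_completed, finished).
    """
    saved = ckpt.load()
    if saved is None:
        start, state = 0, initial_state
    else:
        start, state = saved["iteration"] + 1, saved["state"]
    for it in range(start, n_iters):
        state = step(state, it)
        # Save on the configured schedule, and always before stopping
        # after an interruption.
        if ckpt.interrupted or (it + 1) % ckpt.every == 0:
            ckpt.save(it, state)
        if ckpt.interrupted:
            return state, it + 1, False
    return state, n_iters, True
```

Calling `install_signal_handler()` before `run` lets a scheduler's termination signal trigger a final checkpoint at the next iteration boundary; restarting the program with the same checkpoint path then resumes from the last saved iteration instead of from iteration 0.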

Source journal
Computers & Geosciences (Geosciences, Multidisciplinary)
CiteScore: 9.30
Self-citation rate: 6.80%
Articles published: 164
Review time: 3.4 months
Journal description: Computers & Geosciences publishes high-impact, original research at the interface between Computer Sciences and Geosciences. Publications should apply modern computer science paradigms, whether computational or informatics-based, to address problems in the geosciences.