Performance under Failures of DAG-based Parallel Computing

2009 9th IEEE/ACM International Symposium on Cluster Computing and the Grid. Pub Date: 2009-05-18. DOI: 10.1109/CCGRID.2009.55

Hui Jin, Xian-He Sun, Ziming Zheng, Z. Lan, Bing Xie


Citations: 16



Performance under Failures of DAG-based Parallel Computing

As the scale and complexity of parallel systems continue to grow, failures are increasingly unavoidable when solving large-scale applications. In this research, we present an analytical study that estimates the execution time, in the presence of failures, of scientific applications based on directed acyclic graphs (DAGs), and we provide a guideline for performance optimization. The study is fourfold. We first introduce a performance model to predict the computation time of an individual subtask under failures. Next, a layered, iterative approach is adopted to transform a DAG into a layered DAG, which reflects the full dependencies among all the subtasks. Then, the expected execution time of the DAG under failures is derived based on stochastic analysis. Unlike existing models, the newly proposed performance model provides both the variance and the distribution. It is practical and can be put to real use. Finally, based on the model, performance optimization, weak point identification, and enhancement are proposed. Intensive simulations with real system traces are conducted to verify the analytical findings. They show that the newly proposed model and the weak point enhancement mechanism work well.
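The abstract does not spell out how the layered DAG is built, so the following is only a minimal sketch of one common layering scheme, not the authors' iterative procedure. It assumes that a subtask's layer is the length of the longest dependency path leading to it, so every subtask sits strictly below all of its predecessors. The function name `layer_dag` and the five-subtask workflow are hypothetical and chosen purely for illustration.

```python
from collections import defaultdict, deque


def layer_dag(edges, nodes):
    """Group nodes into layers by longest-path depth from any source node.

    Assumption (not taken from the paper): a node's layer equals the length
    of the longest dependency chain that ends at it.
    """
    successors = defaultdict(list)
    indegree = {n: 0 for n in nodes}
    for u, v in edges:
        successors[u].append(v)
        indegree[v] += 1

    depth = {n: 0 for n in nodes}
    ready = deque(n for n in nodes if indegree[n] == 0)
    visited = 0
    while ready:
        u = ready.popleft()
        visited += 1
        for v in successors[u]:
            # A successor must sit at least one layer below each predecessor.
            depth[v] = max(depth[v], depth[u] + 1)
            indegree[v] -= 1
            if indegree[v] == 0:
                ready.append(v)

    if visited != len(nodes):
        raise ValueError("graph has a cycle, so it is not a DAG")

    layers = defaultdict(list)
    for n in nodes:
        layers[depth[n]].append(n)
    return [layers[i] for i in range(len(layers))]


# Hypothetical workflow: "e" depends directly on "a" and also on "d".
nodes = ["a", "b", "c", "d", "e"]
edges = [("a", "b"), ("a", "c"), ("b", "d"), ("c", "d"), ("a", "e"), ("d", "e")]
layers = layer_dag(edges, nodes)
print(layers)
```

In this sketch, subtasks in the same layer have no dependencies on one another, which is what would allow a per-layer analysis of completion time. How the paper combines the per-subtask failure model across layers is not stated in the abstract and is not reproduced here.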
