Efficient, Failure Resilient Transactions for Parallel and Distributed Computing

2014 International Workshop on Data Intensive Scalable Computing Systems. Pub Date: 2014-11-16. DOI: 10.1109/DISCS.2014.13

J. Lofstead, Jai Dayal, I. Jimenez, C. Maltzahn


Citations: 5

Abstract



Scientific simulations are moving away from centralized persistent storage for intermediate data between workflow steps and towards an all-online model. This shift is motivated by IO bandwidth growing slowly relative to compute speed. The challenges that this shift to Integrated Application Workflows presents stem from the loss of persistent storage semantics for node-to-node communication. One step towards closing this semantics gap is to use transactions to logically delineate a data set moving from hundreds of thousands of processes to thousands of servers as an atomic unit. Our previously demonstrated Doubly Distributed Transactions (D2T) protocol offered a high-performance solution but did not explore how to detect and recover from faults; its focus was on demonstrating high performance in the typical case. The research presented here addresses fault detection and recovery based on the enhanced protocol design. The total overhead for a full transaction with multiple operations at 65,536 processes is 0.055 seconds on average. Fault detection and recovery mechanisms perform similarly to the success case, with only the addition of appropriate timeouts for the system. This paper explores the challenges of designing a recoverable protocol for doubly distributed transactions, particularly for parallel computing environments.
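The abstract attributes fault detection to timeouts layered on the normal commit path. As a rough illustration of that general idea only, the Python sketch below shows a generic timeout-based commit decision. It is not the D2T protocol, whose message flow is not described here. The names `Participant`, `run_transaction`, `delay`, and `timeout` are hypothetical, and response times are simulated as plain numbers rather than measured on a network.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Participant:
    """Hypothetical participant: a vote plus a simulated reply delay in seconds."""
    name: str
    vote_commit: bool = True
    delay: Optional[float] = 0.0  # None models a participant that never replies


def run_transaction(participants: List[Participant],
                    timeout: float) -> Tuple[str, List[str]]:
    """Generic timeout-based commit decision (an assumption, not D2T itself).

    Returns (decision, suspected): decision is "commit" or "abort", and
    suspected lists the participants whose reply missed the timeout.
    """
    # A participant is suspected of failure if it never replies or replies late.
    suspected = [p.name for p in participants
                 if p.delay is None or p.delay > timeout]
    # Among the participants that replied in time, collect explicit "no" votes.
    voted_no = [p.name for p in participants
                if p.name not in suspected and not p.vote_commit]
    # Commit only if everyone replied in time and everyone voted to commit.
    decision = "abort" if (suspected or voted_no) else "commit"
    return decision, suspected
```

In this sketch the success path and the failure path do the same work, and the only extra cost of the failure path is waiting out the timeout. That loosely mirrors the abstract's claim, but it says nothing about how the actual protocol achieves it.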

