Approaches for Implementing Persistent Queues within Data-Intensive Scientific Workflows

Michael Agun, S. Bowers
Published in: 2011 IEEE World Congress on Services, 2011-07-04
DOI: 10.1109/SERVICES.2011.57 (https://doi.org/10.1109/SERVICES.2011.57)
Citations: 4

Abstract

Many scientific workflow systems are built on dataflow-based models of computation in which data drives the execution of workflow components. Two advantages of dataflow models are their straightforward semantics (which include support for branching, merging, and looping) and their ability to execute workflow steps concurrently. However, for many data-intensive workflows the dataflow model requires data buffering. Current systems largely perform buffering through in-memory queues, which can lead to buffer overflow and performance degradation as queues reach capacity (e.g., because of paging). We describe an alternative framework that leverages external storage to implement buffers (which we refer to as persistent queues) within data-intensive scientific workflows. Our framework can easily be used with different underlying storage technologies, and we consider and evaluate three distinct approaches: a traditional relational database implementation, a non-relational implementation designed for fast reads and writes, and a specialized approach that can further reduce external buffering overhead. In addition, the use of persistent queues can provide detailed provenance information "for free" by capturing the input and output information of each workflow component during workflow execution. Although many systems provide such provenance information, we show how this information can be captured efficiently and used to improve overall workflow performance through persistent queues.
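
To make the idea of a persistent queue concrete, the following is a minimal sketch of the relational-database variant, assuming SQLite as the external store. It is not the authors' implementation: the class name PersistentQueue, the tokens table, and the consumed flag are all hypothetical choices made for illustration. Consumed items are flagged rather than deleted, which is one simple way the queue's contents could double as a provenance record of what flowed between two workflow components.

```python
import pickle
import sqlite3


class PersistentQueue:
    """FIFO buffer between two workflow components, backed by SQLite.

    Hypothetical design for illustration only; the table layout and
    method names are assumptions, not taken from the paper.
    """

    def __init__(self, path=":memory:"):
        # Pass a file path instead of ":memory:" to buffer on disk.
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            "CREATE TABLE IF NOT EXISTS tokens ("
            " seq INTEGER PRIMARY KEY AUTOINCREMENT,"
            " payload BLOB NOT NULL,"
            " consumed INTEGER NOT NULL DEFAULT 0)"
        )
        self.conn.commit()

    def put(self, item):
        """Append an item; the insert is committed when the block exits."""
        with self.conn:
            self.conn.execute(
                "INSERT INTO tokens (payload) VALUES (?)",
                (pickle.dumps(item),),
            )

    def get(self):
        """Return the oldest unconsumed item, or None if the queue is empty."""
        with self.conn:
            row = self.conn.execute(
                "SELECT seq, payload FROM tokens"
                " WHERE consumed = 0 ORDER BY seq LIMIT 1"
            ).fetchone()
            if row is None:
                return None
            # Flag instead of delete, so the row remains as provenance.
            self.conn.execute(
                "UPDATE tokens SET consumed = 1 WHERE seq = ?", (row[0],)
            )
        return pickle.loads(row[1])

    def history(self):
        """Every item ever enqueued, in arrival order (provenance view)."""
        rows = self.conn.execute(
            "SELECT payload FROM tokens ORDER BY seq"
        ).fetchall()
        return [pickle.loads(r[0]) for r in rows]

    def __len__(self):
        """Number of items still waiting to be consumed."""
        return self.conn.execute(
            "SELECT COUNT(*) FROM tokens WHERE consumed = 0"
        ).fetchone()[0]
```

In this sketch an upstream component would call put() for each output and a downstream component would call get() for each input, so memory use stays bounded regardless of how far the producer runs ahead. The sketch is single-process and uses pickle for serialization, which is only appropriate for trusted data.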