Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage

G. Vernik, M. Factor, E. K. Kolodner, P. Michiardi, Effi Ofer, Francesco Pace
{"title":"Stocator: Providing High Performance and Fault Tolerance for Apache Spark Over Object Storage","authors":"G. Vernik, M. Factor, E. K. Kolodner, P. Michiardi, Effi Ofer, Francesco Pace","doi":"10.1109/CCGRID.2018.00073","DOIUrl":null,"url":null,"abstract":"Until now object storage has not been a first-class citizen of the Apache Hadoop ecosystem including Apache Spark. Hadoop connectors to object storage have been based on file semantics, an impedance mismatch, which leads to low performance and the need for an additional consistent storage system to achieve fault tolerance. In particular, Hadoop depends on its underlying storage system and its associated connector for fault tolerance and allowing speculative execution. However, these characteristics are obtained through file operations that are not native for object storage, and are both costly and not atomic. As a result these connectors are not efficient and more importantly they cannot help with fault tolerance for object storage. We introduce Stocator, whose novel algorithm achieves both high performance and fault tolerance by taking advantage of object storage semantics. This greatly decreases the number of operations on object storage as well as enabling a much simpler approach to dealing with the eventually consistent semantics typical of object storage. We have implemented Stocator and shared it in open source. Performance testing with Apache Spark shows that it can be 18 times faster for write intensive workloads and can perform 30 times fewer operations on object storage than the legacy Hadoop connectors, reducing costs both for the client and the object storage service provider.","PeriodicalId":321027,"journal":{"name":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","volume":"197 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 18th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGRID)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGRID.2018.00073","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 7

Abstract

Until now object storage has not been a first-class citizen of the Apache Hadoop ecosystem including Apache Spark. Hadoop connectors to object storage have been based on file semantics, an impedance mismatch, which leads to low performance and the need for an additional consistent storage system to achieve fault tolerance. In particular, Hadoop depends on its underlying storage system and its associated connector for fault tolerance and allowing speculative execution. However, these characteristics are obtained through file operations that are not native for object storage, and are both costly and not atomic. As a result these connectors are not efficient and more importantly they cannot help with fault tolerance for object storage. We introduce Stocator, whose novel algorithm achieves both high performance and fault tolerance by taking advantage of object storage semantics. This greatly decreases the number of operations on object storage as well as enabling a much simpler approach to dealing with the eventually consistent semantics typical of object storage. We have implemented Stocator and shared it in open source. Performance testing with Apache Spark shows that it can be 18 times faster for write intensive workloads and can perform 30 times fewer operations on object storage than the legacy Hadoop connectors, reducing costs both for the client and the object storage service provider.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Stocator:为Apache Spark Over Object Storage提供高性能和容错性
到目前为止,对象存储还不是Apache Hadoop生态系统(包括Apache Spark)的一等公民。Hadoop到对象存储的连接器是基于文件语义的,这是一种阻抗不匹配,导致性能低下,并且需要额外的一致存储系统来实现容错。特别是,Hadoop依赖于其底层存储系统及其相关连接器来实现容错和允许推测执行。然而,这些特征是通过文件操作获得的,这些操作不是对象存储的本机操作,而且成本高,而且不是原子性的。因此,这些连接器效率不高,更重要的是,它们不能帮助实现对象存储的容错。本文介绍了Stocator算法,该算法利用对象存储语义实现了高性能和容错性。这大大减少了对象存储上的操作数量,并支持一种更简单的方法来处理对象存储典型的最终一致的语义。我们已经实现了Stocator,并在开源中分享了它。使用Apache Spark进行的性能测试表明,对于写密集型工作负载,它可以比传统Hadoop连接器快18倍,在对象存储上执行的操作可以减少30倍,从而降低了客户端和对象存储服务提供商的成本。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Extreme-Scale Realistic Stencil Computations on Sunway TaihuLight with Ten Million Cores RideMatcher: Peer-to-Peer Matching of Passengers for Efficient Ridesharing Nitro: Network-Aware Virtual Machine Image Management in Geo-Distributed Clouds Improving Energy Efficiency of Database Clusters Through Prefetching and Caching Main-Memory Requirements of Big Data Applications on Commodity Server Platform
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1