A Zone Reference Model for Enterprise-Grade Data Lake Management

Corinna Giebler, Christoph Gröger, Eva Hoos, H. Schwarz, B. Mitschang
{"title":"A Zone Reference Model for Enterprise-Grade Data Lake Management","authors":"Corinna Giebler, Christoph Gröger, Eva Hoos, H. Schwarz, B. Mitschang","doi":"10.1109/EDOC49727.2020.00017","DOIUrl":null,"url":null,"abstract":"Data lakes are on the rise as data platforms for any kind of analytics, from data exploration to machine learning. They achieve the required flexibility by storing heterogeneous data in their raw format, and by avoiding the need for pre-defined use cases. However, storing only raw data is inefficient, as for many applications, the same data processing has to be applied repeatedly. To foster the reuse of processing steps, literature proposes to store data in different degrees of processing in addition to their raw format. To this end, data lakes are typically structured in zones. There exists various zone models, but they are varied, vague, and no assessments are given. It is unclear which of these zone models is applicable in a practical data lake implementation in enterprises. In this work, we assess existing zone models using requirements derived from multiple representative data analytics use cases of a real-world industry case. We identify the shortcomings of existing work and develop a zone reference model for enterprise-grade data lake management in a detailed manner. We assess the reference model’s applicability through a prototypical implementation for a real-world enterprise data lake use case. This assessment shows that the zone reference model meets the requirements relevant in practice and is ready for industry use.","PeriodicalId":409420,"journal":{"name":"2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC)","volume":"16 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"10","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 24th International Enterprise Distributed Object Computing Conference (EDOC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/EDOC49727.2020.00017","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 10

Abstract

Data lakes are on the rise as data platforms for any kind of analytics, from data exploration to machine learning. They achieve the required flexibility by storing heterogeneous data in their raw format, and by avoiding the need for pre-defined use cases. However, storing only raw data is inefficient, as for many applications, the same data processing has to be applied repeatedly. To foster the reuse of processing steps, literature proposes to store data in different degrees of processing in addition to their raw format. To this end, data lakes are typically structured in zones. There exists various zone models, but they are varied, vague, and no assessments are given. It is unclear which of these zone models is applicable in a practical data lake implementation in enterprises. In this work, we assess existing zone models using requirements derived from multiple representative data analytics use cases of a real-world industry case. We identify the shortcomings of existing work and develop a zone reference model for enterprise-grade data lake management in a detailed manner. We assess the reference model’s applicability through a prototypical implementation for a real-world enterprise data lake use case. This assessment shows that the zone reference model meets the requirements relevant in practice and is ready for industry use.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
企业级数据湖管理的区域参考模型
作为从数据探索到机器学习的各种分析的数据平台,数据湖正在崛起。它们通过以原始格式存储异构数据和避免需要预定义的用例来实现所需的灵活性。但是,只存储原始数据是低效的,因为对于许多应用程序,必须重复应用相同的数据处理。为了促进处理步骤的重用,文献建议在原始格式之外,以不同程度的处理方式存储数据。为此,数据湖通常按区域进行结构化。虽然存在着各种区域模型,但它们都是多种多样的、模糊的,而且没有给出评价。目前还不清楚哪些区域模型适用于企业的实际数据湖实施。在这项工作中,我们使用来自真实行业案例的多个代表性数据分析用例的需求来评估现有的区域模型。我们发现了现有工作的不足,并详细地开发了企业级数据湖管理的区域参考模型。我们通过一个真实企业数据湖用例的原型实现来评估参考模型的适用性。评估结果显示,该区域参考模型符合相关实践要求,可供工业使用。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
How Business Process Benchmarks Enable Organizations To Improve Performance Current Practices in the Usage of Inter-Enterprise Architecture Models for the Management of Business Ecosystems Verifying Compliance of Process Compositions Through Certification of its Components Transforming e3value models into ArchiMate diagrams An open architecture for complex event processing with machine learning
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1