Constance: An Intelligent Data Lake System

Rihan Hai, Sandra Geisler, C. Quix
Published in: Proceedings of the 2016 International Conference on Management of Data
Publication date: 2016-06-26
DOI: 10.1145/2882903.2899389 (https://doi.org/10.1145/2882903.2899389)
Citations: 199

Abstract

As the challenge of our time, Big Data still poses many open research problems, especially regarding the variety of data. The high diversity of data sources often results in information silos: collections of non-integrated data management systems with heterogeneous schemas, query languages, and APIs. Data Lake systems have been proposed as a solution to this problem, providing a schema-less repository for raw data with a common access interface. However, just dumping all data into a data lake without any metadata management would only lead to a 'data swamp'. To avoid this, we propose Constance, a Data Lake system with sophisticated metadata management over raw data extracted from heterogeneous data sources. Constance discovers, extracts, and summarizes the structural metadata from the data sources, and annotates data and metadata with semantic information to avoid ambiguities. With embedded query rewriting engines supporting structured and semi-structured data, Constance provides users with a unified interface for query processing and data exploration. During the demo, we will walk through each functional component of Constance and apply it to two real-life use cases to show attendees the importance and usefulness of our generic and extensible data lake system.
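To make the idea of summarizing structural metadata from schema-less raw data more concrete, the following is a minimal sketch in Python. It is not Constance's implementation: the function name `summarize_structure`, the path notation, and the sample records are all hypothetical and chosen only for illustration. It walks semi-structured (JSON-like) records and records which value types were observed under each attribute path.

```python
from collections import defaultdict
from typing import Any, Dict, Iterable, Set


def summarize_structure(records: Iterable[Dict[str, Any]]) -> Dict[str, Set[str]]:
    """Map each attribute path to the set of value types observed for it.

    Hypothetical helper for illustration; not part of Constance.
    """
    summary: Dict[str, Set[str]] = defaultdict(set)

    def walk(value: Any, path: str) -> None:
        if isinstance(value, dict):
            # Descend into nested objects, extending the dotted path.
            for key, child in value.items():
                walk(child, f"{path}.{key}" if path else key)
        elif isinstance(value, list):
            # All array elements share one path, marked with "[]".
            for item in value:
                walk(item, f"{path}[]")
        else:
            # Leaf value: record the name of its Python type.
            summary[path].add(type(value).__name__)

    for record in records:
        walk(record, "")
    return dict(summary)


# Assumed sample data: two raw records with differing structure.
records = [
    {"id": 1, "name": "sensor-a", "location": {"lat": 50.78, "lon": 6.08}},
    {"id": "2", "name": "sensor-b", "tags": ["indoor", "test"]},
]
summary = summarize_structure(records)
```

In this sketch, a path that maps to more than one type (here `id`, seen as both an integer and a string) is the kind of structural ambiguity that metadata management would need to surface before queries can be rewritten against the raw data.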