A Quality-Driven Approach for Building Heterogeneous Distributed Databases: The Case of Data Warehouses

Sabrina Abdellaoui, Ladjel Bellatreche, Fahima Nader
Published in: 2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)
Publication date: 2016-05-16
DOI: 10.1109/CCGrid.2016.79 (https://doi.org/10.1109/CCGrid.2016.79)
Citations: 3

Abstract

A Data Warehouse (DW) is a collection of data, consolidated from several heterogeneous sources, that is used to perform data analysis and support decision making in an organization. The Extract-Transform-Load (ETL) phase plays a crucial role in designing a DW. To overcome the complexity of the ETL phase, several recent studies have proposed the use of ontologies. Ontology-based ETL approaches have been used to reduce heterogeneity between data sources and to automate the ETL process. Existing studies in semantic ETL have largely focused on fulfilling functional requirements; the quality dimension of the ETL process has not been sufficiently considered. As the amount of data has exploded with the advent of the big data era, dealing with quality challenges in the early stages of designing the process has become more important than ever. To address this issue, we propose to keep data quality requirements at the center of the ETL phase design. In this paper, we present an approach that defines the ETL process at the ontological level. We define a set of quality indicators and quantitative measures that can anticipate data quality problems and identify the causes of deficiencies. Our approach checks the quality of data before loading it into the target data warehouse, to avoid the propagation of corrupted data. Finally, our proposal is validated through a case study using Oracle Semantic DataBase sources (SDBs), where each source references the Lehigh University BenchMark (LUBM) ontology.
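The idea of checking data quality before loading can be illustrated with a minimal sketch. The abstract does not name the paper's indicators or measures, so everything below is an assumption made for illustration: the two measures (completeness and uniqueness), the 0.95 threshold, the function names `assess_batch` and `load_if_clean`, and the LUBM-style field names are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass


@dataclass
class QualityReport:
    completeness: float  # share of required fields that are non-empty
    uniqueness: float    # share of records whose key value is distinct
    passed: bool         # True when both measures reach the threshold


def assess_batch(records, required_fields, key_field, threshold=0.95):
    """Compute two illustrative quality measures for a batch of records.

    Hypothetical helper: the measures and the threshold are assumptions,
    not the indicators defined in the paper.
    """
    if not records:
        return QualityReport(0.0, 0.0, False)
    total_cells = len(records) * len(required_fields)
    filled = sum(
        1
        for record in records
        for field in required_fields
        if record.get(field) not in (None, "")
    )
    completeness = filled / total_cells if total_cells else 1.0
    keys = [record.get(key_field) for record in records]
    uniqueness = len(set(keys)) / len(keys)
    passed = completeness >= threshold and uniqueness >= threshold
    return QualityReport(completeness, uniqueness, passed)


def load_if_clean(records, warehouse, **kwargs):
    """Append records to the target only if the batch passes the checks."""
    report = assess_batch(records, **kwargs)
    if report.passed:
        warehouse.extend(records)
    return report
```

In this sketch a batch that falls below the threshold is rejected as a whole and the report states which measure was low, which loosely mirrors the goal of stopping corrupted data before it reaches the warehouse and pointing to the cause of the deficiency.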