{"title":"构建异构分布式数据库的质量驱动方法:以数据仓库为例","authors":"Sabrina Abdellaoui, Ladjel Bellatreche, Fahima Nader","doi":"10.1109/CCGrid.2016.79","DOIUrl":null,"url":null,"abstract":"Data Warehouse (DW) is a collection of data, consolidated from several heterogeneous sources, used to perform data analysis and support decision making in an organization. Extract-Transform-Load (ETL) phase plays a crucial role in designing DW. To overcome the complexity of the ETL phase, different studies have recently proposed the use of ontologies. Ontology-based ETL approaches have been used to reduce heterogeneity between data sources and ensure automation of the ETL process. Existing studies in semantic ETL have largely focused on fulfilling functional requirements. However, the ETL process quality dimension has not been sufficiently considered by these studies. As the amount of data has exploded with the advent of big data era, dealing with quality challenges in the early stages of designing the process become more important than ever. To address this issue, we propose to keep data quality requirements at the center of the ETL phase design. We present in this paper an approach, defining the ETL process at the ontological level. We define a set of quality indicators and quantitative measures that can anticipate data quality problems and identify causes of deficiencies. Our approach checks the quality of data before loading them into the target data warehouse to avoid the propagation of corrupted data. Finally, our proposal is validated through a case study, using Oracle Semantic DataBase sources (SDBs), where each source references the Lehigh University BenchMark ontology (LUBM).","PeriodicalId":103641,"journal":{"name":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","volume":"50 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2016-05-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":"{\"title\":\"A Quality-Driven Approach for Building Heterogeneous Distributed Databases: The Case of Data Warehouses\",\"authors\":\"Sabrina Abdellaoui, Ladjel Bellatreche, Fahima Nader\",\"doi\":\"10.1109/CCGrid.2016.79\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Data Warehouse (DW) is a collection of data, consolidated from several heterogeneous sources, used to perform data analysis and support decision making in an organization. Extract-Transform-Load (ETL) phase plays a crucial role in designing DW. To overcome the complexity of the ETL phase, different studies have recently proposed the use of ontologies. Ontology-based ETL approaches have been used to reduce heterogeneity between data sources and ensure automation of the ETL process. Existing studies in semantic ETL have largely focused on fulfilling functional requirements. However, the ETL process quality dimension has not been sufficiently considered by these studies. As the amount of data has exploded with the advent of big data era, dealing with quality challenges in the early stages of designing the process become more important than ever. To address this issue, we propose to keep data quality requirements at the center of the ETL phase design. We present in this paper an approach, defining the ETL process at the ontological level. We define a set of quality indicators and quantitative measures that can anticipate data quality problems and identify causes of deficiencies. 
Our approach checks the quality of data before loading them into the target data warehouse to avoid the propagation of corrupted data. Finally, our proposal is validated through a case study, using Oracle Semantic DataBase sources (SDBs), where each source references the Lehigh University BenchMark ontology (LUBM).\",\"PeriodicalId\":103641,\"journal\":{\"name\":\"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)\",\"volume\":\"50 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2016-05-16\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"3\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/CCGrid.2016.79\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2016 16th IEEE/ACM International Symposium on Cluster, Cloud and Grid Computing (CCGrid)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/CCGrid.2016.79","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
A Data Warehouse (DW) is a collection of data, consolidated from several heterogeneous sources, used to perform data analysis and support decision making in an organization. The Extract-Transform-Load (ETL) phase plays a crucial role in designing a DW. To manage the complexity of the ETL phase, several recent studies have proposed the use of ontologies. Ontology-based ETL approaches reduce the heterogeneity between data sources and enable automation of the ETL process. Existing work on semantic ETL has largely focused on fulfilling functional requirements; the quality dimension of the ETL process, however, has not been sufficiently considered. As data volumes have exploded with the advent of the big data era, addressing quality challenges in the early stages of process design has become more important than ever. To address this issue, we propose keeping data quality requirements at the center of ETL phase design. In this paper we present an approach that defines the ETL process at the ontological level. We define a set of quality indicators and quantitative measures that anticipate data quality problems and identify the causes of deficiencies. Our approach checks the quality of data before loading it into the target data warehouse, to avoid propagating corrupted data. Finally, our proposal is validated through a case study using Oracle Semantic DataBase sources (SDBs), where each source references the Lehigh University BenchMark ontology (LUBM).
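The abstract describes quality indicators evaluated against ontology-referencing sources before data is loaded into the warehouse. The snippet below is a minimal sketch of that idea, not the authors' implementation: it uses rdflib to compute a hypothetical completeness indicator over an RDF dump of one source (the fraction of LUBM GraduateStudent instances that have an advisor link) and gates loading on a threshold. The specific rule, the 0.95 threshold, and the file name are illustrative assumptions.

```python
# A minimal sketch, assuming an RDF dump of one source and the standard
# LUBM (univ-bench) ontology namespace; the completeness rule and the
# threshold are illustrative assumptions, not taken from the paper.
from rdflib import Graph, Namespace, RDF

UB = Namespace("http://swat.cse.lehigh.edu/onto/univ-bench.owl#")

def completeness_indicator(g: Graph) -> float:
    """Fraction of GraduateStudent instances with at least one advisor."""
    students = set(g.subjects(RDF.type, UB.GraduateStudent))
    if not students:
        return 1.0  # vacuously complete: no instances to check
    with_advisor = sum(1 for s in students if (s, UB.advisor, None) in g)
    return with_advisor / len(students)

def may_load(source_file: str, threshold: float = 0.95) -> bool:
    """Gate a source before it is loaded into the target data warehouse."""
    g = Graph()
    g.parse(source_file)  # RDF format inferred from the file extension
    return completeness_indicator(g) >= threshold

if __name__ == "__main__":
    # Hypothetical Turtle export of one semantic database source.
    print(may_load("source1.ttl"))
```

In the spirit of the paper's pre-load checks, a gate like this would run per source during the ETL phase, so that a deficient source is flagged (and its cause diagnosed via the failing indicator) instead of propagating corrupted data into the DW.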