A Tailor-made Data Quality Approach for Higher Educational Data

Journal of data and information science (Warsaw, Poland) Pub Date : 2020-07-09 DOI:10.2478/jdis-2020-0029

C. Daraio, R. Bruni, G. Catalano, Alessandro Daraio, G. Matteucci, M. Scannapieco, Daniel Wagner-Schuster, B. Lepori

Volume 5, pp. 129-160

Citations: 9



A Tailor-made Data Quality Approach for Higher Educational Data

Abstract

Purpose: This paper relates the definition of data quality procedures for knowledge organizations such as Higher Education Institutions. The main purpose is to present the flexible approach developed for monitoring the data quality of the European Tertiary Education Register (ETER) database, illustrating its functioning and highlighting the main challenges that still have to be faced in this domain.

Design/methodology/approach: The proposed data quality methodology is based on two kinds of checks, one to assess the consistency of cross-sectional data and the other to evaluate the stability of multiannual data. This methodology has an operational and empirical orientation. This means that the proposed checks do not assume any theoretical distribution for the determination of the threshold parameters that identify potential outliers, inconsistencies, and errors in the data.

Findings: We show that the proposed cross-sectional checks and multiannual checks are helpful to identify outliers, extreme observations and to detect ontological inconsistencies not described in the available meta-data. For this reason, they may be a useful complement to integrate the processing of the available information.

Research limitations: The coverage of the study is limited to European Higher Education Institutions. The cross-sectional and multiannual checks are not yet completely integrated.

Practical implications: The consideration of the quality of the available data and information is important to enhance data quality-aware empirical investigations, highlighting problems, and areas where to invest for improving the coverage and interoperability of data in future data collection initiatives.

Originality/value: The data-driven quality checks proposed in this paper may be useful as a reference for building and monitoring the data quality of new databases or of existing databases available for other countries or systems characterized by high heterogeneity and complexity of the units of analysis without relying on pre-specified theoretical distributions.
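The two kinds of checks named in the abstract can be illustrated with a minimal Python sketch. This is not the ETER procedure itself, which the abstract does not specify. It assumes one plausible cross-sectional rule (a reported total should match the sum of its parts within a tolerance) and one plausible multiannual rule (year-over-year ratios are compared against cutoffs taken from the observed ratios themselves, so no theoretical distribution is assumed). The field names `id`, `total`, and `parts`, the `tolerance` parameter, and the percentile cutoffs are all hypothetical.

```python
def cross_sectional_flags(records, tolerance=0.05):
    """Return ids of records whose total deviates from the sum of its parts.

    Hypothetical rule: relative deviation above `tolerance` is flagged.
    """
    flagged = []
    for rec in records:
        parts_sum = sum(rec["parts"])
        total = rec["total"]
        if total == 0:
            inconsistent = parts_sum != 0
        else:
            inconsistent = abs(parts_sum - total) / total > tolerance
        if inconsistent:
            flagged.append(rec["id"])
    return flagged


def empirical_cutoffs(values, lower_pct, upper_pct):
    """Nearest-rank percentiles of the observed values (no distribution assumed)."""
    ordered = sorted(values)
    n = len(ordered)

    def nearest_rank(pct):
        rank = (pct * n + 99) // 100  # integer ceiling of pct * n / 100
        return ordered[min(n - 1, max(0, rank - 1))]

    return nearest_rank(lower_pct), nearest_rank(upper_pct)


def multiannual_flags(series_by_unit, lower_pct=10, upper_pct=90):
    """Return (unit, year_index) pairs whose year-over-year ratio is extreme.

    Cutoffs come from the pooled ratios of all units, an assumed design choice.
    """
    ratios = []
    for unit, values in series_by_unit.items():
        for idx in range(1, len(values)):
            previous, current = values[idx - 1], values[idx]
            if previous > 0:  # skip undefined ratios
                ratios.append((unit, idx, current / previous))
    if not ratios:
        return []
    low, high = empirical_cutoffs([r for _, _, r in ratios], lower_pct, upper_pct)
    return [(unit, idx) for unit, idx, r in ratios if r < low or r > high]


# Made-up illustrative data, not taken from ETER.
records = [
    {"id": "A", "total": 100, "parts": [60, 40]},
    {"id": "B", "total": 100, "parts": [60, 20]},
]
students = {
    "U1": [100, 101, 102],
    "U2": [200, 202, 204],
    "U3": [50, 51, 52],
    "U4": [80, 81, 82],
    "U5": [60, 61, 180],
}
print(cross_sectional_flags(records))
print(multiannual_flags(students))
```

Flagged values in such a scheme are candidates for review rather than confirmed errors, which matches the abstract's wording of "potential outliers, inconsistencies, and errors".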


Source journal

Journal of data and information science (Warsaw, Poland)

Self-citation rate

0.00%

Number of publications