A clustering approach for data quality results of research information systems

Information Discovery and Delivery (IF 2.1, JCR Q2, Information Science & Library Science). Pub Date: 2022-11-03. DOI: 10.1108/idd-07-2022-0063
Reza Edris Abadi, M. Ershadi, S. T. A. Niaki
Citations: 1

Abstract

Purpose
The overall goal of data mining is to extract information from a large data set and make it understandable for further use. When working with large volumes of unstructured data in research information systems, the information must first be checked for quality and then divided into logical groupings before it is analyzed. Data quality results are also valuable resources for defining the quality excellence programs of any information system. Hence, this study aims to discover and extract knowledge for evaluating and improving data quality in research information systems.

Design/methodology/approach
Clustering the data and exploiting the outputs gives practitioners an in-depth and extensive look at their information and lets them form logical structures based on what they find. In this study, data extracted from an information system are used in the first stage. The data quality results are then classified into an organized structure based on data quality dimension standards. Next, K-Means clustering, density-based clustering (density-based spatial clustering of applications with noise [DBSCAN]) and hierarchical clustering (balanced iterative reducing and clustering using hierarchies [BIRCH]) are applied and compared to find the most appropriate clustering algorithm for the research information system.

Findings
The paper shows that the quality control results of an information system can be categorized through well-known data quality dimensions, including precision, accuracy, completeness, consistency, reputation and timeliness. Among the clustering approaches compared, the hierarchical BIRCH algorithm clusters the data best and gives the highest silhouette coefficient. DBSCAN comes next and performs better than K-Means.

Research limitations/implications
In the data quality assessment process, the discrepancies identified and the lack of a proper classification for inconsistent data have led to unstructured reports. This makes it difficult to analyze qualitative metadata problems statistically and therefore impossible to root out the observed errors. In this study, the data quality evaluation results have therefore been categorized into data quality dimensions, and multiple data mining analyses have been performed on that basis.

Originality/value
Although several studies have assessed the data quality results of research information systems, extracting knowledge from the resulting data quality scores is crucial work that has rarely been studied in the literature. In addition, clustering in data quality analysis and exploiting the outputs gives practitioners an in-depth and extensive look at their information and lets them form logical structures based on what they find.
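The abstract ranks the three algorithms by their silhouette coefficient but does not show how that score is obtained. The following is a minimal sketch of the standard silhouette definition in plain Python, not the authors' implementation. The function name, the toy points and the two labelings are hypothetical and stand in for the cluster assignments that K-Means, DBSCAN or BIRCH would produce on real data quality scores.

```python
from math import dist
from statistics import mean


def silhouette_coefficient(points, labels):
    """Mean silhouette over all points; labels[i] is the cluster of points[i]."""
    # Group the points by their cluster label.
    clusters = {}
    for p, lab in zip(points, labels):
        clusters.setdefault(lab, []).append(p)
    if len(clusters) < 2:
        raise ValueError("the silhouette needs at least two clusters")

    scores = []
    for p, lab in zip(points, labels):
        own = clusters[lab]
        if len(own) == 1:
            scores.append(0.0)  # usual convention for a singleton cluster
            continue
        # a: mean distance from p to the other members of its own cluster
        # (dist(p, p) is 0, so dividing by len(own) - 1 excludes p itself).
        a = sum(dist(p, q) for q in own) / (len(own) - 1)
        # b: smallest mean distance from p to the members of any other cluster.
        b = min(
            mean(dist(p, q) for q in other)
            for other_lab, other in clusters.items()
            if other_lab != lab
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return mean(scores)


# Hypothetical toy data: two well-separated groups of three points each.
points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
good_labels = [0, 0, 0, 1, 1, 1]   # matches the visible groups
poor_labels = [0, 1, 0, 1, 0, 1]   # mixes the two groups

good_score = silhouette_coefficient(points, good_labels)
poor_score = silhouette_coefficient(points, poor_labels)
print(f"good labeling: {good_score:.3f}, poor labeling: {poor_score:.3f}")
```

Scores lie between -1 and 1, and a higher value means tighter, better-separated clusters, which is the sense in which the abstract reports BIRCH ahead of DBSCAN and K-Means. This sketch treats every label as an ordinary cluster, so noise points flagged by a density-based method would need to be filtered out or handled separately before scoring.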
Source journal
Information Discovery and Delivery (Information Science & Library Science)
CiteScore: 5.40
Self-citation rate: 4.80%
Articles published: 21
Journal description: Information Discovery and Delivery covers information discovery and access for digital information researchers. This includes educators, knowledge professionals in education and cultural organisations, knowledge managers in media, health care and government, as well as librarians. The journal publishes research and practice that explore the digital information supply chain, i.e. transport, flows, tracking, exchange and sharing, including within and between libraries. It is also interested in digital information capture, packaging and storage by 'collectors' of all kinds. Information is widely defined, including but not limited to: records, documents, learning objects, visual and sound files, data and metadata, and user-generated content.
Latest articles from this journal
Visualizing the evolution of touchscreen research by scientometric analysis
Analyzing user sentiments toward selected content management software: a sentiment analysis of viewer’s comments on YouTube
Usability testing of a website through different devices: a task-based approach in a public university setting in Bangladesh
Exploring Information Systems (IS) curricula: a semantic analysis approach
Examines the value of cloud computing adoption as a proxy for IT flexibility and effectiveness