Cleenex: Support for User Involvement During an Iterative Data Cleaning Process

João L. M. Pereira, Manuel J. Fonseca, Antónia Lopes, H. Galhardas
Journal of Data and Information Quality · Journal Article · Published 2024-02-15 · DOI: 10.1145/3648476 (https://doi.org/10.1145/3648476)

Abstract
The existence of large amounts of data increases the probability that data quality problems occur. A data cleaning process that corrects these problems is usually iterative, because it may need to be re-executed and refined to produce high-quality data. Moreover, because some data quality problems are highly specific and data cleaning programs cannot cover all problems, a user often has to be involved during program execution by manually repairing data. However, there is no data cleaning framework that appropriately supports this involvement, a form of human-in-the-loop, in such an iterative process to clean structured data. Moreover, data preparation tools that somehow involve the user in data cleaning processes have not been evaluated with real users to assess their effort.

Therefore, we propose Cleenex, a data cleaning framework with support for user involvement during an iterative data cleaning process, and conduct two experimental evaluations: an assessment, with a simulated user, of the Cleenex components that support the user when manually repairing data; and a comparison, in terms of user involvement, of data preparation tools with real users.

Results show that the Cleenex components reduce user effort when manually cleaning data during a data cleaning process; for example, the number of tuples visualized is reduced by 99%. Moreover, when performing data cleaning tasks with Cleenex, real users need less time and effort (e.g., half the clicks) and, based on questionnaires, prefer it to the other tools used for comparison, OpenRefine and Pentaho Data Integration.