Quality aspects of annotated data

Jacob Beck
{"title":"Quality aspects of annotated data","authors":"Jacob Beck","doi":"10.1007/s11943-023-00332-y","DOIUrl":null,"url":null,"abstract":"<div><p>The quality of Machine Learning (ML) applications is commonly assessed by quantifying how well an algorithm fits its respective training data. Yet, a perfect model that learns from and reproduces erroneous data will always be flawed in its real-world application. Hence, a comprehensive assessment of ML quality must include an additional data perspective, especially for models trained on human-annotated data. For the collection of human-annotated training data, best practices often do not exist and leave researchers to make arbitrary decisions when collecting annotations. Decisions about the selection of annotators or label options may affect training data quality and model performance.</p><p>In this paper, I will outline and summarize previous research and approaches to the collection of annotated training data. I look at data annotation and its quality confounders from two perspectives: the set of <i>annotators</i> and the <i>strategy</i> of data collection. The paper will highlight the various implementations of text and image annotation collection and stress the importance of careful task construction. I conclude by illustrating the consequences for future research and applications of data annotation. The paper is intended give readers a starting point on annotated data quality research and stress the necessity of thoughtful consideration of the annotation collection process to researchers and practitioners.</p></div>","PeriodicalId":100134,"journal":{"name":"AStA Wirtschafts- und Sozialstatistisches Archiv","volume":"17 3-4","pages":"331 - 353"},"PeriodicalIF":0.0000,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11943-023-00332-y.pdf","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"AStA Wirtschafts- und Sozialstatistisches Archiv","FirstCategoryId":"1085","ListUrlMain":"https://link.springer.com/article/10.1007/s11943-023-00332-y","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

The quality of Machine Learning (ML) applications is commonly assessed by quantifying how well an algorithm fits its respective training data. Yet, a perfect model that learns from and reproduces erroneous data will always be flawed in its real-world application. Hence, a comprehensive assessment of ML quality must include an additional data perspective, especially for models trained on human-annotated data. For the collection of human-annotated training data, best practices often do not exist and leave researchers to make arbitrary decisions when collecting annotations. Decisions about the selection of annotators or label options may affect training data quality and model performance.

In this paper, I will outline and summarize previous research and approaches to the collection of annotated training data. I look at data annotation and its quality confounders from two perspectives: the set of annotators and the strategy of data collection. The paper will highlight the various implementations of text and image annotation collection and stress the importance of careful task construction. I conclude by illustrating the consequences for future research and applications of data annotation. The paper is intended give readers a starting point on annotated data quality research and stress the necessity of thoughtful consideration of the annotation collection process to researchers and practitioners.

查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
注释数据的质量问题
机器学习(ML)应用的质量通常是通过量化算法与各自训练数据的匹配程度来评估的。然而,从错误数据中学习并再现错误数据的完美模型在实际应用中总是存在缺陷。因此,对人工智能质量的全面评估必须包括额外的数据视角,特别是对于根据人类标注数据训练的模型。在收集人工标注的训练数据方面,通常不存在最佳实践,研究人员在收集标注时只能随意做出决定。在本文中,我将概述和总结以往收集注释训练数据的研究和方法。我将从两个角度来探讨数据注释及其质量问题:注释者的集合和数据收集策略。本文将重点介绍文本和图像注释收集的各种实现方法,并强调仔细构建任务的重要性。最后,我将说明数据标注对未来研究和应用的影响。本文旨在为读者提供一个注释数据质量研究的起点,并强调研究人员和从业人员在注释收集过程中深思熟虑的必要性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Vorwort der Herausgeber Connecting algorithmic fairness to quality dimensions in machine learning in official statistics and survey production Automated Bayesian variable selection methods for binary regression models with missing covariate data Fairness als Qualitätskriterium im Maschinellen Lernen – Rekonstruktion des philosophischen Konzepts und Implikationen für die Nutzung außergesetzlicher Merkmale bei qualifizierten Mietspiegeln Interview mit Walter Krämer
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1