{"title":"注释数据的质量问题","authors":"Jacob Beck","doi":"10.1007/s11943-023-00332-y","DOIUrl":null,"url":null,"abstract":"<div><p>The quality of Machine Learning (ML) applications is commonly assessed by quantifying how well an algorithm fits its respective training data. Yet, a perfect model that learns from and reproduces erroneous data will always be flawed in its real-world application. Hence, a comprehensive assessment of ML quality must include an additional data perspective, especially for models trained on human-annotated data. For the collection of human-annotated training data, best practices often do not exist and leave researchers to make arbitrary decisions when collecting annotations. Decisions about the selection of annotators or label options may affect training data quality and model performance.</p><p>In this paper, I will outline and summarize previous research and approaches to the collection of annotated training data. I look at data annotation and its quality confounders from two perspectives: the set of <i>annotators</i> and the <i>strategy</i> of data collection. The paper will highlight the various implementations of text and image annotation collection and stress the importance of careful task construction. I conclude by illustrating the consequences for future research and applications of data annotation. The paper is intended give readers a starting point on annotated data quality research and stress the necessity of thoughtful consideration of the annotation collection process to researchers and practitioners.</p></div>","PeriodicalId":100134,"journal":{"name":"AStA Wirtschafts- und Sozialstatistisches Archiv","volume":"17 3-4","pages":"331 - 353"},"PeriodicalIF":0.0000,"publicationDate":"2023-11-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://link.springer.com/content/pdf/10.1007/s11943-023-00332-y.pdf","citationCount":"0","resultStr":"{\"title\":\"Quality aspects of annotated data\",\"authors\":\"Jacob Beck\",\"doi\":\"10.1007/s11943-023-00332-y\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>The quality of Machine Learning (ML) applications is commonly assessed by quantifying how well an algorithm fits its respective training data. Yet, a perfect model that learns from and reproduces erroneous data will always be flawed in its real-world application. Hence, a comprehensive assessment of ML quality must include an additional data perspective, especially for models trained on human-annotated data. For the collection of human-annotated training data, best practices often do not exist and leave researchers to make arbitrary decisions when collecting annotations. Decisions about the selection of annotators or label options may affect training data quality and model performance.</p><p>In this paper, I will outline and summarize previous research and approaches to the collection of annotated training data. I look at data annotation and its quality confounders from two perspectives: the set of <i>annotators</i> and the <i>strategy</i> of data collection. The paper will highlight the various implementations of text and image annotation collection and stress the importance of careful task construction. I conclude by illustrating the consequences for future research and applications of data annotation. 
The paper is intended give readers a starting point on annotated data quality research and stress the necessity of thoughtful consideration of the annotation collection process to researchers and practitioners.</p></div>\",\"PeriodicalId\":100134,\"journal\":{\"name\":\"AStA Wirtschafts- und Sozialstatistisches Archiv\",\"volume\":\"17 3-4\",\"pages\":\"331 - 353\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-11-27\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://link.springer.com/content/pdf/10.1007/s11943-023-00332-y.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"AStA Wirtschafts- und Sozialstatistisches Archiv\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://link.springer.com/article/10.1007/s11943-023-00332-y\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"AStA Wirtschafts- und Sozialstatistisches Archiv","FirstCategoryId":"1085","ListUrlMain":"https://link.springer.com/article/10.1007/s11943-023-00332-y","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
The quality of Machine Learning (ML) applications is commonly assessed by quantifying how well an algorithm fits its respective training data. Yet, a perfect model that learns from and reproduces erroneous data will always be flawed in its real-world application. Hence, a comprehensive assessment of ML quality must include an additional data perspective, especially for models trained on human-annotated data. For the collection of human-annotated training data, best practices often do not exist, leaving researchers to make arbitrary decisions when collecting annotations. Decisions about the selection of annotators or label options may affect training data quality and model performance.
In this paper, I will outline and summarize previous research and approaches to the collection of annotated training data. I look at data annotation and its quality confounders from two perspectives: the set of annotators and the strategy of data collection. The paper will highlight the various implementations of text and image annotation collection and stress the importance of careful task construction. I conclude by illustrating the consequences for future research and applications of data annotation. The paper is intended to give readers a starting point for research on annotated data quality and to stress to researchers and practitioners the necessity of thoughtful consideration of the annotation collection process.
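As a minimal illustration of the kind of annotation-quality check the abstract alludes to (not taken from the paper itself), the sketch below computes Cohen's kappa, a common chance-corrected measure of agreement between two annotators; the annotator names and labels are invented for the example.

```python
# Minimal sketch (illustrative only): measuring agreement between two
# hypothetical annotators as a proxy for annotated-data quality.
from collections import Counter

def cohen_kappa(labels_a, labels_b):
    """Chance-corrected agreement between two annotators' label lists."""
    assert labels_a and len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Expected agreement if both annotators labelled at random according to
    # their own marginal label distributions.
    expected = sum((freq_a[c] / n) * (freq_b[c] / n) for c in freq_a | freq_b)
    if expected == 1.0:  # both used a single identical label throughout
        return 1.0
    return (observed - expected) / (1 - expected)

# Two hypothetical annotators labelling the same five texts for sentiment.
ann_1 = ["pos", "neg", "neg", "pos", "neu"]
ann_2 = ["pos", "neg", "pos", "pos", "neu"]
print(f"Cohen's kappa: {cohen_kappa(ann_1, ann_2):.2f}")  # ~0.69
```

Low agreement between annotators of this kind is one symptom of the annotation-collection issues the paper surveys, such as ambiguous label options or poorly constructed tasks.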