
Latest Publications: Proceedings of the ... AAAI Conference on Human Computation and Crowdsourcing

Collect, Measure, Repeat: Reliability Factors for Responsible AI Data Collection
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27547
Oana Inel, Tim Draws, Lora Aroyo
The rapid entry of machine learning approaches into our daily activities and high-stakes domains demands transparency and scrutiny of their fairness and reliability. To help gauge machine learning models' robustness, research typically focuses on the massive datasets used for their deployment, e.g., creating and maintaining documentation for understanding their origin, process of development, and ethical considerations. However, data collection for AI is still typically a one-off practice, and oftentimes datasets collected for a certain purpose or application are reused for a different problem. Additionally, dataset annotations may not be representative over time, contain ambiguous or erroneous annotations, or be unable to generalize across issues or domains. Recent research has shown these practices might lead to unfair, biased, or inaccurate outcomes. We argue that data collection for AI should be performed in a responsible manner where the quality of the data is thoroughly scrutinized and measured through a systematic set of appropriate metrics. In this paper, we propose a Responsible AI (RAI) methodology designed to guide the data collection with a set of metrics for an iterative in-depth analysis of the factors influencing the quality and reliability of the generated data. We propose a granular set of measurements to inform about the internal reliability of a dataset and its external stability over time. We validate our approach across nine existing datasets and annotation tasks and four content modalities. This approach impacts the assessment of data robustness used for AI applied in the real world, where diversity of users and content is prominent. Furthermore, it deals with fairness and accountability aspects in data collection by providing systematic and transparent quality analysis for data collections.
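The paper's metric suite is not reproduced in this listing; as a rough illustration of the kind of reliability measurement described above, the sketch below (a minimal Python example on made-up annotation batches) tracks inter-annotator agreement per collection round via Fleiss' kappa, one plausible stand-in for the paper's internal-reliability and stability measures.

```python
import numpy as np

def fleiss_kappa(counts):
    """Fleiss' kappa for an (items x categories) matrix of rating counts.
    Assumes every item was rated by the same number of annotators."""
    counts = np.asarray(counts, dtype=float)
    n_items = counts.shape[0]
    n_raters = counts.sum(axis=1)[0]
    p_j = counts.sum(axis=0) / (n_items * n_raters)                 # label prevalence
    P_i = (np.square(counts).sum(axis=1) - n_raters) / (n_raters * (n_raters - 1))
    P_bar, P_e = P_i.mean(), np.square(p_j).sum()
    return (P_bar - P_e) / (1 - P_e)

# Hypothetical annotation batches collected at different points in time; each row is
# one item, columns are counts of the 5 annotators choosing label 0 or label 1.
batches = {
    "round_1": [[5, 0], [4, 1], [3, 2], [5, 0]],
    "round_2": [[3, 2], [2, 3], [3, 2], [4, 1]],
}

# Internal reliability per batch, and its drift across rounds (external stability).
kappas = {name: fleiss_kappa(mat) for name, mat in batches.items()}
for name, k in kappas.items():
    print(f"{name}: Fleiss' kappa = {k:.3f}")
print(f"drift between rounds = {abs(kappas['round_1'] - kappas['round_2']):.3f}")
```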
Citations: 0
A Cluster-Aware Transfer Learning for Bayesian Optimization of Personalized Preference Models
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27558
Haruto Yamasaki, Masaki Matsubara, Hiroyoshi Ito, Yuta Nambu, Masahiro Kohjima, Yuki Kurauchi, Ryuji Yamamoto, Atsuyuki Morishima
Obtaining personalized models of the crowd is an important issue in various applications, such as preference acquisition and user interaction customization. However, the crowd setting, in which we assume little prior knowledge about each person, brings the cold-start problem, which may cause avoidable, unpreferable interactions with people. This paper proposes a cluster-aware transfer learning method for the Bayesian optimization of personalized models. The proposed method, called Cluster-aware Bayesian Optimization, is designed based on a known feature: user preferences are not completely independent but can be divided into clusters. It exploits the clustering information to efficiently find the preferences of the crowd while avoiding unpreferable interactions. The results of our extensive experiments with different data sets show that the method is efficient for finding the most preferable items and effective in reducing the number of unpreferable interactions.
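The algorithm itself is only summarized above; the following sketch illustrates the general idea under stated assumptions: preference profiles of known clusters act as a prior mean for a Gaussian-process surrogate, and the next item to show a new user is chosen by an upper-confidence-bound rule. The synthetic preference functions and all names are hypothetical, not the authors' implementation.

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

rng = np.random.default_rng(0)
items = np.linspace(0, 1, 50).reshape(-1, 1)              # candidate items, 1-D feature

# Hypothetical cluster-level preference profiles estimated from previous users.
cluster_priors = {
    "cluster_a": np.exp(-((items.ravel() - 0.2) ** 2) / 0.02),
    "cluster_b": np.exp(-((items.ravel() - 0.8) ** 2) / 0.02),
}
true_pref = np.exp(-((items.ravel() - 0.75) ** 2) / 0.02)  # new user's unknown preference

# Cold start: assign the new user to the closest cluster from a few probe ratings.
ratings = {int(i): true_pref[i] + rng.normal(0, 0.05)
           for i in rng.choice(len(items), size=3, replace=False)}
cluster = min(cluster_priors, key=lambda c: np.mean(
    [(cluster_priors[c][i] - r) ** 2 for i, r in ratings.items()]))

# Bayesian optimization: model the residual from the cluster prior with a GP and
# pick the next item to ask about with an upper-confidence-bound rule.
for _ in range(5):
    idx = np.array(sorted(ratings))
    resid = np.array([ratings[i] - cluster_priors[cluster][i] for i in idx])
    gp = GaussianProcessRegressor(kernel=RBF(length_scale=0.15), alpha=1e-3)
    gp.fit(items[idx], resid)
    mu, sd = gp.predict(items, return_std=True)
    ucb = cluster_priors[cluster] + mu + 1.5 * sd          # prior mean + residual + bonus
    ucb[idx] = -np.inf                                     # never re-query rated items
    nxt = int(np.argmax(ucb))
    ratings[nxt] = true_pref[nxt] + rng.normal(0, 0.05)

best = max(ratings, key=ratings.get)
print(f"assigned to {cluster}; best rated item so far at x = {items[best][0]:.2f}")
```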
Citations: 0
Does Human Collaboration Enhance the Accuracy of Identifying LLM-Generated Deepfake Texts?
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27557
Adaku Uchendu, Jooyoung Lee, Hua Shen, Thai Le, Ting-Hao 'Kenneth' Huang, Dongwon Lee
Advances in Large Language Models (e.g., GPT-4, LLaMA) have improved the generation of coherent sentences resembling human writing on a large scale, resulting in the creation of so-called deepfake texts. However, this progress poses security and privacy concerns, necessitating effective solutions for distinguishing deepfake texts from human-written ones. Although prior work has studied humans' ability to detect deepfake texts, none has examined whether “collaboration” among humans improves the detection of deepfake texts. In this study, to address this gap in the understanding of deepfake texts, we conducted experiments with two groups: (1) non-expert individuals from the AMT platform and (2) writing experts from the Upwork platform. The results demonstrate that collaboration among humans can potentially improve the detection of deepfake texts for both groups, increasing detection accuracies by 6.36% for non-experts and 12.76% for experts, respectively, compared to individuals' detection accuracies. We further analyze the explanations that humans used when identifying a piece of text as deepfake text, and find that the strongest indicator of deepfake texts is their lack of coherence and consistency. Our study provides useful insights for future tools and framework designs to facilitate the collaborative human detection of deepfake texts. The experiment datasets and AMT implementations are available at: https://github.com/huashen218/llm-deepfake-human-study.git
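The study itself had annotators deliberate rather than vote, so the following is only an idealized illustration of why pooling several imperfect judgments can beat an individual: a small simulation of independent annotators aggregated by majority vote (all accuracy numbers are hypothetical).

```python
import numpy as np

rng = np.random.default_rng(42)
n_texts, n_annotators = 2000, 5
p_individual = 0.6                  # hypothetical per-annotator accuracy on this task

# Each annotator independently labels each text correctly with probability p_individual.
correct = rng.random((n_texts, n_annotators)) < p_individual

individual_acc = correct[:, 0].mean()
majority_acc = (correct.sum(axis=1) > n_annotators / 2).mean()

print(f"single annotator accuracy ~ {individual_acc:.3f}")
print(f"majority of {n_annotators} annotators   ~ {majority_acc:.3f}")
```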
Citations: 5
Where Does My Model Underperform? A Human Evaluation of Slice Discovery Algorithms
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27548
Nari Johnson, Ángel Alexander Cabrera, Gregory Plumb, Ameet Talwalkar
Machine learning (ML) models that achieve high average accuracy can still underperform on semantically coherent subsets ("slices") of data. This behavior can have significant societal consequences for the safety or bias of the model in deployment, but identifying these underperforming slices can be difficult in practice, especially in domains where practitioners lack access to group annotations to define coherent subsets of their data. Motivated by these challenges, ML researchers have developed new slice discovery algorithms that aim to group together coherent and high-error subsets of data. However, there has been little evaluation focused on whether these tools help humans form correct hypotheses about where (for which groups) their model underperforms. We conduct a controlled user study (N = 15) where we show 40 slices output by two state-of-the-art slice discovery algorithms to users, and ask them to form hypotheses about an object detection model. Our results provide positive evidence that these tools provide some benefit over a naive baseline, and also shed light on challenges faced by users during the hypothesis formation step. We conclude by discussing design opportunities for ML and HCI researchers. Our findings point to the importance of centering users when creating and evaluating new tools for slice discovery.
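The two slice discovery algorithms evaluated in the study are not described on this page; as a generic illustration of what such tools do, the sketch below clusters example embeddings and ranks the clusters by error rate to surface candidate underperforming slices (synthetic data, not either of the evaluated algorithms).

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)

# Hypothetical model outputs: one embedding per evaluation example plus an error flag.
embeddings = rng.normal(size=(1000, 16))
errors = rng.random(1000) < 0.10
embeddings[errors] += 2.0            # pretend the mistakes share a region of feature space

# Naive slice-discovery baseline: cluster the embeddings, rank clusters by error rate.
labels = KMeans(n_clusters=8, n_init=10, random_state=0).fit_predict(embeddings)
slices = [(c, (labels == c).mean(), errors[labels == c].mean()) for c in range(8)]
print("candidate underperforming slices (cluster id, coverage, error rate):")
for c, coverage, err in sorted(slices, key=lambda t: -t[2])[:3]:
    print(f"  cluster {c}: {coverage:.1%} of data, {err:.1%} errors")
```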
Citations: 0
Rethinking Quality Assurance for Crowdsourced Multi-ROI Image Segmentation
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27552
Xiaolu Lu, David Ratcliffe, Tsu-Ting Kao, Aristarkh Tikhonov, Lester Litchfield, Craig Rodger, Kaier Wang
Collecting high-quality annotations to construct an evaluation dataset is essential for assessing the true performance of machine learning models. One popular way of performing data annotation is via crowdsourcing, where quality can be of concern. Despite much prior work addressing the annotation quality problem in crowdsourcing generally, little has been discussed in detail for image segmentation tasks. These tasks often require pixel-level annotation accuracy, and are relatively complex compared to image classification or object detection with bounding boxes. In this paper, we focus on image segmentation annotation via crowdsourcing, where images may not have been collected in a controlled way. In this setting, the task of annotating may be non-trivial, and annotators may experience difficulty in differentiating between regions-of-interest (ROIs) and background pixels. We implement an annotation process on a medical image annotation task and examine the effectiveness of several in-situ and manual quality assurance and quality control mechanisms. Our observations on this task are three-fold. Firstly, including an onboarding and a pilot phase improves quality assurance, as annotators can familiarize themselves with the task, especially when the definition of ROIs is ambiguous. Secondly, we observe high variability in annotation times, leading us to believe that annotation time cannot be relied upon as a source of information for quality control. When performing agreement analysis, we also show that global-level inter-rater agreement is insufficient to provide useful information, especially when annotator skill levels vary. Thirdly, we recognize that reviewing all annotations can be time-consuming and often infeasible, and there currently exist no mechanisms to reduce the workload for reviewers. Therefore, we propose a method to create a priority list of images for review based on inter-rater agreement. Our experiments suggest that this method can be used to improve reviewer efficiency when compared to a baseline approach, especially if a fixed work budget is required.
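As a rough sketch of the prioritization idea mentioned at the end of the abstract, the example below scores each image by the mean pairwise IoU between its annotators' masks and puts the lowest-agreement images first in the review queue; the masks and noise levels are synthetic.

```python
import numpy as np
from itertools import combinations

def pairwise_mean_iou(masks):
    """Mean IoU over all annotator pairs for one image (masks: boolean arrays)."""
    ious = []
    for a, b in combinations(masks, 2):
        union = np.logical_or(a, b).sum()
        ious.append(1.0 if union == 0 else np.logical_and(a, b).sum() / union)
    return float(np.mean(ious))

rng = np.random.default_rng(1)

# Hypothetical crowdsourced ROI masks: 3 annotators per image on a 64x64 grid,
# with one image made deliberately noisy to simulate disagreement.
images = {}
for name, noise in [("img_01", 0.02), ("img_02", 0.25), ("img_03", 0.05)]:
    base = rng.random((64, 64)) < 0.2
    images[name] = [np.logical_xor(base, rng.random((64, 64)) < noise) for _ in range(3)]

# Review priority: lowest inter-annotator agreement goes to the top of the queue.
for name in sorted(images, key=lambda n: pairwise_mean_iou(images[n])):
    print(f"{name}: mean pairwise IoU = {pairwise_mean_iou(images[name]):.3f}")
```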
Citations: 0
Informing Users about Data Imputation: Exploring the Design Space for Dealing With Non-Responses
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27544
Ananya Bhattacharjee, Haochen Song, Xuening Wu, Justice Tomlinson, Mohi Reza, Akmar Ehsan Chowdhury, Nina Deliu, Thomas W. Price, Joseph Jay Williams
Machine learning algorithms often require quantitative ratings from users to effectively predict helpful content. When these ratings are unavailable, systems make implicit assumptions or imputations to fill in the missing information; however, users are generally kept unaware of these processes. In our work, we explore ways of informing the users about system imputations, and experiment with imputed ratings and various explanations required by users to correct imputations. We investigate these approaches through the deployment of a text messaging probe to 26 participants to help them manage psychological wellbeing. We provide quantitative results to report users' reactions to correct vs incorrect imputations and potential risks of biasing their ratings. Using semi-structured interviews with participants, we characterize the potential trade-offs regarding user autonomy, and draw insights about alternative ways of involving users in the imputation process. Our findings provide useful directions for future research on communicating system imputation and interpreting user non-responses.
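The paper is about how to communicate imputation rather than a specific imputation algorithm; purely for illustration, the sketch below fills missing ratings with a simple user/item-mean blend and keeps a flag for each imputed cell, which is the kind of information an interface could surface to users.

```python
import numpy as np

# Hypothetical user-item rating matrix; NaN marks questions a user skipped.
ratings = np.array([
    [4.0, np.nan, 3.0, 5.0],
    [np.nan, 2.0, np.nan, 4.0],
    [5.0, 4.0, 4.0, np.nan],
])

imputed = np.isnan(ratings)                          # remember which cells were filled in
user_mean = np.nanmean(ratings, axis=1, keepdims=True)
item_mean = np.nanmean(ratings, axis=0, keepdims=True)
filled = np.where(imputed, (user_mean + item_mean) / 2, ratings)

# A system that informs users about imputation would surface exactly these cells.
for u, i in zip(*np.nonzero(imputed)):
    print(f"user {u}, item {i}: imputed rating {filled[u, i]:.2f} (shown as an estimate)")
```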
Citations: 0
How Crowd Worker Factors Influence Subjective Annotations: A Study of Tagging Misogynistic Hate Speech in Tweets
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27546
Danula Hettiachchi, Indigo Holcombe-James, Stephanie Livingstone, Anjalee De Silva, Matthew Lease, Flora D. Salim, Mark Sanderson
Crowdsourced annotation is vital to both collecting labelled data to train and test automated content moderation systems and to support human-in-the-loop review of system decisions. However, annotation tasks such as judging hate speech are subjective and thus highly sensitive to biases stemming from annotator beliefs, characteristics and demographics. We conduct two crowdsourcing studies on Mechanical Turk to examine annotator bias in labelling sexist and misogynistic hate speech. Results from 109 annotators show that annotator political inclination, moral integrity, personality traits, and sexist attitudes significantly impact annotation accuracy and the tendency to tag content as hate speech. In addition, semi-structured interviews with nine crowd workers provide further insights regarding the influence of subjectivity on annotations. In exploring how workers interpret a task — shaped by complex negotiations between platform structures, task instructions, subjective motivations, and external contextual factors — we see annotations not only impacted by worker factors but also simultaneously shaped by the structures under which they labour.
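The study's actual statistical analysis is not reproduced here; as a minimal sketch of how worker attributes might be related to labelling accuracy, the example below fits a logistic regression on synthetic annotator covariates (all variables and effect sizes are made up for illustration).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(7)
n = 500                                   # hypothetical worker-tweet judgments

# Synthetic worker covariates; in the study these would come from pre-task surveys.
political_lean = rng.normal(size=n)
sexism_score = rng.normal(size=n)
experience = rng.integers(0, 10, size=n)

# Simulate correctness so that higher sexism scores lower accuracy (illustration only).
logit = 1.0 - 0.8 * sexism_score + 0.05 * experience
correct = rng.random(n) < 1.0 / (1.0 + np.exp(-logit))

X = np.column_stack([political_lean, sexism_score, experience])
model = LogisticRegression().fit(X, correct)
for name, coef in zip(["political_lean", "sexism_score", "experience"], model.coef_[0]):
    print(f"{name:15s} coefficient = {coef:+.2f}")
```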
Citations: 0
Task as Context: A Sensemaking Perspective on Annotating Inter-Dependent Event Attributes with Non-Experts
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27550
Tianyi Li, Ping Wang, Tian Shi, Yali Bian, Andy Esakia
This paper explores the application of sensemaking theory to support non-expert crowds in intricate data annotation tasks. We investigate the influence of procedural context and data context on the annotation quality of novice crowds, defining procedural context as completing multiple related annotation tasks on the same data point, and data context as annotating multiple data points with semantic relevance. We conducted a controlled experiment involving 140 non-expert crowd workers, who generated 1400 event annotations across various procedural and data context levels. Assessments of annotations demonstrate that high procedural context positively impacts annotation quality, although this effect diminishes with lower data context. Notably, assigning multiple related tasks to novice annotators yields comparable quality to expert annotations, without costing additional time or effort. We discuss the trade-offs associated with procedural and data contexts and draw design implications for engaging non-experts in crowdsourcing complex annotation tasks.
Citations: 0
A Taxonomy of Human and ML Strengths in Decision-Making to Investigate Human-ML Complementarity
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27554
Charvi Rastogi, Liu Leqi, Kenneth Holstein, Hoda Heidari
Hybrid human-ML systems increasingly make consequential decisions in a wide range of domains. These systems are often introduced with the expectation that the combined human-ML system will achieve complementary performance, that is, the combined decision-making system will be an improvement compared with either decision-making agent in isolation. However, empirical results have been mixed, and existing research rarely articulates the sources and mechanisms by which complementary performance is expected to arise. Our goal in this work is to provide conceptual tools to advance the way researchers reason and communicate about human-ML complementarity. Drawing upon prior literature in human psychology, machine learning, and human-computer interaction, we propose a taxonomy characterizing distinct ways in which human and ML-based decision-making can differ. In doing so, we conceptually map potential mechanisms by which combining human and ML decision-making may yield complementary performance, developing a language for the research community to reason about design of hybrid systems in any decision-making domain. To illustrate how our taxonomy can be used to investigate complementarity, we provide a mathematical aggregation framework to examine enabling conditions for complementarity. Through synthetic simulations, we demonstrate how this framework can be used to explore specific aspects of our taxonomy and shed light on the optimal mechanisms for combining human-ML judgments.
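The paper's aggregation framework is only named above; the sketch below shows the basic complementarity check in a toy setting: two decision-makers with complementary strengths are combined by an instance-dependent rule, and the combined accuracy is compared against each agent alone. The setup and numbers are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 5000
truth = rng.integers(0, 2, size=n)

# Hypothetical decision-makers with complementary strengths: the model is accurate on
# one half of the instances, the human on the other half.
context = rng.random(n) < 0.5
model_acc = np.where(context, 0.90, 0.60)
human_acc = np.where(context, 0.60, 0.90)
model_pred = np.where(rng.random(n) < model_acc, truth, 1 - truth)
human_pred = np.where(rng.random(n) < human_acc, truth, 1 - truth)

# Instance-dependent aggregation: defer to whichever agent is stronger in this context
# (assumes the context feature is observable when the decision is made).
combined = np.where(context, model_pred, human_pred)

for name, pred in [("model", model_pred), ("human", human_pred), ("combined", combined)]:
    print(f"{name:9s} accuracy = {(pred == truth).mean():.3f}")
```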
Citations: 0
Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees
Pub Date : 2023-11-03 DOI: 10.1609/hcomp.v11i1.27545
Yi Chen, Ramya Korlakai Vinayak, Babak Hassibi
We consider the problem of clustering n items into K disjoint clusters using noisy answers from crowdsourced workers to pairwise queries of the type: “Are items i and j from the same cluster?” We propose a novel, practical, simple, and computationally efficient active querying algorithm for crowdsourced clustering. Furthermore, our algorithm does not require knowledge of unknown problem parameters. We show that our algorithm succeeds in recovering the clusters when the crowdworkers provide answers with an error probability less than 1/2, and we provide sample complexity bounds on the number of queries made by our algorithm to guarantee successful clustering. While the bounds depend on the error probabilities, the algorithm itself does not require this knowledge. In addition to the theoretical guarantee, we implement and deploy the proposed algorithm on a real crowdsourcing platform to characterize its performance in real-world settings. Based on both the theoretical and the empirical results, we observe that while the total number of queries made by the active clustering algorithm is order-wise better than random querying, the advantage applies most conspicuously when the datasets have small clusters. For datasets with large enough clusters, passive querying can often be more efficient in practice. Our observations and practically implementable active clustering algorithm can inform and aid the design of real-world crowdsourced clustering systems. We make the dataset collected through this work publicly available (and the code to run such experiments).
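The algorithm and its guarantees are not restated on this page; the following is a rough sketch of active querying with noisy pairwise answers under simplifying assumptions: each new item is compared against one representative per existing cluster, and each comparison repeats the crowd query a few times to outvote the noise.

```python
import numpy as np

rng = np.random.default_rng(11)
true_labels = rng.integers(0, 3, size=60)        # hidden ground-truth clusters
p_error = 0.3                                    # each crowd answer flips with this probability

def query(i, j):
    """Noisy crowd answer to: are items i and j in the same cluster?"""
    same = bool(true_labels[i] == true_labels[j])
    return same if rng.random() > p_error else not same

clusters, n_queries = [], 0
for item in range(len(true_labels)):
    placed = False
    for members in clusters:
        votes = [query(item, members[0]) for _ in range(5)]   # repeat to outvote the noise
        n_queries += len(votes)
        if sum(votes) > len(votes) / 2:
            members.append(item)
            placed = True
            break
    if not placed:
        clusters.append([item])

print(f"recovered {len(clusters)} clusters using {n_queries} pairwise queries")
```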
Citations: 0