Crowdsourced Clustering via Active Querying: Practical Algorithm with Theoretical Guarantees

Proceedings of the ... AAAI Conference on Human Computation and Crowdsourcing. Pub Date: 2023-11-03. DOI: 10.1609/hcomp.v11i1.27545

Yi Chen, Ramya Korlakai Vinayak, Babak Hassibi


Citations: 0

Abstract



We consider the problem of clustering n items into K disjoint clusters using noisy answers from crowdsourced workers to pairwise queries of the type: “Are items i and j from the same cluster?” We propose a novel, practical, simple, and computationally efficient active querying algorithm for crowdsourced clustering. Furthermore, our algorithm does not require knowledge of unknown problem parameters. We show that our algorithm succeeds in recovering the clusters when the crowdworkers provide answers with an error probability less than 1/2 and provide sample complexity bounds on the number of queries made by our algorithm to guarantee successful clustering. While the bounds depend on the error probabilities, the algorithm itself does not require this knowledge. In addition to the theoretical guarantee, we implement and deploy the proposed algorithm on a real crowdsourcing platform to characterize its performance in real-world settings. Based on both the theoretical and the empirical results, we observe that while the total number of queries made by the active clustering algorithm is order-wise better than random querying, the advantage applies most conspicuously when the datasets have small clusters. For datasets with large enough clusters, passive querying can often be more efficient in practice. Our observations and practically implementable active clustering algorithm can inform and aid the design of real-world crowdsourced clustering systems. We make the dataset collected through this work publicly available (and the code to run such experiments).
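To make the query model in the abstract concrete, the following is a minimal Python sketch of clustering with noisy pairwise "same cluster?" answers. It is not the authors' algorithm: it is a generic greedy baseline that compares each item against one representative per discovered cluster and takes a majority vote over repeated answers. The function names (`make_noisy_oracle`, `majority_same`, `greedy_active_cluster`), the single shared error probability, the independence of repeated answers, and the fixed `repeats` value are all assumptions made for illustration. Unlike the algorithm described in the abstract, this sketch needs `repeats` to be chosen by hand.

```python
import random


def make_noisy_oracle(labels, error_prob, rng):
    """Simulated crowdworker: answers 'same cluster?' and is wrong
    with probability error_prob (assumed identical for every query)."""
    def ask(i, j):
        truth = labels[i] == labels[j]
        return truth if rng.random() >= error_prob else not truth
    return ask


def majority_same(ask, i, j, repeats):
    """Ask the same pair `repeats` times (use an odd number)
    and return the majority answer."""
    yes_votes = sum(ask(i, j) for _ in range(repeats))
    return 2 * yes_votes > repeats


def greedy_active_cluster(n, ask, repeats):
    """Assign each item to the first cluster whose representative
    matches it by majority vote; otherwise open a new cluster.
    Returns (assignment, total number of worker answers used)."""
    representatives = []        # one item index per discovered cluster
    assignment = [None] * n
    queries = 0
    for item in range(n):
        for cluster_id, rep in enumerate(representatives):
            queries += repeats
            if majority_same(ask, item, rep, repeats):
                assignment[item] = cluster_id
                break
        else:
            # No representative matched: the item starts a new cluster.
            representatives.append(item)
            assignment[item] = len(representatives) - 1
    return assignment, queries


# Small demonstration with hypothetical ground-truth labels.
rng = random.Random(0)
true_labels = [0, 1, 0, 2, 1, 0, 2, 2]
oracle = make_noisy_oracle(true_labels, error_prob=0.2, rng=rng)
clusters, used = greedy_active_cluster(len(true_labels), oracle, repeats=15)
print("assignment:", clusters)
print("worker answers used:", used)
```

In this sketch each item is compared with at most K representatives, so the number of distinct pairs queried is at most nK rather than the n(n-1)/2 pairs that exist in total. The price is that every comparison must be repeated often enough for the majority vote to be reliable, which is one informal way to see why the benefit of active querying depends on the cluster structure and the error probability.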


Source journal

Proceedings of the ... AAAI Conference on Human Computation and Crowdsourcing

Self-citation rate

0.00%

Number of papers published