Data distribution tailoring revisited: cost-efficient integration of representative data

The VLDB Journal, published 2024-04-12, DOI: 10.1007/s00778-024-00849-w

Jiwon Chang, Bohan Cui, Fatemeh Nargesian, Abolfazl Asudeh, H. V. Jagadish


Citations: 0



Data distribution tailoring revisited: cost-efficient integration of representative data

Data scientists often develop data sets for analysis by drawing upon available data sources. A major challenge is ensuring that the data set used for analysis adequately represents relevant demographic groups or other variables. Whether data is obtained from an experiment or a data provider, a single data source may not meet the desired distribution requirements. Therefore, combining data from multiple sources is often necessary. The data distribution tailoring (DT) problem aims to cost-efficiently collect a unified data set from multiple sources. In this paper, we present major optimizations and generalizations to previous algorithms for this problem. When the group distributions in the sources are known, we present a novel algorithm, RatioColl, that outperforms the existing algorithm based on the coupon collector’s problem. When the distributions are unknown, we propose multi-armed-bandit algorithms with a decaying exploration rate that, unlike the existing algorithm for unknown DT, do not require prior information. Through theoretical analysis and extensive experiments, we demonstrate the effectiveness of our proposed algorithms.
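
The abstract does not give algorithmic details, so the following is only a minimal sketch of the general idea behind a decaying-exploration bandit for the unknown-distribution setting, not the authors' method. It assumes an epsilon-greedy rule with exploration rate 1/sqrt(t), one record returned per query, and a fixed per-query cost for each source. The names collect_with_decaying_exploration and make_source, and the two toy sources, are hypothetical.

```python
import random


def collect_with_decaying_exploration(sources, costs, need, seed=0, max_queries=100_000):
    """Epsilon-greedy sketch with exploration rate 1/sqrt(t) (an assumed schedule).

    sources: list of callables; sources[i](rng) returns the group label of one record.
    costs:   cost of a single query to each source.
    need:    dict mapping group label -> number of records required.
    """
    rng = random.Random(seed)
    need = dict(need)  # copy so the caller's dict is left untouched
    seen = [dict() for _ in sources]  # seen[i][g]: records of group g returned by source i
    pulls = [0] * len(sources)
    collected = {g: 0 for g in need}
    total_cost = 0.0

    def score(i):
        # Estimated chance that one query to source i yields a still-needed group, per unit cost.
        if pulls[i] == 0:
            return float("inf")  # make sure every source is tried at least once
        useful = sum(n for g, n in seen[i].items() if need.get(g, 0) > 0)
        return useful / pulls[i] / costs[i]

    for t in range(1, max_queries + 1):
        if all(n <= 0 for n in need.values()):
            break
        epsilon = 1.0 / t ** 0.5  # exploration probability shrinks as evidence accumulates
        if rng.random() < epsilon:
            i = rng.randrange(len(sources))  # explore: pick a source uniformly at random
        else:
            i = max(range(len(sources)), key=score)  # exploit: best estimated source
        g = sources[i](rng)
        pulls[i] += 1
        total_cost += costs[i]
        seen[i][g] = seen[i].get(g, 0) + 1
        if need.get(g, 0) > 0:  # keep the record only if its group is still needed
            need[g] -= 1
            collected[g] += 1
    return collected, total_cost


def make_source(p_a):
    # Hypothetical toy source: returns group "a" with probability p_a, otherwise "b".
    return lambda rng: "a" if rng.random() < p_a else "b"


if __name__ == "__main__":
    result = collect_with_decaying_exploration(
        [make_source(0.9), make_source(0.1)], [1.0, 1.0], {"a": 20, "b": 20}
    )
    print(result)
```

The sketch needs no prior estimate of the group distributions: it starts by exploring almost uniformly and relies increasingly on observed frequencies as queries accumulate.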

