扩展大数据缺失价值估算:皮提亚vs哥斯拉

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining Pub Date : 2014-08-24 DOI:10.1145/2623330.2623615

C. Anagnostopoulos, P. Triantafillou

{"title":"扩展大数据缺失价值估算:皮提亚vs哥斯拉","authors":"C. Anagnostopoulos, P. Triantafillou","doi":"10.1145/2623330.2623615","DOIUrl":null,"url":null,"abstract":"Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine (`Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamentals questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcase our efficiency, scalability, and accuracy claims.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"49 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"21","resultStr":"{\"title\":\"Scaling out big data missing value imputations: pythia vs. godzilla\",\"authors\":\"C. Anagnostopoulos, P. Triantafillou\",\"doi\":\"10.1145/2623330.2623615\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine (`Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamentals questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcase our efficiency, scalability, and accuracy claims.\",\"PeriodicalId\":20536,\"journal\":{\"name\":\"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"volume\":\"49 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-08-24\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"21\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/2623330.2623615\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/2623330.2623615","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 21

摘要

众所周知，在大数据环境中解决具有小估计误差的缺失值(MV)问题是一项耗费资源的任务。随着数据集及其用户社区的不断增长，这个问题只会加剧。假设有可能有一台机器(“哥斯拉”)，它可以存储大量数据集并支持不断增长的社区提交MV插入请求。是否有可能通过使用大量队列机器来取代Godzilla，以便更快地执行推算，并行地参与队列，每个队列访问原始数据集的更小分区?如果是这样，出于明显的性能原因，最好是每个imputation只访问所有队列的一个子集。在这种情况下，我们能否迅速决定哪一个是每个imputation所需的队列子集?但效率和可扩展性只是一个关键问题!在保证与哥斯拉的估算误差相当甚至更好的情况下，是否有可能做到以上这些?在本文中，我们得出了这些基本问题的答案，并开发了原则性的方法和框架，这些方法和框架提供了较大的性能加速和更好的或与哥斯拉相当的误差，而与使用的缺失值插入算法无关。我们的贡献涉及Pythia，这是一个框架和算法，用于提供上述问题的答案，并用于每个MV imputation请求参与适当的队列子集。Pythia功能基于两个支柱:(i)数据集(分区)签名，每个队列一个;(ii)相似性概念和算法，可以识别要参与的队列的适当子集。对真实和合成数据集的全面实验展示了我们的效率、可扩展性和准确性。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Scaling out big data missing value imputations: pythia vs. godzilla

Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine (`Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamentals questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcase our efficiency, scalability, and accuracy claims.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

自引率

0.00%

发文量