Streaming submodular maximization: massive data summarization on the fly

Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. Pub Date: 2014-08-24. DOI: 10.1145/2623330.2623637

Ashwinkumar Badanidiyuru, Baharan Mirzasoleiman, Amin Karbasi, Andreas Krause


Citations: 327

Abstract

How can one summarize a massive data set "on the fly", i.e., without even having seen it in its entirety? In this paper, we address the problem of extracting representative elements from a large stream of data. That is, we would like to select a subset of, say, k data points from the stream that are most representative according to some objective function. Many natural notions of "representativeness" satisfy submodularity, an intuitive notion of diminishing returns. Thus, such problems can be reduced to maximizing a submodular set function subject to a cardinality constraint. Classical approaches to submodular maximization require full access to the data set. We develop the first efficient streaming algorithm with a constant-factor (1/2 − ε) approximation guarantee relative to the optimum solution, requiring only a single pass through the data and memory independent of the data size. In our experiments, we extensively evaluate the effectiveness of our approach on several applications, including training large-scale kernel methods and exemplar-based clustering, on millions of data points. We observe that our streaming method, while achieving practically the same utility value, runs about 100 times faster than previous work.
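The abstract does not spell out the algorithm, so the following is only a minimal sketch of one threshold-based, single-pass scheme of the kind it describes. It assumes a monotone submodular objective with f(∅) = 0 and keeps one candidate set per threshold on a geometric grid (1 + ε)^i. The function names, the set-coverage objective, and the toy stream are all hypothetical and chosen purely for illustration; they are not taken from the paper.

```python
import math
from typing import Callable, Dict, FrozenSet, Iterable, List

Element = FrozenSet[int]


def coverage(selected: List[Element]) -> int:
    """Toy monotone submodular objective: number of distinct items covered."""
    covered = set()
    for items in selected:
        covered |= items
    return len(covered)


def sieve_streaming(
    stream: Iterable[Element],
    f: Callable[[List[Element]], float],
    k: int,
    eps: float = 0.1,
) -> List[Element]:
    """One pass over `stream`; returns at most k elements (illustrative sketch)."""
    max_singleton = 0.0
    # Maps exponent i to the candidate set kept for threshold (1 + eps) ** i.
    sieves: Dict[int, List[Element]] = {}
    for e in stream:
        max_singleton = max(max_singleton, f([e]))
        if max_singleton <= 0:
            continue
        # Keep only thresholds v with max_singleton <= v <= 2 * k * max_singleton.
        lo = math.ceil(math.log(max_singleton, 1 + eps))
        hi = math.floor(math.log(2 * k * max_singleton, 1 + eps))
        sieves = {i: s for i, s in sieves.items() if lo <= i <= hi}
        for i in range(lo, hi + 1):
            s = sieves.setdefault(i, [])
            if len(s) >= k:
                continue
            v = (1 + eps) ** i
            f_s = f(s)
            gain = f(s + [e]) - f_s
            # Accept e if its marginal gain covers a fair share of what is
            # still missing to reach v / 2 with the remaining slots.
            if gain >= (v / 2 - f_s) / (k - len(s)):
                s.append(e)
    # Return the best candidate set over all surviving thresholds.
    return max(sieves.values(), key=f, default=[])


# Hypothetical toy stream: each element is a set of item ids it "covers".
stream = [
    frozenset({1, 2}),
    frozenset({2, 3}),
    frozenset({3, 4, 5, 6}),
    frozenset({1}),
    frozenset({7, 8, 9}),
]
summary = sieve_streaming(stream, coverage, k=2, eps=0.1)
print(summary, coverage(summary))
```

In this sketch the number of thresholds alive at any time is roughly log(2k) / log(1 + ε), and each keeps at most k elements, so the memory used depends on k and ε but not on the length of the stream. The objective is re-evaluated from scratch on every call for clarity; a practical implementation would likely cache f(S) per threshold.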



