
Latest Publications: Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining

Towards scalable critical alert mining
Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, H. Çam, Jiawei Han, Xifeng Yan
Performance monitoring software for data centers typically generates a great number of alert sequences, which indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts, those that are potentially the causes of others. While the need for mining critical alerts over large-scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show that a faster approximation exists when the alerts follow a certain causal structure. Second, we propose two fast heuristic algorithms based on tree-sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of the solution quality and are up to 5,000 times faster than their approximation counterparts.
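The selection objective in the abstract is a coverage-style problem, so a natural baseline is greedy max-coverage over the mined causal graph: treat each alert's set of reachable descendants as its coverage and pick k alerts greedily. The sketch below (with hypothetical alert names) illustrates only that problem setup, not the paper's near-optimal approximation or tree-sampling heuristics.

```python
# Minimal greedy max-coverage sketch over mined causal edges; a generic
# (1 - 1/e) greedy baseline, not the paper's algorithm. Alert names are
# hypothetical.
from collections import defaultdict

def descendants(graph, src):
    """All alerts reachable from src via causal edges (iterative DFS)."""
    seen, stack = set(), [src]
    while stack:
        node = stack.pop()
        for nxt in graph.get(node, ()):
            if nxt not in seen:
                seen.add(nxt)
                stack.append(nxt)
    return seen

def top_k_critical(edges, k):
    graph = defaultdict(list)
    for cause, effect in edges:
        graph[cause].append(effect)
    cover = {a: descendants(graph, a) for a in graph}
    chosen, covered = [], set()
    for _ in range(k):
        # pick the alert that triggers the most not-yet-covered alerts
        best = max(cover, key=lambda a: len(cover[a] - covered), default=None)
        if best is None or not (cover[best] - covered):
            break
        chosen.append(best)
        covered |= cover.pop(best)
    return chosen, covered

edges = [("disk_io", "db_slow"), ("db_slow", "api_timeout"),
         ("db_slow", "queue_backlog"), ("net_flap", "api_timeout")]
print(top_k_critical(edges, k=1))   # 'disk_io' covers 3 downstream alerts
```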
{"title":"Towards scalable critical alert mining","authors":"Bo Zong, Yinghui Wu, Jie Song, Ambuj K. Singh, H. Çam, Jiawei Han, Xifeng Yan","doi":"10.1145/2623330.2623729","DOIUrl":"https://doi.org/10.1145/2623330.2623729","url":null,"abstract":"Performance monitor software for data centers typically generates a great number of alert sequences. These alert sequences indicate abnormal network events. Given a set of observed alert sequences, it is important to identify the most critical alerts that are potentially the causes of others. While the need for mining critical alerts over large scale alert sequences is evident, most alert analysis techniques stop at modeling and mining the causal relations among the alerts. This paper studies the critical alert mining problem: Given a set of alert sequences, we aim to find a set of k critical alerts such that the number of alerts potentially triggered by them is maximized. We show that the problem is intractable; therefore, we resort to approximation and heuristic algorithms. First, we develop an approximation algorithm that obtains a near-optimal alert set in quadratic time, and propose pruning techniques to improve its runtime performance. Moreover, we show a faster approximation exists, when the alerts follow certain causal structure. Second, we propose two fast heuristic algorithms based on tree sampling techniques. On real-life data, these algorithms identify a critical alert from up to 270,000 mined causal relations in 5 seconds; meanwhile, they preserve more than 80% of solution quality, and are up to 5,000 times faster than their approximation counterparts.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"79931152","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 13
Leveraging user libraries to bootstrap collaborative filtering
Laurent Charlin, R. Zemel, H. Larochelle
We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents. CSTM's chief novelty lies in its learned model of individual libraries, or sets of documents, associated with each user. Overall, CSTM is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and the items. Creating a generative description of scores and the text allows CSTM to perform well in a wide variety of data regimes, smoothly combining the side information with observed ratings as the number of ratings available for a given user ranges from none to many. Experiments on real-world datasets demonstrate CSTM's performance. We further demonstrate its utility in an application for personal recommendations of posters which we deployed at the NIPS 2013 conference.
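CSTM itself is a joint directed graphical model; as a rough illustration of how library text can stand in for missing ratings, the sketch below runs a decoupled two-stage pipeline instead: scikit-learn LDA topics over user libraries and item text, then a bilinear topic-affinity score. The data and the pipeline are illustrative assumptions, not the authors' model.

```python
# A two-stage stand-in for the idea of scoring items by the topical match
# with a user's library. CSTM learns this jointly; here LDA and the dot
# product are decoupled, toy substitutes.
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

library = {"u1": ["graph mining on large networks",
                  "community detection in graphs"],
           "u2": ["neural topic models for text",
                  "word embeddings and language"]}
items = ["scalable graph clustering", "deep language models"]

docs = [" ".join(texts) for texts in library.values()] + items
X = CountVectorizer().fit_transform(docs)
lda = LatentDirichletAllocation(n_components=2, random_state=0)
theta = lda.fit_transform(X)            # topic mixtures: 2 libraries + 2 items
users, item_vecs = theta[:2], theta[2:]

# With no observed ratings (cold start), the score falls back purely on the
# textual side information, mirroring the regime the abstract highlights.
for u, uv in zip(library, users):
    for name, iv in zip(items, item_vecs):
        print(u, "->", name, round(float(uv @ iv), 3))
```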
{"title":"Leveraging user libraries to bootstrap collaborative filtering","authors":"Laurent Charlin, R. Zemel, H. Larochelle","doi":"10.1145/2623330.2623663","DOIUrl":"https://doi.org/10.1145/2623330.2623663","url":null,"abstract":"We introduce a novel graphical model, the collaborative score topic model (CSTM), for personal recommendations of textual documents. CSTM's chief novelty lies in its learned model of individual libraries, or sets of documents, associated with each user. Overall, CSTM is a joint directed probabilistic model of user-item scores (ratings), and the textual side information in the user libraries and the items. Creating a generative description of scores and the text allows CSTM to perform well in a wide variety of data regimes, smoothly combining the side information with observed ratings as the number of ratings available for a given user ranges from none to many. Experiments on real-world datasets demonstrate CSTM's performance. We further demonstrate its utility in an application for personal recommendations of posters which we deployed at the NIPS 2013 conference.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"80278453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
A hazard based approach to user return time prediction
Komal Kapoor, Mingxuan Sun, J. Srivastava, Tao Ye
In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is to allow free access to most content and generate revenue through advertising. This model depends on securing user time on a site rather than on the purchase of goods, which makes it crucially important to create new kinds of metrics and solutions for the growth and retention efforts of web services. In this work, we address this problem by proposing a new retention metric for web services that concentrates on the rate of user return. We further apply predictive analysis to the proposed retention metric on a service, as a means for characterizing lost customers. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on the Cox proportional hazards model from survival analysis. The hazard-based approach offers several benefits, including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates in the model. We compare the performance of our hazard-based model against several baseline regression and classification methods, both in predicting user return time and in categorizing users into buckets based on their predicted return time, and find the hazard-based approach to be superior.
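A minimal sketch of the core technique the abstract names, Cox proportional hazards on possibly censored return times, using the `lifelines` library; the library choice, covariates, and toy data are assumptions, not the paper's setup. Rows with returned = 0 are users who had not come back by the end of the observation window, which is exactly the data a hazard model can still use.

```python
# A minimal sketch, assuming the `lifelines` library (not named by the paper):
# fit a Cox proportional hazards model to user return times.
import pandas as pd
from lifelines import CoxPHFitter

df = pd.DataFrame({
    "days_to_return": [1, 3, 7, 14, 30, 30],   # observed gap, or time watched so far
    "returned":       [1, 1, 1, 1, 0, 0],      # 0 = censored (no return seen yet)
    "past_visits":    [40, 5, 22, 9, 2, 6],    # illustrative covariates
    "session_min":    [35, 30, 6, 12, 5, 9],
})

cph = CoxPHFitter(penalizer=0.1)               # small penalty stabilises tiny data
cph.fit(df, duration_col="days_to_return", event_col="returned")
cph.print_summary()                            # hazard ratio per covariate

# Predicted median return time per user profile, which can then be bucketed.
covariates = df.drop(columns=["days_to_return", "returned"])
print(cph.predict_median(covariates))
```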
{"title":"A hazard based approach to user return time prediction","authors":"Komal Kapoor, Mingxuan Sun, J. Srivastava, Tao Ye","doi":"10.1145/2623330.2623348","DOIUrl":"https://doi.org/10.1145/2623330.2623348","url":null,"abstract":"In the competitive environment of the internet, retaining and growing one's user base is of major concern to most web services. Furthermore, the economic model of many web services is allowing free access to most content, and generating revenue through advertising. This unique model requires securing user time on a site rather than the purchase of good which makes it crucially important to create new kinds of metrics and solutions for growth and retention efforts for web services. In this work, we address this problem by proposing a new retention metric for web services by concentrating on the rate of user return. We further apply predictive analysis to the proposed retention metric on a service, as a means for characterizing lost customers. Finally, we set up a simple yet effective framework to evaluate a multitude of factors that contribute to user return. Specifically, we define the problem of return time prediction for free web services. Our solution is based on the Cox's proportional hazard model from survival analysis. The hazard based approach offers several benefits including the ability to work with censored data, to model the dynamics in user return rates, and to easily incorporate different types of covariates in the model. We compare the performance of our hazard based model in predicting the user return time and in categorizing users into buckets based on their predicted return time, against several baseline regression and classification methods and find the hazard based approach to be superior.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"30 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82703924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
A Dirichlet multinomial mixture model-based approach for short text clustering
Jianhua Yin, Jianyong Wang
Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we propose a collapsed Gibbs sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbreviated GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and that it converges fast. GSDMM can also cope with the sparse and high-dimensional nature of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM achieves significantly better performance than three other clustering models.
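A toy re-implementation of a collapsed Gibbs sampler for the Dirichlet Multinomial Mixture, written from the model's standard conditional rather than the authors' code: each short text sits in exactly one cluster, and because clusters can empty out during sampling, the effective number of clusters is inferred automatically, matching the behavior the abstract describes.

```python
# Collapsed Gibbs sampling for the Dirichlet Multinomial Mixture (GSDMM idea).
import random
from collections import Counter, defaultdict

def gsdmm(docs, K=6, alpha=0.1, beta=0.1, iters=20, seed=0):
    rng = random.Random(seed)
    V = len({w for d in docs for w in d})        # vocabulary size
    z = [rng.randrange(K) for _ in docs]         # one cluster per document
    m, n, nw = Counter(z), Counter(), defaultdict(Counter)
    for d, k in zip(docs, z):
        n[k] += len(d)
        nw[k].update(d)
    for _ in range(iters):
        for i, d in enumerate(docs):
            k = z[i]                             # remove doc i from its cluster
            m[k] -= 1
            n[k] -= len(d)
            for w in d:
                nw[k][w] -= 1
            weights = []
            for c in range(K):                   # collapsed conditional p(z_i = c | rest)
                p, seen = m[c] + alpha, Counter()
                for t, w in enumerate(d):
                    p *= (nw[c][w] + beta + seen[w]) / (n[c] + V * beta + t)
                    seen[w] += 1
                weights.append(p)
            k = rng.choices(range(K), weights=weights)[0]
            z[i] = k                             # reinsert into the sampled cluster
            m[k] += 1
            n[k] += len(d)
            nw[k].update(d)
    return z

docs = [["cheap", "flight", "deal"], ["flight", "ticket", "deal"],
        ["nba", "finals", "game"], ["nba", "game", "score"]]
print(gsdmm(docs))   # clusters empty out; typically ~2 distinct labels remain
```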
{"title":"A dirichlet multinomial mixture model-based approach for short text clustering","authors":"Jianhua Yin, Jianyong Wang","doi":"10.1145/2623330.2623715","DOIUrl":"https://doi.org/10.1145/2623330.2623715","url":null,"abstract":"Short text clustering has become an increasingly important task with the popularity of social media like Twitter, Google+, and Facebook. It is a challenging problem due to its sparse, high-dimensional, and large-volume characteristics. In this paper, we proposed a collapsed Gibbs Sampling algorithm for the Dirichlet Multinomial Mixture model for short text clustering (abbr. to GSDMM). We found that GSDMM can infer the number of clusters automatically with a good balance between the completeness and homogeneity of the clustering results, and is fast to converge. GSDMM can also cope with the sparse and high-dimensional problem of short texts, and can obtain the representative words of each cluster. Our extensive experimental study shows that GSDMM can achieve significantly better performance than three other clustering models.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87623284","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 443
Frontiers in E-commerce personalization
S. Subramaniam
E-commerce has largely been a 'Pull' model to date. Offline retailers have nailed discovery, delight, serendipity, and impulse purchases in person with greater success than online commerce sites. However, in an always-on, mobile-first world, companies like Groupon have the opportunity to push the frontier even further than offline retailers or comprehensive sites due to the fact that our smartphones are always with us. The challenge is to provide the right deals to the right user at the right time. That involves learning about the users and their locations, their personal preferences, and predicting which deals are likely to delight them, presenting diversity, discovery and engaging UX to gather user preferences and semantic graph approaches for user-deal matching. This presentation will give insight into how Groupon manages to grapple with these challenges via a data-driven system in order to delight and surprise customers.
{"title":"Frontiers in E-commerce personalization","authors":"S. Subramaniam","doi":"10.1145/2623330.2630820","DOIUrl":"https://doi.org/10.1145/2623330.2630820","url":null,"abstract":"E-commerce has largely been a 'Pull' model to date. Offline retailers have nailed discovery, delight, serendipity, and impulse purchases in person with greater success than online commerce sites. However, in an always-on, mobile-first world, companies like Groupon have the opportunity to push the frontier even further than offline retailers or comprehensive sites due to the fact that our smartphones are always with us. The challenge is to provide the right deals to the right user at the right time. That involves learning about the users and their locations, their personal preferences, and predicting which deals are likely to delight them, presenting diversity, discovery and engaging UX to gather user preferences and semantic graph approaches for user-deal matching. This presentation will give insight into how Groupon manages to grapple with these challenges via a data-driven system in order to delight and surprise customers.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"32 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"87944007","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 6
Improved testing of low rank matrices
Yi Li, Zhengyu Wang, David P. Woodruff
We study the problem of determining if an input matrix A ∈ R^{m×n} can be well-approximated by a low rank matrix. Specifically, we study the problem of quickly estimating the rank or stable rank of A, the latter often providing a more robust measure of the rank. Since we seek significantly sublinear time algorithms, we cast these problems in the property testing framework. In this framework, A either has low rank or stable rank, or is far from having this property. The algorithm should read only a small number of entries or rows of A and decide which case A is in with high probability. If neither case occurs, the output is allowed to be arbitrary. We consider two notions of being far: (1) A requires changing at least an ε-fraction of its entries, or (2) A requires changing at least an ε-fraction of its rows. We call the former the "entry model" and the latter the "row model". We show: For testing if a matrix has rank at most d in the entry model, we improve the previous number of entries of A that need to be read from O(d^2/ε^2) (Krauthgamer and Sasson, SODA 2003) to O(d^2/ε). Our algorithm is the first to adaptively query the entries of A, which for constant d we show is necessary to achieve O(1/ε) queries. For the important case of d = 1 we also give a new non-adaptive algorithm, improving the previous O(1/ε^2) queries to O(log^2(1/ε)/ε). For testing if a matrix has rank at most d in the row model, we prove an Ω(d/ε) lower bound on the number of rows that need to be read, even for adaptive algorithms. Our lower bound matches a non-adaptive upper bound of Krauthgamer and Sasson. For testing if a matrix has stable rank at most d in the row model, or requires changing an ε/d-fraction of its rows in order to have stable rank at most d, we prove that reading Θ(d/ε^2) rows is necessary and sufficient. We also give an empirical evaluation of our rank and stable rank algorithms on real and synthetic datasets.
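The entry-model setting can be illustrated with the classic non-adaptive test: sample random (d+1) x (d+1) submatrices and look for a full-rank witness, which certifies rank greater than d. The sketch below shows only that setup; the paper's contribution is an adaptive algorithm with fewer queries, which this does not implement, and the trial count here is a heuristic assumption.

```python
# Property-testing sketch for "rank(A) <= d" in the entry model: any
# full-rank (d+1) x (d+1) submatrix is a witness that rank(A) > d.
import numpy as np

def test_rank_at_most(A, d, eps=0.1, seed=None, trials=None):
    rng = np.random.default_rng(seed)
    m, n = A.shape
    trials = trials or int(10 / eps)           # heuristic, not the paper's bound
    for _ in range(trials):
        rows = rng.choice(m, size=d + 1, replace=False)
        cols = rng.choice(n, size=d + 1, replace=False)
        sub = A[np.ix_(rows, cols)]
        if np.linalg.matrix_rank(sub) == d + 1:
            return False                        # witness found: rank is > d
    return True                                 # no witness: accept "rank <= d"

rng = np.random.default_rng(0)
low = rng.standard_normal((200, 2)) @ rng.standard_normal((2, 200))  # rank 2
noisy = low + rng.standard_normal((200, 200))                        # far from rank 2
print(test_rank_at_most(low, d=2), test_rank_at_most(noisy, d=2))    # True False
```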
{"title":"Improved testing of low rank matrices","authors":"Yi Li, Zhengyu Wang, David P. Woodruff","doi":"10.1145/2623330.2623736","DOIUrl":"https://doi.org/10.1145/2623330.2623736","url":null,"abstract":"We study the problem of determining if an input matrix A εRm x n can be well-approximated by a low rank matrix. Specifically, we study the problem of quickly estimating the rank or stable rank of A, the latter often providing a more robust measure of the rank. Since we seek significantly sublinear time algorithms, we cast these problems in the property testing framework. In this framework, A either has low rank or stable rank, or is far from having this property. The algorithm should read only a small number of entries or rows of A and decide which case A is in with high probability. If neither case occurs, the output is allowed to be arbitrary. We consider two notions of being far: (1) A requires changing at least an ε-fraction of its entries, or (2) A requires changing at least an ε-fraction of its rows. We call the former the \"entry model\" and the latter the \"row model\". We show: For testing if a matrix has rank at most d in the entry model, we improve the previous number of entries of A that need to be read from O(d2/ε2) (Krauthgamer and Sasson, SODA 2003) to O(d2/ε). Our algorithm is the first to adaptively query the entries of A, which for constant d we show is necessary to achieve O(1/ε) queries. For the important case of d = 1 we also give a new non-adaptive algorithm, improving the previous O(1/ε2) queries to O(log2(1/ε) / ε). For testing if a matrix has rank at most d in the row model, we prove an Ω(d/ε) lower bound on the number of rows that need to be read, even for adaptive algorithms. Our lower bound matches a non-adaptive upper bound of Krauthgamer and Sasson. For testing if a matrix has stable rank at most d in the row model or requires changing an ε/d-fraction of its rows in order to have stable rank at most d, we prove that reading θ(d/ε2) rows is necessary and sufficient. We also give an empirical evaluation of our rank and stable rank algorithms on real and synthetic datasets.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"78 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"83926533","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 11
Scaling out big data missing value imputations: pythia vs. godzilla
C. Anagnostopoulos, P. Triantafillou
Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine ('Godzilla') which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring imputation estimation errors comparable to, or even better than, Godzilla's? In this paper we derive answers to these fundamental questions and develop principled methods and a framework which offer large performance speed-ups and errors better than, or comparable to, Godzilla's, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcases our efficiency, scalability, and accuracy claims.
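A hedged illustration of the two pillars the abstract names: per-cohort dataset signatures, plus a similarity rule for choosing which cohorts to engage per imputation request. The signature used here (column means of a partition) and the masked Euclidean similarity are stand-ins; the abstract does not specify Pythia's actual signatures or similarity notions.

```python
# Stand-in for Pythia's cohort selection: summarize each cohort's partition
# with a signature, then engage the cohorts most similar to the query row,
# comparing only on the row's observed attributes.
import numpy as np

def signature(partition):
    """Per-cohort signature: column means (an illustrative summary)."""
    return np.nanmean(partition, axis=0)

def pick_cohorts(query_row, signatures, top=2):
    obs = ~np.isnan(query_row)                 # ignore the missing attributes
    def dist(cid):
        return float(np.linalg.norm(query_row[obs] - signatures[cid][obs]))
    return sorted(signatures, key=dist)[:top]

rng = np.random.default_rng(1)
partitions = {c: rng.normal(loc=c, scale=1.0, size=(100, 4)) for c in range(5)}
sigs = {c: signature(p) for c, p in partitions.items()}
query = np.array([3.1, np.nan, 2.9, 3.0])      # imputation request with one MV
print(pick_cohorts(query, sigs))               # cohorts with means near 3 engage
```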
{"title":"Scaling out big data missing value imputations: pythia vs. godzilla","authors":"C. Anagnostopoulos, P. Triantafillou","doi":"10.1145/2623330.2623615","DOIUrl":"https://doi.org/10.1145/2623330.2623615","url":null,"abstract":"Solving the missing-value (MV) problem with small estimation errors in big data environments is a notoriously resource-demanding task. As datasets and their user community continuously grow, the problem can only be exacerbated. Assume that it is possible to have a single machine (`Godzilla'), which can store the massive dataset and support an ever-growing community submitting MV imputation requests. Is it possible to replace Godzilla by employing a large number of cohort machines so that imputations can be performed much faster, engaging cohorts in parallel, each of which accesses much smaller partitions of the original dataset? If so, it would be preferable for obvious performance reasons to access only a subset of all cohorts per imputation. In this case, can we decide swiftly which is the desired subset of cohorts to engage per imputation? But efficiency and scalability is just one key concern! Is it possible to do the above while ensuring comparable or even better than Godzilla's imputation estimation errors? In this paper we derive answers to these fundamentals questions and develop principled methods and a framework which offer large performance speed-ups and better, or comparable, errors to that of Godzilla, independently of which missing-value imputation algorithm is used. Our contributions involve Pythia, a framework and algorithms for providing the answers to the above questions and for engaging the appropriate subset of cohorts per MV imputation request. Pythia functionality rests on two pillars: (i) dataset (partition) signatures, one per cohort, and (ii) similarity notions and algorithms, which can identify the appropriate subset of cohorts to engage. Comprehensive experimentation with real and synthetic datasets showcase our efficiency, scalability, and accuracy claims.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"49 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85916114","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 21
Identifying tourists from public transport commuters
Mingqiang Xue, Huayu Wu, Wei Chen, W. Ng, Gin Howe Goh
The tourism industry has become a key economic driver for Singapore. Understanding the behaviors of tourists is very important for the government and for private sectors, e.g., restaurants, hotels and advertising companies, to improve their existing services or create new business opportunities. In this joint work with Singapore's Land Transport Authority (LTA), we innovatively apply machine learning techniques to identify tourists among public commuters using the public transportation data provided by LTA. On successful identification, the travelling patterns of tourists are revealed, allowing further analyses to be carried out, such as on their favorite destinations and regions of stay. Technically, we model tourist identification as a classification problem, and design an iterative learning algorithm to perform inference with limited prior knowledge and labeled data. We show the superiority of our algorithm through performance evaluation and comparison with other state-of-the-art learning algorithms. Further, we build an interactive web-based system for answering queries regarding the moving patterns of tourists, which can be used by stakeholders to gain insight into tourists' travelling behaviors in Singapore.
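The abstract describes an iterative learning algorithm with limited labeled data; a generic stand-in for that pattern is self-training, sketched below with scikit-learn on synthetic travel-statistics features. The features, thresholds, and loop are illustrative assumptions, not the paper's algorithm or data.

```python
# Generic self-training loop: fit on the labeled seed, pseudo-label the most
# confident unlabeled cards, refit, repeat. Features are hypothetical.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
# toy features per transit card: [trips_per_day, distinct_stations, peak_hour_ratio]
commuters = rng.normal([2.2, 3.0, 0.8], 0.3, size=(300, 3))
tourists  = rng.normal([5.0, 9.0, 0.2], 0.5, size=(60, 3))
X = np.vstack([commuters, tourists])
y = np.full(len(X), -1)                   # -1 = unlabeled
y[:10], y[300:305] = 0, 1                 # tiny labeled seed set

clf = LogisticRegression()
for _ in range(5):
    labeled = y != -1
    clf.fit(X[labeled], y[labeled])
    proba = clf.predict_proba(X)[:, 1]
    confident = (~labeled) & ((proba > 0.95) | (proba < 0.05))
    y[confident] = (proba[confident] > 0.5).astype(int)

print("flagged tourists:", int((clf.predict(X) == 1).sum()))  # roughly 60
```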
{"title":"Identifying tourists from public transport commuters","authors":"Mingqiang Xue, Huayu Wu, Wei Chen, W. Ng, Gin Howe Goh","doi":"10.1145/2623330.2623352","DOIUrl":"https://doi.org/10.1145/2623330.2623352","url":null,"abstract":"Tourism industry has become a key economic driver for Singapore. Understanding the behaviors of tourists is very important for the government and private sectors, e.g., restaurants, hotels and advertising companies, to improve their existing services or create new business opportunities. In this joint work with Singapore's Land Transport Authority (LTA), we innovatively apply machine learning techniques to identity the tourists among public commuters using the public transportation data provided by LTA. On successful identification, the travelling patterns of tourists are then revealed and thus allow further analyses to be carried out such as on their favorite destinations, region of stay, etc. Technically, we model the tourists identification as a classification problem, and design an iterative learning algorithm to perform inference with limited prior knowledge and labeled data. We show the superiority of our algorithm with performance evaluation and comparison with other state-of-the-art learning algorithms. Further, we build an interactive web-based system for answering queries regarding the moving patterns of the tourists, which can be used by stakeholders to gain insight into tourists' travelling behaviors in Singapore.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"63 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86021061","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 20
Relevant overlapping subspace clusters on categorical data
Xiao He, Jing Feng, B. Konte, S. T. Mai, C. Plant
Clustering categorical data poses some unique challenges: due to the missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detecting clusters in the full-dimensional data space. Only a few methods exist for subspace clustering, and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compressing the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e., objects can be assigned to different clusters in different subspaces, and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.
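The MDL trade-off driving ROCAT can be shown in miniature: total description length = model bits + data bits, where purer clusters make the data cheaper to encode. The cost terms below are deliberately crude stand-ins for ROCAT's actual coding scheme, and the example scores full-space clusterings only, not overlapping subspace clusters.

```python
# Miniature MDL scoring for a categorical clustering: a split is accepted
# only if the bits saved on the data outweigh the extra model bits.
import math
from collections import Counter

def entropy(values):
    n = len(values)
    return -sum(c / n * math.log2(c / n) for c in Counter(values).values())

def mdl_cost(rows, assignment, k):
    """Total bits = crude model cost + per-cluster data encoding cost."""
    n, d = len(rows), len(rows[0])
    model_bits = k * d * math.log2(n + 1)        # crude parameter cost
    data_bits = 0.0
    for c in range(k):
        members = [r for r, a in zip(rows, assignment) if a == c]
        if not members:
            continue
        for j in range(d):
            data_bits += len(members) * entropy([r[j] for r in members])
    return model_bits + data_bits

rows = [("red", "suv"), ("red", "suv"), ("blue", "coupe"), ("blue", "coupe")]
print(mdl_cost(rows, [0, 0, 0, 0], k=1))   # ~12.6 bits: one impure cluster
print(mdl_cost(rows, [0, 0, 1, 1], k=2))   # ~9.3 bits: the split compresses better
```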
{"title":"Relevant overlapping subspace clusters on categorical data","authors":"Xiao He, Jing Feng, B. Konte, S. T. Mai, C. Plant","doi":"10.1145/2623330.2623652","DOIUrl":"https://doi.org/10.1145/2623330.2623652","url":null,"abstract":"Clustering categorical data poses some unique challenges: Due to missing order and spacing among the categories, selecting a suitable similarity measure is a difficult task. Many existing techniques require the user to specify input parameters which are difficult to estimate. Moreover, many techniques are limited to detect clusters in the full-dimensional data space. Only few methods exist for subspace clustering and they produce highly redundant results. Therefore, we propose ROCAT (Relevant Overlapping Subspace Clusters on Categorical Data), a novel technique based on the idea of data compression. Following the Minimum Description Length principle, ROCAT automatically detects the most relevant subspace clusters without any input parameter. The relevance of each cluster is validated by its contribution to compress the data. Optimizing the trade-off between goodness-of-fit and model complexity, ROCAT automatically determines a meaningful number of clusters to represent the data. ROCAT is especially designed to detect subspace clusters on categorical data which may overlap in objects and/or attributes; i.e. objects can be assigned to different clusters in different subspaces and attributes may contribute to different subspaces containing clusters. ROCAT naturally avoids undesired redundancy in clusters and subspaces by allowing overlap only if it improves the compression rate. Extensive experiments demonstrate the effectiveness and efficiency of our approach.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"41 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"85718260","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 9
Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS)
Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alex Smola, Jing Jiang, Chong Wang
Recommendation and review sites offer a wealth of information beyond ratings. For instance, on IMDb users leave reviews, commenting on different aspects of a movie (e.g. actors, plot, visual effects), and expressing their sentiments (positive or negative) on these aspects in their reviews. This suggests that uncovering aspects and sentiments will allow us to gain a better understanding of users, movies, and the process involved in generating ratings. The ability to answer questions such as "Does this user care more about the plot or about the special effects?" or "What is the quality of the movie in terms of acting?" helps us to understand why certain ratings are generated. This can be used to provide more meaningful recommendations. In this work we propose a probabilistic model based on collaborative filtering and topic modeling. It allows us to capture the interest distribution of users and the content distribution for movies; it provides a link between interest and relevance on a per-aspect basis and it allows us to differentiate between positive and negative sentiments on a per-aspect basis. Unlike prior work our approach is entirely unsupervised and does not require knowledge of the aspect specific ratings or genres for inference. We evaluate our model on a live copy crawled from IMDb. Our model offers superior performance by joint modeling. Moreover, we are able to address the cold start problem -- by utilizing the information inherent in reviews our model demonstrates improvement for new users and movies.
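As a back-of-envelope view of the per-aspect decomposition the abstract argues for, the toy below reads an overall rating as user aspect importances weighted by a movie's per-aspect sentiment. The aspect names and numbers are hand-picked for illustration; the real model learns all of these jointly from review text.

```python
# Toy per-aspect rating decomposition in the spirit of JMARS; all values
# are hand-picked, and the scale factor is an illustrative assumption.
import numpy as np

aspects = ["plot", "acting", "visual_effects"]
user_importance = np.array([0.6, 0.3, 0.1])    # "cares more about the plot"
movie_sentiment = np.array([0.9, 0.4, -0.2])   # per-aspect sentiment in reviews
baseline = 3.0                                 # neutral star rating

predicted = baseline + 2.0 * float(user_importance @ movie_sentiment)
print(round(predicted, 2))   # 4.28: a plot-heavy user likes this plot-strong film
```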
{"title":"Jointly modeling aspects, ratings and sentiments for movie recommendation (JMARS)","authors":"Qiming Diao, Minghui Qiu, Chao-Yuan Wu, Alex Smola, Jing Jiang, Chong Wang","doi":"10.1145/2623330.2623758","DOIUrl":"https://doi.org/10.1145/2623330.2623758","url":null,"abstract":"Recommendation and review sites offer a wealth of information beyond ratings. For instance, on IMDb users leave reviews, commenting on different aspects of a movie (e.g. actors, plot, visual effects), and expressing their sentiments (positive or negative) on these aspects in their reviews. This suggests that uncovering aspects and sentiments will allow us to gain a better understanding of users, movies, and the process involved in generating ratings. The ability to answer questions such as \"Does this user care more about the plot or about the special effects?\" or \"What is the quality of the movie in terms of acting?\" helps us to understand why certain ratings are generated. This can be used to provide more meaningful recommendations. In this work we propose a probabilistic model based on collaborative filtering and topic modeling. It allows us to capture the interest distribution of users and the content distribution for movies; it provides a link between interest and relevance on a per-aspect basis and it allows us to differentiate between positive and negative sentiments on a per-aspect basis. Unlike prior work our approach is entirely unsupervised and does not require knowledge of the aspect specific ratings or genres for inference. We evaluate our model on a live copy crawled from IMDb. Our model offers superior performance by joint modeling. Moreover, we are able to address the cold start problem -- by utilizing the information inherent in reviews our model demonstrates improvement for new users and movies.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"6 16","pages":""},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"91418298","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 425