
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining: latest articles

Representing documents through their readers
Khalid El-Arini, Min Xu, E. Fox, Carlos Guestrin
From Twitter to Facebook to Reddit, users have become accustomed to sharing the articles they read with friends or followers on their social networks. While previous work has modeled what these shared stories say about the user who shares them, the converse question remains unexplored: what can we learn about an article from the identities of its likely readers? To address this question, we model the content of news articles and blog posts by attributes of the people who are likely to share them. For example, many Twitter users describe themselves in a short profile, labeling themselves with phrases such as "vegetarian" or "liberal." By assuming that a user's labels correspond to topics in the articles he shares, we can learn a labeled dictionary from a training corpus of articles shared on Twitter. Thereafter, we can code any new document as a sparse non-negative linear combination of user labels, where we encourage correlated labels to appear together in the output via a structured sparsity penalty. Finally, we show that our approach yields a novel document representation that can be effectively used in many problem settings, from recommendation to modeling news dynamics. For example, while the top politics stories will change drastically from one month to the next, the "politics" label will still be there to describe them. We evaluate our model on millions of tweeted news articles and blog posts collected between September 2010 and September 2012, demonstrating that our approach is effective.
DOI: https://doi.org/10.1145/2487575.2487596 | Published: 2013-08-11
Citations: 17
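The coding step described in the abstract above (a new document expressed as a sparse non-negative linear combination of user labels) can be illustrated with a minimal sketch. This is not the authors' implementation: it swaps the structured sparsity penalty for a plain L1 penalty and solves the problem with projected proximal gradient descent. The function name `nonneg_sparse_code`, the toy dictionary, and every parameter value are assumptions made for illustration.

```python
from typing import List, Optional


def nonneg_sparse_code(
    x: List[float],
    dictionary: List[List[float]],
    lam: float = 0.1,
    step: Optional[float] = None,
    iters: int = 500,
) -> List[float]:
    """Minimise 0.5*||x - sum_i a_i*d_i||^2 + lam*sum_i a_i subject to a_i >= 0.

    `dictionary` holds one atom (word-weight vector) per label.
    Plain L1 is used here, NOT the structured penalty from the abstract.
    """
    k, m = len(dictionary), len(x)
    # Precompute the Gram matrix D^T D and the correlations D^T x.
    gram = [[sum(di[t] * dj[t] for t in range(m)) for dj in dictionary]
            for di in dictionary]
    dtx = [sum(d[t] * x[t] for t in range(m)) for d in dictionary]
    if step is None:
        # The max absolute row sum bounds the largest eigenvalue of the
        # symmetric Gram matrix, so 1/bound is a safe step size.
        bound = max(sum(abs(v) for v in row) for row in gram)
        step = 1.0 / bound if bound > 0 else 1.0
    a = [0.0] * k
    for _ in range(iters):
        grad = [sum(gram[i][j] * a[j] for j in range(k)) - dtx[i]
                for i in range(k)]
        # Gradient step, L1 shrinkage, and projection onto a >= 0 in one move.
        a = [max(0.0, a[i] - step * (grad[i] + lam)) for i in range(k)]
    return a


# Hypothetical labels and a toy 4-word vocabulary, for illustration only.
labels = ["politics", "sports", "vegetarian"]
atoms = [[1.0, 0.0, 0.0, 0.0], [0.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 0.0]]
document = [2.0, 0.0, 0.5, 0.0]
for label, weight in zip(labels, nonneg_sparse_code(document, atoms)):
    print(label, round(weight, 3))
```

In this toy run the "sports" label receives a weight of exactly zero, which is the sparsity effect the penalty is meant to produce; a structured penalty would additionally tie correlated labels together.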
Forex-foreteller: currency trend modeling using news articles
Fang Jin, Nathan Self, Parang Saraf, P. Butler, W. Wang, Naren Ramakrishnan
Financial markets are quite sensitive to unanticipated news and events. Identifying the effect of news on the market is a challenging task. In this demo, we present Forex-foreteller (FF) which mines news articles and makes forecasts about the movement of foreign currency markets. The system uses a combination of language models, topic clustering, and sentiment analysis to identify relevant news articles. These articles along with the historical stock index and currency exchange values are used in a linear regression model to make forecasts. The system has an interactive visualizer designed specifically for touch-sensitive devices which depicts forecasts along with the chronological news events and financial data used for making the forecasts.
DOI: https://doi.org/10.1145/2487575.2487710 | Published: 2013-08-11
Citations: 41
Optimizing parallel belief propagation in junction trees using regression
Lu Zheng, O. Mengshoel
The junction tree approach, with applications in artificial intelligence, computer vision, machine learning, and statistics, is often used for computing posterior distributions in probabilistic graphical models. One of the key challenges associated with junction trees is computational, and several parallel computing technologies - including many-core processors - have been investigated to meet this challenge. Many-core processors (including GPUs) are now programmable; unfortunately, their complexities make it hard to manually tune their parameters in order to optimize software performance. In this paper, we investigate a machine learning approach to minimize the execution time of parallel junction tree algorithms implemented on a GPU. By carefully allocating a GPU's threads to different parallel computing opportunities in a junction tree, and treating this thread allocation problem as a machine learning problem, we find in experiments that regression - specifically support vector regression - can substantially outperform manual optimization.
DOI: https://doi.org/10.1145/2487575.2487611 | Published: 2013-08-11
Citations: 19
Silence is also evidence: interpreting dwell time for recommendation from psychological perspective
Peifeng Yin, Ping Luo, Wang-Chien Lee, Min Wang
Social media is a platform for people to share and vote on content. From the analysis of the social media data we found that users are quite inactive in rating/voting. For example, a user on average only votes on 2 out of 100 accessed items. Traditional recommendation methods are mostly based on users' votes and thus cannot cope with this situation. Based on the observation that the dwell time on an item may reflect the opinion of a user, we aim to enrich the user-vote matrix by converting the dwell time on items into users' "pseudo votes" and then help improve recommendation performance. However, it is challenging to correctly interpret the dwell time since many subjective human factors, e.g. user expectation, sensitivity to various item qualities, and reading speed, are involved in the casual behavior of online reading. In psychology, it is assumed that people have a choice threshold in decision making. The time spent on making a decision reflects the decision maker's threshold. This idea inspires us to develop a View-Voting (VV) model, which can estimate how much the user likes the viewed item according to her dwell time, and thus make recommendations even if there is no voting data available. Finally, our experimental evaluation shows that the traditional rate-based recommendation's performance is greatly improved with the support of the VV model.
DOI: https://doi.org/10.1145/2487575.2487663 | Published: 2013-08-11
Citations: 62
Gaussian multiple instance learning approach for mapping the slums of the world using very high resolution imagery
Ranga Raju Vatsavai
In this paper, we present a computationally efficient algorithm based on multiple instance learning for mapping informal settlements (slums) using very high-resolution remote sensing imagery. From a remote sensing perspective, informal settlements share unique spatial characteristics that distinguish them from other urban structures like industrial, commercial, and formal residential settlements. However, regular pattern recognition and machine learning methods, which are predominantly single-instance or per-pixel classifiers, often fail to accurately map the informal settlements as they do not capture the complex spatial patterns. To overcome these limitations, we employed a multiple-instance-based machine learning approach, where groups of contiguous pixels (image patches) are modeled as generated by a Gaussian distribution. We have conducted several experiments on very high-resolution satellite imagery, representing four unique geographic regions across the world. Our method showed consistent improvement in accurately identifying informal settlements.
DOI: https://doi.org/10.1145/2487575.2488210 | Published: 2013-08-11
Citations: 36
Ad click prediction: a view from the trenches
H. B. McMahan, Gary Holt, D. Sculley, Michael Young, D. Ebner, Julian Grady, Lan Nie, Todd Phillips, Eugene Davydov, D. Golovin, S. Chikkerur, Dan Liu, M. Wattenberg, A. M. Hrafnkelsson, T. Boulos, J. Kubica
Predicting ad click-through rates (CTR) is a massive-scale learning problem that is central to the multi-billion dollar online advertising industry. We present a selection of case studies and topics drawn from recent experiments in the setting of a deployed CTR prediction system. These include improvements in the context of traditional supervised learning based on an FTRL-Proximal online learning algorithm (which has excellent sparsity and convergence properties) and the use of per-coordinate learning rates. We also explore some of the challenges that arise in a real-world system that may appear at first to be outside the domain of traditional machine learning research. These include useful tricks for memory savings, methods for assessing and visualizing performance, practical methods for providing confidence estimates for predicted probabilities, calibration methods, and methods for automated management of features. Finally, we also detail several directions that did not turn out to be beneficial for us, despite promising results elsewhere in the literature. The goal of this paper is to highlight the close relationship between theoretical advances and practical engineering in this industrial setting, and to show the depth of challenges that appear when applying traditional machine learning methods in a complex dynamic system.
DOI: https://doi.org/10.1145/2487575.2488200 | Published: 2013-08-11
Citations: 910
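The abstract above names an FTRL-Proximal online learner with per-coordinate learning rates but gives no formulas. The sketch below follows the per-coordinate FTRL-Proximal update as it is commonly stated for logistic regression; it is an illustration under that assumption, not the deployed system. The class name `FTRLProximal`, the hyperparameter defaults, and the feature keys are all hypothetical.

```python
import math
from typing import Dict


class FTRLProximal:
    """Per-coordinate FTRL-Proximal for logistic regression (illustrative)."""

    def __init__(self, alpha: float = 0.1, beta: float = 1.0,
                 l1: float = 0.1, l2: float = 1.0) -> None:
        self.alpha, self.beta, self.l1, self.l2 = alpha, beta, l1, l2
        self.z: Dict[str, float] = {}  # accumulated adjusted gradients
        self.n: Dict[str, float] = {}  # accumulated squared gradients

    def _weight(self, key: str) -> float:
        z = self.z.get(key, 0.0)
        if abs(z) <= self.l1:
            return 0.0  # the L1 term keeps weakly supported weights at zero
        sign = 1.0 if z > 0 else -1.0
        n = self.n.get(key, 0.0)
        # The effective learning rate shrinks as this coordinate sees gradients.
        denom = (self.beta + math.sqrt(n)) / self.alpha + self.l2
        return -(z - sign * self.l1) / denom

    def predict(self, x: Dict[str, float]) -> float:
        score = sum(self._weight(k) * v for k, v in x.items())
        score = max(-35.0, min(35.0, score))  # guard against overflow in exp
        return 1.0 / (1.0 + math.exp(-score))

    def update(self, x: Dict[str, float], y: int) -> float:
        """One online step on example (x, y); returns the pre-update prediction."""
        p = self.predict(x)
        for k, v in x.items():
            g = (p - y) * v  # gradient of the log loss for this coordinate
            n = self.n.get(k, 0.0)
            w = self._weight(k)
            sigma = (math.sqrt(n + g * g) - math.sqrt(n)) / self.alpha
            self.z[k] = self.z.get(k, 0.0) + g - sigma * w
            self.n[k] = n + g * g
        return p


# Toy stream: one hypothetical ad is always clicked, the other never.
model = FTRLProximal()
for _ in range(100):
    model.update({"ad=a": 1.0}, 1)
    model.update({"ad=b": 1.0}, 0)
print(model.predict({"ad=a": 1.0}), model.predict({"ad=b": 1.0}))
```

Storing only `z` and `n` per feature and deriving each weight lazily is what lets the L1 threshold produce exact zeros, which is the sparsity property the abstract highlights.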
Selective sampling on graphs for classification
Quanquan Gu, C. Aggarwal, Jialu Liu, Jiawei Han
Selective sampling is an active variant of online learning in which the learner is allowed to adaptively query the label of an observed example. The goal of selective sampling is to achieve a good trade-off between prediction performance and the number of queried labels. Existing selective sampling algorithms are designed for vector-based data. In this paper, motivated by the ubiquity of graph representations in real-world applications, we propose to study selective sampling on graphs. We first present an online version of the well-known Learning with Local and Global Consistency method (OLLGC). It is essentially a second-order online learning algorithm, and can be seen as an online ridge regression in the Hilbert space of functions defined on graphs. We prove its regret bound in terms of the structural property (cut size) of a graph. Based on OLLGC, we present a selective sampling algorithm, namely Selective Sampling with Local and Global Consistency (SSLGC), which queries the label of each node based on the confidence of the linear function on graphs. Its bound on the label complexity is also derived. We analyze the low-rank approximation of graph kernels, which enables the online algorithms to scale to large graphs. Experiments on benchmark graph datasets show that OLLGC outperforms the state-of-the-art first-order algorithm significantly, and SSLGC achieves comparable or even better results than OLLGC while querying substantially fewer nodes. Moreover, SSLGC is overwhelmingly better than random sampling.
DOI: https://doi.org/10.1145/2487575.2487641 | Published: 2013-08-11
Citations: 31
On community detection in real-world networks and the importance of degree assortativity
M. Ciglan, M. Laclavik, K. Nørvåg
Graph clustering, often addressed as community detection, is a prominent task in the domain of graph data mining with dozens of algorithms proposed in recent years. In this paper, we focus on several popular community detection algorithms with low computational complexity and with decent performance on the artificial benchmarks, and we study their behaviour on real-world networks. Motivated by the observation that there is a class of networks for which the community detection methods fail to deliver good community structure, we examine the assortativity coefficient of ground-truth communities and show that assortativity of a community structure can be very different from the assortativity of the original network. We then examine the possibility of exploiting the latter by weighting edges of a network with the aim to improve the community detection outputs for networks with assortative community structure. The evaluation shows that the proposed weighting can significantly improve the results of community detection methods on networks with assortative community structure.
DOI: https://doi.org/10.1145/2487575.2487666 | Published: 2013-08-11
Citations: 36
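The assortativity coefficient discussed in the abstract above is not defined there. A common definition, assumed here, is the Pearson correlation between the degrees found at the two ends of each edge, with every undirected edge counted in both directions. The sketch below computes it for a simple graph (no self-loops or repeated edges); the function name `degree_assortativity` and the example graphs are hypothetical, and the paper may compute the quantity differently.

```python
from typing import Dict, Hashable, List, Tuple


def degree_assortativity(edges: List[Tuple[Hashable, Hashable]]) -> float:
    """Pearson correlation of endpoint degrees over an undirected edge list.

    Assumes a simple graph. Returns NaN when every endpoint has the same
    degree, because the correlation is undefined in that case.
    """
    degree: Dict[Hashable, int] = {}
    for u, v in edges:
        degree[u] = degree.get(u, 0) + 1
        degree[v] = degree.get(v, 0) + 1
    # Count each edge in both directions so the measure is symmetric.
    xs: List[int] = []
    ys: List[int] = []
    for u, v in edges:
        xs += [degree[u], degree[v]]
        ys += [degree[v], degree[u]]
    mean = sum(xs) / len(xs)
    cov = sum((a - mean) * (b - mean) for a, b in zip(xs, ys)) / len(xs)
    var = sum((a - mean) ** 2 for a in xs) / len(xs)
    return cov / var if var > 0 else float("nan")


# A star is maximally disassortative: the hub only touches degree-1 leaves.
star = [(0, 1), (0, 2), (0, 3), (0, 4)]
print(degree_assortativity(star))
```

Applying the same function to the subgraph induced by one community, and comparing it with the value for the whole network, gives a rough way to see the gap between community-level and network-level assortativity that the abstract describes.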
To buy or not to buy: that is the question
Oren Etzioni
Shopping can be decomposed into three basic questions: what, where, and when to buy? In this talk, I'll describe how we utilize advanced data-mining and text-mining techniques at Decide.com (and earlier at Farecast) to solve these problems for on-line shoppers. Our algorithms have predicted prices utilizing billions of data points, and ranked products based on millions of reviews.
DOI: https://doi.org/10.1145/2487575.2491129 | Published: 2013-08-11
Citations: 0
Experience from hosting a corporate prediction market: benefits beyond the forecasts
T. A. Montgomery, Paul M. Stieg, Michael J. Cavaretta, P. Moraal
Prediction markets are virtual stock markets used to gain insight and forecast events by leveraging the wisdom of crowds. Popularly applied in the public to cultural questions (election results, box-office returns), they have recently been applied by corporations to leverage employee knowledge and forecast answers to business questions (sales volumes, products and features, release timing). Determining whether to run a prediction market requires practical experience that is rarely described. Over the last few years, Ford Motor Company obtained practical experience by deploying one of the largest corporate prediction markets known. Business partners in the US, Europe, and South America provided questions on new vehicle features, sales volumes, take rates, pricing, and macroeconomic trends. We describe our experience, including both the strong and weak correlations found between predictions and real world results. Evaluating this methodology goes beyond prediction accuracy, however, since there are many side benefits. In addition to the predictions, we discuss the value of comments, stock price changes over time, the ability to overcome bureaucratic limits, and flexibly filling holes in corporate knowledge, enabling better decision making. We conclude with advice on running prediction markets, including writing good questions, market duration, motivating traders and protecting confidential information.
DOI: https://doi.org/10.1145/2487575.2488212 (published 2013-08-11)
Citations: 7