
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining: latest papers

Mining evolutionary multi-branch trees from text streams
Xiting Wang, Shixia Liu, Yangqiu Song, B. Guo
Understanding topic hierarchies in text streams and how they evolve over time is important in many applications. In this paper, we propose an evolutionary multi-branch tree clustering method for streaming text data. We build evolutionary trees in a Bayesian online filtering framework. The tree construction is formulated as an online posterior estimation problem that considers both the likelihood of the current tree and the conditional prior given the previous tree. We also introduce a constraint model to compute the conditional prior of a tree in the multi-branch setting. Experiments on real-world news data demonstrate that our algorithm incorporates historical tree information better, and is more efficient and effective, than the traditional evolutionary hierarchical clustering algorithm.
DOI: https://doi.org/10.1145/2487575.2487603
Published: 2013-08-11
Citations: 19
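The posterior formulation described in the evolutionary tree abstract above (likelihood of the current tree combined with a conditional prior given the previous tree) can be illustrated with a minimal sketch. Everything in it is hypothetical: candidate trees are reduced to names, the log-likelihoods and tree distances are made-up numbers, and the prior `exp(-lam * distance)` is an assumed stand-in for the paper's constraint model, not the authors' definition.

```python
def log_posterior(log_likelihood, distance_to_previous, lam=1.0):
    """Unnormalized log posterior: log p(data | tree) + log p(tree | previous tree).

    Assumption: the conditional prior is proportional to exp(-lam * distance),
    where distance measures how far a candidate departs from the previous tree.
    """
    log_prior = -lam * distance_to_previous
    return log_likelihood + log_prior


def pick_tree(candidates, lam=1.0):
    """Return the name of the candidate with the highest unnormalized log posterior."""
    return max(candidates, key=lambda name: log_posterior(*candidates[name], lam=lam))


# Hypothetical candidates: name -> (log-likelihood on current data, distance to previous tree)
candidates = {
    "best_fit_only": (-100.0, 8.0),  # fits the new data best but restructures the tree
    "smooth_update": (-103.0, 1.0),  # slightly worse fit, close to the previous tree
}

print(pick_tree(candidates, lam=0.0))  # prior switched off: pure likelihood
print(pick_tree(candidates, lam=1.0))  # prior rewards consistency with history
```

The weight `lam` is the knob that trades fit to the current batch against smoothness over time; with `lam=0` the selection degenerates to ordinary maximum likelihood.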
STED: semi-supervised targeted-interest event detection in Twitter
Ting Hua, F. Chen, Liang Zhao, Chang-Tien Lu, Naren Ramakrishnan
Social microblogs such as Twitter and Weibo are experiencing explosive growth, with billions of global users sharing their daily observations and thoughts. Beyond public interests (e.g., sports, music), microblogs can provide highly detailed information for those interested in public health, homeland security, and financial analysis. However, the language used on Twitter is heavily informal, ungrammatical, and dynamic. Existing data mining algorithms require extensive manual labeling to build and maintain a supervised system. This paper presents STED, a semi-supervised system that helps users automatically detect and interactively visualize events of a targeted type from Twitter, such as crimes, civil unrest, and disease outbreaks. Our model first applies transfer learning and label propagation to automatically generate labeled data, then learns a customized text classifier based on mini-clustering, and finally applies fast spatial scan statistics to estimate the locations of events. We demonstrate STED's usage and benefits using Twitter data collected from Latin American countries, and show how our system helps detect and track example events such as civil unrest and crimes.
DOI: https://doi.org/10.1145/2487575.2487712
Published: 2013-08-11
Citations: 66
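The label propagation step mentioned in the STED abstract above can be sketched generically. This is not the STED implementation: the graph, the similarity weights, and the seed labels are invented for illustration, and the update rule is the textbook "average your neighbors, keep seeds clamped" iteration.

```python
def propagate_labels(weights, seeds, iterations=50):
    """Spread seed scores over a similarity graph.

    weights: dict node -> dict neighbor -> similarity weight (assumed symmetric)
    seeds:   dict node -> score in [0, 1] (1.0 = event-related, 0.0 = unrelated)
    Unlabeled nodes start at 0.5 and repeatedly take the weighted mean of
    their neighbors' scores; seed nodes are clamped to their given score.
    """
    scores = {node: seeds.get(node, 0.5) for node in weights}
    for _ in range(iterations):
        updated = {}
        for node, neighbors in weights.items():
            if node in seeds:
                updated[node] = seeds[node]
                continue
            total = sum(neighbors.values())
            if total > 0:
                updated[node] = sum(w * scores[nbr] for nbr, w in neighbors.items()) / total
            else:
                updated[node] = scores[node]  # isolated node: nothing to learn from
        scores = updated
    return scores


# Hypothetical tweet-similarity graph: t1 is a known event tweet, t4 a known non-event tweet.
graph = {
    "t1": {"t2": 1.0},
    "t2": {"t1": 1.0, "t3": 0.2},
    "t3": {"t2": 0.2, "t4": 1.0},
    "t4": {"t3": 1.0},
}
seeds = {"t1": 1.0, "t4": 0.0}

scores = propagate_labels(graph, seeds)
for tweet in sorted(scores):
    print(tweet, round(scores[tweet], 3))
```

Thresholding the propagated scores (for example at 0.5) would turn them into automatically generated training labels for a downstream classifier, which is the role the abstract assigns to this step.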
MI2LS: multi-instance learning from multiple information sources
Dan Zhang, Jingrui He, Richard D. Lawrence
In Multiple Instance Learning (MIL), each entity is normally expressed as a set of instances. Most current MIL methods only handle the case in which each instance is represented by one type of feature. However, in many real-world applications, entities are often described from several different information sources/views. For example, when applying MIL to image categorization, the characteristics of each image can be derived from both its RGB features and its SIFT features. Previous research has shown that, in traditional learning methods, leveraging the consistencies between different information sources can improve classification performance drastically. With a similar motivation, to incorporate the consistencies between different information sources into MIL, we propose a novel research framework -- Multi-Instance Learning from Multiple Information Sources (MI2LS). Based on this framework, an algorithm -- Fast MI2LS (FMI2LS) -- is designed, which combines the Concave-Convex Constraint Programming (CCCP) method and an adapted Stochastic Gradient Descent (SGD) method. Theoretical analyses of the optimality of the adapted SGD method and of the generalized error bound of the formulation are given. Experimental results on document classification and a novel application -- Insider Threat Detection (ITD) -- clearly demonstrate the superior performance of the proposed method over state-of-the-art MIL methods.
DOI: https://doi.org/10.1145/2487575.2487651
Published: 2013-08-11
Citations: 32
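To make the multi-instance, multi-view setting in the MI2LS abstract above concrete, here is a small sketch. It is an assumption-laden illustration, not the FMI2LS algorithm: it uses fixed linear scorers, the common MIL convention that a bag is scored by its highest-scoring instance, and a squared-difference penalty as one possible way to express "consistency" between two views.

```python
def dot(w, x):
    """Plain dot product of two equal-length lists."""
    return sum(wi * xi for wi, xi in zip(w, x))


def bag_score(bag, w_view1, w_view2):
    """Score a bag as the max over instances of the averaged two-view score."""
    return max(0.5 * (dot(w_view1, x1) + dot(w_view2, x2)) for x1, x2 in bag)


def view_disagreement(bag, w_view1, w_view2):
    """Mean squared gap between the two views' instance scores (a consistency penalty)."""
    gaps = [(dot(w_view1, x1) - dot(w_view2, x2)) ** 2 for x1, x2 in bag]
    return sum(gaps) / len(gaps)


# Each instance is a pair (features in view 1, features in view 2); all numbers are made up.
positive_bag = [([0.1, 0.0], [0.0, 0.2]), ([0.9, 0.8], [1.0, 0.7])]
negative_bag = [([0.1, 0.2], [0.0, 0.1]), ([0.2, 0.0], [0.1, 0.1])]
w1, w2 = [1.0, 1.0], [1.0, 1.0]  # hypothetical per-view weight vectors

print(bag_score(positive_bag, w1, w2))
print(bag_score(negative_bag, w1, w2))
print(view_disagreement(positive_bag, w1, w2))
```

A training objective in this spirit would combine a classification loss on `bag_score` with a weighted `view_disagreement` term; the max over instances is what makes such objectives non-convex and motivates procedures like CCCP.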
iHR: an online recruiting system for Xiamen Talent Service Center
Wenxing Hong, Lei Li, Tao Li, Wenfu Pan
Online recruiting systems have gained immense attention as more and more job seekers search for jobs, and enterprises look for candidates, on the Internet. A critical problem in a recruiting system is how to maximally satisfy the desires of both job seekers and enterprises with reasonable recommendations or search results. In this paper, we investigate and compare various online recruiting systems from a product perspective. We then point out several key functions that help a recruiting system achieve a win-win situation between job seekers and enterprises. Based on these observations and key functions, we design, implement and deploy a web-based recruiting system, named iHR, for Xiamen Talent Service Center. The system uses the latest advances in data mining and recommendation technologies to create a user-oriented service for a broad audience in the job marketing community. Empirical evaluation and online user studies demonstrate the efficacy and effectiveness of our proposed system. Currently, iHR has been deployed at http://i.xmrc.com.cn/XMRCIntel.
DOI: https://doi.org/10.1145/2487575.2488199
Published: 2013-08-11
Citations: 21
A data-driven method for in-game decision making in MLB: when to pull a starting pitcher
Gartheeban Ganeshapillai, J. Guttag
Professional sports is a roughly $500 billion industry that is increasingly data-driven. In this paper we show how machine learning can be applied to generate a model that could lead to better on-field decisions by managers of professional baseball teams. Specifically, we show how to use regularized linear regression to learn pitcher-specific predictive models that can help decide when a starting pitcher should be replaced. A key step in the process is our method of converting categorical variables (e.g., the venue in which a game is played) into continuous variables suitable for the regression. Another key step is dealing with situations in which there is too little data to compute measures such as the effectiveness of a pitcher against specific batters. For each season we trained on the first 80% of the games and tested on the rest. The results suggest that using our model could have led to better decisions than those made by major league managers. Applying our model would have led to a different decision 48% of the time. For those games in which a manager left in a pitcher that our model would have removed, the pitcher ended up performing poorly 60% of the time.
DOI: https://doi.org/10.1145/2487575.2487660
Published: 2013-08-11
Citations: 16
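Three ingredients named in the pitcher abstract above (a chronological 80/20 split, turning a categorical variable into a continuous one, and regularized linear regression) can be sketched together. The abstract does not say how the authors encode categories, so the mean-target encoding below is an assumption, as are the toy data, the field names `venue` and `runs`, and the single-feature ridge solver.

```python
def chronological_split(rows, train_fraction=0.8):
    """Split time-ordered rows into an early training part and a later test part."""
    cut = int(len(rows) * train_fraction)
    return rows[:cut], rows[cut:]


def mean_encode(train_rows, key, target):
    """Map each category to its mean target in the training rows (assumed encoding)."""
    sums, counts = {}, {}
    for row in train_rows:
        sums[row[key]] = sums.get(row[key], 0.0) + row[target]
        counts[row[key]] = counts.get(row[key], 0) + 1
    overall = sum(row[target] for row in train_rows) / len(train_rows)
    return {cat: sums[cat] / counts[cat] for cat in sums}, overall


def ridge_1d(xs, ys, lam=1.0):
    """One-feature ridge regression on centered data: slope = Sxy / (Sxx + lam)."""
    mean_x, mean_y = sum(xs) / len(xs), sum(ys) / len(ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / (sxx + lam)
    return slope, mean_y - slope * mean_x


# Hypothetical season, already in date order: venue and runs allowed by the starter.
venues = ["A", "B", "A", "B", "A", "B", "A", "B", "A", "C"]
runs = [2.0, 5.0, 3.0, 4.0, 2.0, 6.0, 3.0, 5.0, 4.0, 3.0]
games = [{"venue": v, "runs": r} for v, r in zip(venues, runs)]

train, test = chronological_split(games)
encoding, overall = mean_encode(train, "venue", "runs")
xs = [encoding[g["venue"]] for g in train]
ys = [g["runs"] for g in train]
slope, intercept = ridge_1d(xs, ys, lam=1.0)

for g in test:
    x = encoding.get(g["venue"], overall)  # unseen venue falls back to the overall mean
    print(g["venue"], round(slope * x + intercept, 2))
```

Fitting the encoding on the training games only, and falling back to the overall mean for a venue never seen in training, keeps information from the held-out games out of the features.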
Debiasing social wisdom
Abhimanyu Das, Sreenivas Gollapudi, R. Panigrahy, Mahyar Salek
With the explosive growth of social networks, many applications are increasingly harnessing the pulse of online crowds for a variety of tasks such as marketing, advertising, and opinion mining. An important example is the wisdom-of-the-crowd effect, which has been well studied for such tasks when the crowd is non-interacting. However, these studies do not explicitly address the network effects in social networks. A key difference in this setting is the presence of social influences that arise from these interactions and can undermine the wisdom of the crowd [17]. Using a natural model of opinion formation, we analyze the effect of these interactions on an individual's opinion and estimate her propensity to conform. We then propose efficient sampling algorithms incorporating these conformity values to arrive at a debiased estimate of the wisdom of a crowd. We analyze the trade-off between the sample size and estimation error and validate our algorithms using both real data obtained from online user experiments and synthetic data.
DOI: https://doi.org/10.1145/2487575.2487684
Published: 2013-08-11
Citations: 47
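The idea of using conformity values to debias a crowd estimate, as described in the abstract above, can be shown on a toy example. The model here is an assumption made for illustration and is not taken from the paper: each expressed opinion is a convex mix of a private opinion and the mean opinion the person observed, with a known conformity weight. Under that assumed model the private opinion can be recovered by inverting the mix.

```python
def expressed_opinion(private, neighbor_mean, conformity):
    """Assumed opinion model: a convex mix of private belief and observed neighbor mean."""
    return (1.0 - conformity) * private + conformity * neighbor_mean


def recover_private(expressed, neighbor_mean, conformity):
    """Invert the assumed model to estimate the private opinion."""
    if not 0.0 <= conformity < 1.0:
        raise ValueError("conformity must be in [0, 1)")
    return (expressed - conformity * neighbor_mean) / (1.0 - conformity)


# Hypothetical crowd: private guesses centered on 100, all exposed to an inflated signal of 150.
private = [90.0, 110.0, 100.0]
conformity = [0.5, 0.2, 0.0]
neighbor_mean = 150.0

expressed = [expressed_opinion(p, neighbor_mean, c) for p, c in zip(private, conformity)]
naive = sum(expressed) / len(expressed)

recovered = [recover_private(e, neighbor_mean, c) for e, c in zip(expressed, conformity)]
debiased = sum(recovered) / len(recovered)

print(round(naive, 2), round(debiased, 2))
```

In practice the conformity values are not given and have to be estimated, which is the part the abstract attributes to the paper's analysis; the sketch only shows why knowing them helps.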
The dataminer's guide to scalable mixed-membership and nonparametric Bayesian models
Amr Ahmed, Alex Smola
Large amounts of data arise in a multitude of situations, ranging from bioinformatics to astronomy, manufacturing, and medical applications. For concreteness, our tutorial focuses on data obtained in the context of the internet, such as user-generated content (microblogs, e-mails, messages), behavioral data (locations, interactions, clicks, queries), and graphs. Because of the data's magnitude, much of the challenge is to extract structure and interpretable models without the need for additional labels, i.e., to design effective unsupervised techniques. We present design patterns for hierarchical nonparametric Bayesian models, efficient inference algorithms, and modeling tools to describe salient aspects of the data.
DOI: https://doi.org/10.1145/2487575.2506181
Published: 2013-08-11
Citations: 0
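As background for the nonparametric Bayesian models named in the tutorial abstract above, the following sketch samples from a Chinese restaurant process, a standard building block of such models in which the number of clusters is not fixed in advance. It is not taken from the tutorial; the function name, the concentration value, and the seed are illustrative choices.

```python
import random


def chinese_restaurant_process(n_customers, alpha, seed=0):
    """Sample cluster assignments from a CRP with concentration alpha > 0.

    Customer i (0-based) joins existing table k with probability size_k / (i + alpha)
    and opens a new table with probability alpha / (i + alpha).
    """
    rng = random.Random(seed)
    table_sizes = []
    assignments = []
    for i in range(n_customers):
        r = rng.uniform(0.0, i + alpha)
        cumulative = 0.0
        chosen = len(table_sizes)  # default: open a new table
        for k, size in enumerate(table_sizes):
            cumulative += size
            if r < cumulative:
                chosen = k
                break
        if chosen == len(table_sizes):
            table_sizes.append(1)
        else:
            table_sizes[chosen] += 1
        assignments.append(chosen)
    return assignments, table_sizes


assignments, table_sizes = chinese_restaurant_process(50, alpha=1.0, seed=0)
print(len(table_sizes), table_sizes)
```

Larger `alpha` values tend to produce more tables, so the data rather than a preset constant drives how many clusters appear.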
Mining data from mobile devices: a survey of smart sensing and analytics
S. Papadimitriou, Tina Eliassi-Rad
Mobile connected devices, and smartphones in particular, are rapidly emerging as a dominant computing and sensing platform. This creates several unique opportunities for data collection and analysis, as well as new challenges. In this tutorial, we survey the state of the art in mining data from mobile devices across different application areas such as ads, healthcare, geosocial, public policy, etc. Our tutorial has three parts. In part one, we summarize data collection in terms of various sensing modalities. In part two, we present cross-cutting challenges such as real-time analysis and security, and we outline cross-cutting methods for mobile data mining such as network inference, streaming algorithms, etc. In the last part, we give an overview of emerging and fast-growing application areas, such as those noted above. In conclusion, we briefly highlight the opportunities for joint design of new data collection techniques and analysis methods, suggesting additional directions for future research.
DOI: https://doi.org/10.1145/2487575.2506177
Published: 2013-08-11
Citations: 5
Scalable inference in max-margin topic models
Jun Zhu, Xun Zheng, Li Zhou, Bo Zhang
Topic models have played a pivotal role in analyzing large collections of complex data. Besides discovering latent semantics, supervised topic models (STMs) can make predictions on unseen test data. Combining STMs with advanced learning techniques has dramatically enhanced their predictive strength; one example is max-margin supervised topic models, state-of-the-art methods that integrate max-margin learning with topic models. Though powerful, max-margin STMs have a hard non-smooth learning problem. Existing algorithms rely on solving multiple latent SVM subproblems in an EM-type procedure, which can be too slow to be applicable to large-scale categorization tasks. In this paper, we present a highly scalable approach to building max-margin supervised topic models. Our approach builds on three key innovations: 1) a new formulation of Gibbs max-margin supervised topic models for both multi-class and multi-label classification; 2) a simple "augment-and-collapse" Gibbs sampling algorithm that makes no restrictive assumptions on the posterior distributions; 3) an efficient parallel implementation that can easily tackle data sets with hundreds of categories and millions of documents. Furthermore, our algorithm does not need to solve SVM subproblems. Although our methods perform the two tasks of topic discovery and learning predictive models jointly, which significantly improves classification performance, their scalability is comparable to that of the state-of-the-art parallel algorithms for standard LDA topic models, which perform the single task of topic discovery only. Finally, an open-source implementation is provided at: http://www.ml-thu.net/~jun/medlda.
主题模型在分析大型复杂数据集方面发挥了关键作用。除了发现潜在语义外,监督主题模型(STMs)还可以对未知的测试数据进行预测。通过与先进的学习技术相结合,stm的预测能力得到了极大的增强,例如最大边际监督主题模型,将最大边际学习与主题模型相结合的最先进的方法。虽然功能强大,但最大边际stm存在一个困难的非平滑学习问题。现有算法依赖于在em型过程中求解多个潜在的SVM子问题,速度太慢,无法适用于大规模的分类任务。在本文中,我们提出了一种高度可扩展的方法来构建最大边际监督主题模型。我们的方法建立在三个关键创新的基础上:1)针对多类和多标签分类的Gibbs最大边际监督主题模型的新公式;2)一种简单的“扩充-坍缩”吉布斯抽样算法,不对后验分布作限制性假设;3)一种高效的并行实现,可以轻松处理包含数百个类别和数百万个文档的数据集。此外,我们的算法不需要解决支持向量机的子问题。虽然我们的方法将主题发现和学习预测模型两项任务联合起来执行,大大提高了分类性能,但我们的方法具有与仅执行单一主题发现任务的标准LDA主题模型的最先进并行算法相当的可扩展性。最后,还提供了一个开源实现:http://www.ml-thu.net/~jun/medlda。
DOI: 10.1145/2487575.2487658 (https://doi.org/10.1145/2487575.2487658). Published 2013-08-11.
Citations: 19
A “semi-lazy” approach to probabilistic path prediction in dynamic environments
Jingbo Zhou, A. Tung, Wei Wu, W. Ng
Path prediction is useful in a wide range of applications. Most of the existing solutions, however, are based on eager learning methods in which models and patterns are extracted from historical trajectories and then used for future prediction. Since such approaches are committed to a set of statistically significant models or patterns, problems can arise in dynamic environments where the underlying models change quickly or where regions are not covered by statistically significant models or patterns. We propose a "semi-lazy" approach to path prediction that builds prediction models on the fly using dynamically selected reference trajectories. Such an approach has several advantages. First, the target trajectories to be predicted are known before the models are built, which allows us to construct models that are deemed relevant to the target trajectories. Second, unlike lazy learning approaches, we use sophisticated learning algorithms to derive accurate prediction models with acceptable delay based on a small number of selected reference trajectories. Finally, our approach can be continuously self-correcting, since we can dynamically reconstruct models if the predicted movements do not match the actual ones. Our prediction model can construct a probabilistic path whose probability of occurrence is larger than a threshold and which is furthest ahead in terms of time. Users can control the confidence of the path prediction by setting a probability threshold. We conducted a comprehensive experimental study on real-world and synthetic datasets to show the effectiveness and efficiency of our approach.
DOI: 10.1145/2487575.2487609 (https://doi.org/10.1145/2487575.2487609). Published 2013-08-11.
Citations: 49
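The threshold-controlled prediction described in the "semi-lazy" abstract above can be illustrated with a toy sketch. This is not the authors' model: it assumes trajectories are lists of discrete cell labels, selects reference trajectories by exact match on the target's recent movement, and estimates path probabilities by simple counting. The function name `predict_path` and all parameters are hypothetical.

```python
from collections import Counter


def predict_path(recent, references, threshold):
    """Return (path, probability): the longest continuation of `recent`
    whose empirical probability among matching references is >= threshold.

    Toy illustration only; exact matching and frequency counting stand in
    for the learned prediction model described in the abstract.
    """
    n = len(recent)
    # Lazy step: pick reference trajectories only once the target is known.
    continuations = []
    for traj in references:
        for i in range(len(traj) - n + 1):
            if traj[i:i + n] == recent:
                continuations.append(tuple(traj[i + n:]))
                break  # use the first match in each reference
    if not continuations:
        return (), 0.0

    best_path, best_prob = (), 1.0
    horizon = 1
    while True:
        # Frequency of each continuation prefix of length `horizon`.
        counts = Counter(c[:horizon] for c in continuations if len(c) >= horizon)
        if not counts:
            break  # no reference reaches this far ahead
        path, count = counts.most_common(1)[0]
        prob = count / len(continuations)
        if prob < threshold:
            break  # the best prefix probability can only shrink as horizon grows
        best_path, best_prob = path, prob
        horizon += 1
    return best_path, best_prob


references = [list("ABCDE"), list("ABCDF"), list("ABCX"), list("ZBCDE")]
print(predict_path(list("BC"), references, 0.6))
print(predict_path(list("BC"), references, 0.5))
```

Lowering the threshold trades confidence for look-ahead: in this toy data, a threshold of 0.6 yields a one-step path, while 0.5 extends it to two steps.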