
Proceedings of the 20th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining: Latest publications

Multi-task copula by sparse graph regression
Tianyi Zhou, D. Tao
This paper proposes multi-task copula (MTC), which can handle a much wider class of tasks than the mean regression with Gaussian noise assumed in most prior multi-task learning (MTL). While prior MTL emphasizes shared structure among models, MTC aims at joint prediction that exploits inter-output correlation. Given an input, the outputs of MTC are allowed to follow an arbitrary joint continuous distribution. MTC captures the joint likelihood of the multiple outputs by first learning the marginal of each output and then a sparse and smooth output-dependency graph function. While the former can be achieved by classical MTL, learning graphs that vary dynamically with the input is quite a challenge. We address this issue by developing sparse graph regression (SpaGraphR), a non-parametric estimator incorporating kernel smoothing, maximum likelihood, and sparse graph structure to obtain a fast learning algorithm. It starts from a few seed graphs on a few input points, and then updates the graphs on other input points by a fast operator via coarse-to-fine propagation. Due to the power of copulas in modeling semi-parametric distributions, SpaGraphR can model a rich class of dynamic non-Gaussian correlations. We show that MTC can address more flexible and difficult tasks that do not fit the assumptions of prior MTL nicely, and can fully exploit their relatedness. Experiments on robotic control and stock price prediction justify its appealing performance in challenging MTL problems.
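As a hedged illustration of the copula idea (not the paper's SpaGraphR, whose dependency graph varies with the input), the sketch below fits a static Gaussian copula to two dependent non-Gaussian outputs: each margin is mapped to normal scores through its empirical CDF, and the dependence is then summarized by a single correlation parameter. All data and parameters here are synthetic.

```python
import bisect
import random
import statistics

# Synthetic, hypothetical data: two dependent, non-Gaussian outputs.
random.seed(0)
n = 2000
y1 = [random.expovariate(1.0) for _ in range(n)]
y2 = [v ** 0.5 + 0.3 * random.gauss(0, 1) for v in y1]

def empirical_cdf(sample):
    """Return a function mapping a value to its empirical CDF, kept in (0, 1)."""
    s = sorted(sample)
    m = len(s)
    def cdf(x):
        r = bisect.bisect_right(s, x)        # rank of x in the sample
        return max(1, min(m, r)) / (m + 1)   # avoid exactly 0 or 1
    return cdf

def corr(a, b):
    """Pearson correlation, computed directly to stay dependency-free."""
    ma, mb = sum(a) / len(a), sum(b) / len(b)
    num = sum((u - ma) * (v - mb) for u, v in zip(a, b))
    da = sum((u - ma) ** 2 for u in a) ** 0.5
    db = sum((v - mb) ** 2 for v in b) ** 0.5
    return num / (da * db)

nd = statistics.NormalDist()
f1, f2 = empirical_cdf(y1), empirical_cdf(y2)
# Map each margin to normal scores; the dependence survives the transform.
z1 = [nd.inv_cdf(f1(v)) for v in y1]
z2 = [nd.inv_cdf(f2(v)) for v in y2]
rho = corr(z1, z2)   # the Gaussian-copula correlation parameter
```

With the marginals handled separately, rho alone encodes the joint dependence; MTC generalizes this scalar to a sparse graph over many outputs that is itself a smooth function of the input.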
DOI: 10.1145/2623330.2623697 (published 2014-08-24)
Citations: 4
Sampling for big data: a tutorial
Graham Cormode, N. Duffield
One response to the proliferation of large datasets has been to develop ingenious ways to throw resources at the problem, using massive fault-tolerant storage architectures and parallel and graph computation models such as MapReduce, Pregel and Giraph. However, not all environments can support this scale of resources, and not all queries need an exact response. This motivates the use of sampling to generate summary datasets that support rapid queries and prolong the useful life of the data in storage. To be effective, sampling must mediate the tensions between resource constraints, data characteristics, and the required query accuracy. The state of the art in sampling goes far beyond simple uniform selection of elements, to maximize the usefulness of the resulting sample. This tutorial reviews progress in sample design for large datasets, including streaming and graph-structured data, and discusses applications to sampling network traffic and social networks.
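The uniform baseline such tutorials start from can be sketched as reservoir sampling (Algorithm R), which keeps a fixed-size uniform sample of a stream of unknown length; the weighted and graph-aware designs the tutorial surveys build on this idea.

```python
import random

def reservoir_sample(stream, k, rng=None):
    """Maintain a uniform random sample of size k over a stream (Algorithm R)."""
    rng = rng or random.Random()
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)      # fill the reservoir first
        else:
            j = rng.randrange(i + 1)    # item survives with probability k/(i+1)
            if j < k:
                reservoir[j] = item
    return reservoir

sample = reservoir_sample(range(10_000), 10, random.Random(42))
```

A single pass and O(k) memory is what makes this attractive when the data cannot be revisited, e.g. for network traffic streams.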
DOI: 10.1145/2623330.2630811 (published 2014-08-24)
Citations: 51
A case study: privacy-preserving release of spatio-temporal density in Paris
G. Ács, C. Castelluccia
With billions of handsets in use worldwide, the quantity of mobility data is gigantic. When aggregated, these data can help us understand complex processes such as the spread of viruses, build better transportation systems, and prevent traffic congestion. While the benefits provided by these datasets are indisputable, they unfortunately pose a considerable threat to location privacy. In this paper, we present a new anonymization scheme to release the spatio-temporal density of Paris, France, i.e., the number of individuals in 989 different areas of the city, released every hour over a whole week. The density is computed from a call-data-record (CDR) dataset, provided by the French telecom operator Orange, containing the CDRs of roughly 2 million users over one week. Our scheme is differentially private, and hence provides a provable privacy guarantee to each individual in the dataset. Our main goal with this case study is to show that, even with high-dimensional sensitive data, differential privacy can provide practical utility with a meaningful privacy guarantee, if the anonymization scheme is carefully designed. This work is part of the national project XData (http://xdata.fr), which aims at combining large (anonymized) datasets provided by different service providers (telecom, electricity, water management, postal service, etc.).
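A hedged sketch of the basic mechanism behind such a release is the Laplace mechanism for count queries; the paper's actual scheme layers sampling and aggregation steps on top of this to cope with the high dimensionality. The counts and epsilon below are invented.

```python
import math
import random

def laplace_noise(scale, rng):
    """Sample Laplace(0, scale) noise via the inverse-CDF transform."""
    u = rng.random() - 0.5
    sign = 1 if u >= 0 else -1
    return -scale * sign * math.log(1 - 2 * abs(u))

def private_counts(counts, epsilon, sensitivity=1.0, rng=None):
    """Release counts with epsilon-differential privacy via Laplace noise."""
    rng = rng or random.Random()
    scale = sensitivity / epsilon    # noise scale grows as epsilon shrinks
    return [c + laplace_noise(scale, rng) for c in counts]

true_counts = [120, 45, 300, 7]      # hypothetical people-per-area counts for one hour
noisy = private_counts(true_counts, epsilon=1.0, rng=random.Random(0))
```

Because each user contributes at most `sensitivity` to any count, the noisy release satisfies epsilon-DP regardless of what side information an attacker holds.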
DOI: 10.1145/2623330.2623361 (published 2014-08-24)
Citations: 108
Predicting student risks through longitudinal analysis
Ashay Tamhane, S. Ikbal, Bikram Sengupta, Mayuri Duggirala, James J. Appleton
Poor academic performance in K-12 is often a precursor to unsatisfactory educational outcomes such as dropout, which are associated with significant personal and social costs. Hence, it is important to be able to predict students at risk of poor performance, so that the right personalized intervention plans can be initiated. In this paper, we report on a large-scale study to identify students at risk of not meeting acceptable levels of performance in one state-level and one national standardized assessment in Grade 8 of a major US school district. An important highlight of our study is its scale, in terms of the number of students included, the number of years covered, and the number of features, which provides a very solid grounding for the research. We report on our experience with handling the scale and complexity of the data, and on the relative performance of the various machine learning techniques we used for building predictive models. Our results demonstrate that it is possible to predict students at risk of poor assessment performance with a high degree of accuracy, and to do so well in advance. These insights can be used to proactively initiate personalized intervention programs and improve the chances of student success.
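As an illustrative toy only (not the study's actual models, features, or data), the sketch below trains a minimal logistic-regression risk classifier by gradient descent on synthetic attendance and prior-score features; all names and thresholds are invented.

```python
import math
import random

# Invented synthetic cohort: risk is driven by low attendance and low prior score.
rng = random.Random(1)
X, y = [], []
for _ in range(500):
    attendance = rng.random()               # fraction of days attended (made up)
    prior = rng.random()                    # normalized prior assessment score (made up)
    at_risk = 1 if attendance + prior + rng.gauss(0, 0.2) < 0.8 else 0
    X.append((1.0, attendance, prior))      # leading 1.0 is the intercept term
    y.append(at_risk)

def predict(w, xi):
    """Logistic model probability for one example."""
    return 1 / (1 + math.exp(-sum(wj * xj for wj, xj in zip(w, xi))))

w, lr = [0.0, 0.0, 0.0], 0.5
for _ in range(200):                        # full-batch gradient descent
    grad = [0.0, 0.0, 0.0]
    for xi, yi in zip(X, y):
        p = predict(w, xi)
        for j in range(3):
            grad[j] += (p - yi) * xi[j]
    w = [wj - lr * g / len(X) for wj, g in zip(w, grad)]

# Training accuracy of the fitted model
acc = sum((predict(w, xi) > 0.5) == bool(yi) for xi, yi in zip(X, y)) / len(y)
```

The model's probability output, not just the hard label, is what an intervention program would act on, ranking students by predicted risk.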
DOI: 10.1145/2623330.2623355 (published 2014-08-24)
Citations: 46
Large scale predictive modeling for micro-simulation of 3G air interface load
D. Radosavljevik, P. V. D. Putten
This paper outlines the approach developed together with the Radio Network Strategy & Design department of a large European telecom operator to forecast the air-interface load in their 3G network, which is used for planning network upgrades and for budgeting. It is based on large-scale intelligent data analysis and modeling at the level of thousands of individual radio cells, resulting in 30,000 models per day. It has been embedded into a scenario-simulation framework that is used by end users not experienced in data mining to study and simulate the behavior of this complex networked system, as an example of a systematic approach to the deployment step of the KDD process. This system has already been in use for two years in the country where it was developed, and it is part of a standard business process. In the last six months, this national operator became a competence center for predictive modeling of 3G air-interface load micro-simulation for four other operators of the same parent company.
DOI: 10.1145/2623330.2623339 (published 2014-08-24)
Citations: 2
Does social good justify risking personal privacy?
R. Ramakrishnan, Geoffrey I. Webb
When data-driven improvements involve personally identifiable data, or even data that can be used to infer sensitive information about individuals, we face the dilemma that we potentially risk compromising privacy. As we see increased emphasis on using data mining to effect improvements in a range of socially beneficial activities, from better matching talented students to higher-education opportunities, to improving the allocation of funds across competing school programs, to reducing hospitalization time following surgery, the dilemma can often be especially acute. The data involved is often personally identifiable, or revealing and sensitive, and many of the institutions that must be involved in gathering and maintaining custody of the data are not equipped to adequately secure it, raising the risk of privacy breaches. How should we approach this trade-off? Can we assess the risks? Can we control or mitigate them? Can we develop guidelines for when the risk is or is not worthwhile, and for how best to handle data in different common scenarios? Chairs Raghu Ramakrishnan and Geoffrey I. Webb bring together this panel of leading data miners and privacy experts to address these critical issues.
DOI: 10.1145/2623330.2630814 (published 2014-08-24)
Citations: 1
FEMA: flexible evolutionary multi-faceted analysis for dynamic behavioral pattern discovery
Meng Jiang, Peng Cui, Fei Wang, Xinran Xu, Wenwu Zhu, Shiqiang Yang
Behavioral pattern discovery is increasingly being studied to understand human behavior, and the discovered patterns can be used in many real-world applications such as web search, recommender systems and advertisement targeting. Traditional methods usually treat behaviors as simple user-item connections, or represent them with a static model. In the real world, however, human behaviors are complex and dynamic: they include correlations between users and multiple types of objects, and they continuously evolve over time. These characteristics cause severe data-sparsity and computational-complexity problems, which pose great challenges to human behavioral analysis and prediction. In this paper, we propose a Flexible Evolutionary Multi-faceted Analysis (FEMA) framework for both behavior prediction and pattern mining. FEMA utilizes a flexible and dynamic factorization scheme for analyzing human behavioral data sequences, which can incorporate various kinds of knowledge embedded in different object domains to alleviate the sparsity problem. We give approximation algorithms for efficiency, where the bound of the approximation loss is theoretically proved. We extensively evaluate the proposed method on two real datasets. For the prediction of human behaviors, FEMA significantly outperforms state-of-the-art baseline methods by 17.4%. Moreover, FEMA is able to discover quite a number of interesting multi-faceted temporal patterns in human behavior with good interpretability. More importantly, it can reduce the run time from hours to minutes, which is significant for industry when serving real-time applications.
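FEMA's evolutionary, multi-faceted scheme is not reproduced here; as a hedged sketch of the latent-factor building block that such factorization methods extend with temporal and cross-domain coupling, the code below factorizes a synthetic low-rank user-item behavior matrix by stochastic gradient descent.

```python
import random

rng = random.Random(0)
n_users, n_items, k = 20, 15, 3
# Synthetic "behavior intensity" matrix with an exact rank-k structure.
U0 = [[rng.random() for _ in range(k)] for _ in range(n_users)]
V0 = [[rng.random() for _ in range(k)] for _ in range(n_items)]
R = [[sum(U0[u][f] * V0[i][f] for f in range(k)) for i in range(n_items)]
     for u in range(n_users)]

# Learn latent factors U, V so that U @ V.T approximates R.
U = [[rng.random() * 0.1 for _ in range(k)] for _ in range(n_users)]
V = [[rng.random() * 0.1 for _ in range(k)] for _ in range(n_items)]
lr = 0.05
for _ in range(200):                        # SGD sweep over all observed cells
    for u in range(n_users):
        for i in range(n_items):
            pred = sum(U[u][f] * V[i][f] for f in range(k))
            err = R[u][i] - pred
            for f in range(k):
                U[u][f], V[i][f] = (U[u][f] + lr * err * V[i][f],
                                    V[i][f] + lr * err * U[u][f])

rmse = (sum((R[u][i] - sum(U[u][f] * V[i][f] for f in range(k))) ** 2
            for u in range(n_users) for i in range(n_items))
        / (n_users * n_items)) ** 0.5
```

The learned row factors are the per-user behavioral profiles; dynamic methods like FEMA let these factors drift over time instead of staying fixed.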
DOI: 10.1145/2623330.2623644 (published 2014-08-24)
Citations: 53
FBLG: a simple and effective approach for temporal dependence discovery from time series data
Dehua Cheng, M. T. Bahadori, Yan Liu
Discovering temporal dependence structure from multivariate time series has established its importance in many applications. We observe that when we look at a time series in reversed order, its temporal dependence structure is usually preserved after switching the roles of cause and effect. Inspired by this observation, we create a new time series by reversing the time stamps of the original series, and combine both series to improve the performance of temporal dependence recovery. We also provide theoretical justification for the proposed algorithm under several existing time series models. We test our approach on both synthetic and real-world datasets. The experimental results confirm that this surprisingly simple approach is indeed effective under various circumstances.
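A toy version of the time-reversal trick on synthetic series (the actual FBLG fits sparse lasso regressions over many series and lags): generate x driving y at lag 1, estimate the lagged coefficient on the original series, re-estimate it on the reversed series with cause and effect swapped, and combine the two estimates.

```python
import random

rng = random.Random(7)
T = 400
x = [rng.gauss(0, 1)]
y = [rng.gauss(0, 1)]
for t in range(1, T):
    x.append(0.6 * x[-1] + rng.gauss(0, 0.5))      # x is autoregressive
    y.append(0.8 * x[t - 1] + rng.gauss(0, 0.5))   # x drives y at lag 1

def lag1_coef(cause, effect):
    """Least-squares coefficient of effect[t] regressed on cause[t-1]."""
    num = sum(cause[t - 1] * effect[t] for t in range(1, len(effect)))
    den = sum(c * c for c in cause[:-1])
    return num / den

forward = lag1_coef(x, y)                # dependence estimated forward in time
backward = lag1_coef(y[::-1], x[::-1])   # cause and effect swap roles when time reverses
combined = 0.5 * (forward + backward)    # reversed-time evidence reinforces the estimate
```

Averaging the forward and reversed estimates is the simplest possible combination rule; the paper's point is that the reversed series carries genuine extra evidence about the same dependence structure.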
DOI: 10.1145/2623330.2623709 (published 2014-08-24)
Citations: 35
Dynamics of news events and social media reaction
Mikalai Tsytsarau, Themis Palpanas, M. Castellanos
The analysis of social sentiment expressed on the Web is becoming increasingly relevant to a variety of applications, and it is important to understand the underlying mechanisms which drive the evolution of sentiments in one way or another, in order to be able to predict these changes in the future. In this paper, we study the dynamics of news events and their relation to changes of sentiment expressed on relevant topics. We propose a novel framework, which models the behavior of news and social media in response to events as a convolution between an event's importance and a media response function, specific to the medium and event type. This framework is suitable for detecting the time and duration of events, as well as their impact and dynamics, from time series of publication volume. These data can greatly enhance event analysis; for instance, they can help distinguish important events from unimportant ones, or predict sentiment and stock market shifts. As an example of such an application, we extracted news events for a variety of topics and then correlated these data with the corresponding sentiment time series, revealing the connection between sentiment shifts and event dynamics.
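The convolution model can be sketched directly: an impulse train of event importances convolved with an assumed media response function (an exponential decay here; the paper fits response shapes per medium and event type) yields the publication-volume curve.

```python
# Assumed response function: attention peaks at the event and halves each step.
decay = [0.5 ** t for t in range(6)]

events = [0.0] * 30                  # importance impulse train over 30 time steps
events[5], events[18] = 3.0, 1.5     # two hypothetical events of given importance

volume = [0.0] * len(events)         # resulting publication volume
for t, importance in enumerate(events):
    if importance:
        for d, r in enumerate(decay):
            if t + d < len(volume):
                volume[t + d] += importance * r   # spread the impulse forward in time
```

Deconvolving an observed volume curve against a fitted response function then recovers event times and importances, which is how such a framework detects events and their duration.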
{"title":"Dynamics of news events and social media reaction","authors":"Mikalai Tsytsarau, Themis Palpanas, M. Castellanos","doi":"10.1145/2623330.2623670","DOIUrl":"https://doi.org/10.1145/2623330.2623670","url":null,"abstract":"The analysis of social sentiment expressed on the Web is becoming increasingly relevant to a variety of applications, and it is important to understand the underlying mechanisms which drive the evolution of sentiments in one way or another, in order to be able to predict these changes in the future. In this paper, we study the dynamics of news events and their relation to changes of sentiment expressed on relevant topics. We propose a novel framework, which models the behavior of news and social media in response to events as a convolution between event's importance and media response function, specific to media and event type. This framework is suitable for detecting time and duration of events, as well as their impact and dynamics, from time series of publication volume. These data can greatly enhance events analysis; for instance, they can help distinguish important events from unimportant, or predict sentiment and stock market shifts. As an example of such application, we extracted news events for a variety of topics and then correlated this data with the corresponding sentiment time series, revealing the connection between sentiment shifts and event dynamics.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"82898967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 68
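The convolution model described in the abstract above — publication volume as an event's importance convolved with a media response function — can be sketched directly. The exponential kernel and both function names below are assumptions for illustration, not the paper's fitted response functions:

```python
import numpy as np

def media_response(length, decay=0.5):
    # Hypothetical exponentially decaying response kernel, normalized to sum 1,
    # standing in for the media- and event-type-specific response function.
    k = np.exp(-decay * np.arange(length))
    return k / k.sum()

def publication_volume(importance, kernel):
    # Observed volume = convolution of event-importance impulses with the
    # response kernel, truncated back to the original time horizon.
    return np.convolve(importance, kernel)[: len(importance)]
```

For a single impulse (one event of importance 10 on day 3), the resulting volume series peaks on the event day and decays afterward; recovering event times and durations from observed volume would amount to deconvolving with the kernel.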
Community detection in graphs through correlation
Lian Duan, W. Street, Yanchi Liu, Haibing Lu
Community detection is an important task for social networks, which helps us understand the functional modules on the whole network. Among different community detection methods based on graph structures, modularity-based methods are very popular recently, but suffer a well-known resolution limit problem. This paper connects modularity-based methods with correlation analysis by subtly reformatting their math formulas and investigates how to fully make use of correlation analysis to change the objective function of modularity-based methods, which provides a more natural and effective way to solve the resolution limit problem. In addition, a novel theoretical analysis on the upper bound of different objective functions helps us understand their bias to different community sizes, and experiments are conducted on both real life and simulated data to validate our findings.
{"title":"Community detection in graphs through correlation","authors":"Lian Duan, W. Street, Yanchi Liu, Haibing Lu","doi":"10.1145/2623330.2623629","DOIUrl":"https://doi.org/10.1145/2623330.2623629","url":null,"abstract":"Community detection is an important task for social networks, which helps us understand the functional modules on the whole network. Among different community detection methods based on graph structures, modularity-based methods are very popular recently, but suffer a well-known resolution limit problem. This paper connects modularity-based methods with correlation analysis by subtly reformatting their math formulas and investigates how to fully make use of correlation analysis to change the objective function of modularity-based methods, which provides a more natural and effective way to solve the resolution limit problem. In addition, a novel theoretical analysis on the upper bound of different objective functions helps us understand their bias to different community sizes, and experiments are conducted on both real life and simulated data to validate our findings.","PeriodicalId":20536,"journal":{"name":"Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2014-08-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"89057154","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 61
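For reference, the modularity objective that the abstract above builds on (and reformulates via correlation) can be computed directly from an adjacency matrix. This is the standard Newman-Girvan formula, not the paper's correlation-based variant:

```python
import numpy as np

def modularity(adj, labels):
    # Q = (1 / 2m) * sum_ij [A_ij - k_i * k_j / (2m)] * delta(c_i, c_j)
    # for a symmetric adjacency matrix `adj` and community labels `labels`.
    m2 = adj.sum()                      # 2m: total degree of the graph
    k = adj.sum(axis=1)                 # node degrees
    same = np.equal.outer(labels, labels)  # delta(c_i, c_j)
    return ((adj - np.outer(k, k) / m2) * same).sum() / m2
```

Putting every node into one community gives Q = 0 by construction, while a partition that matches dense subgraphs scores positively; the resolution limit mentioned in the abstract arises because the null term k_i * k_j / (2m) shrinks as the graph grows, letting small but genuine communities merge.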
Journal
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and data mining