
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining: latest papers

Approximate graph mining with label costs
Pranay Anchuri, Mohammed J. Zaki, Omer Barkol, Shahar Golan, Moshe Shamy
Many real-world graphs have complex labels on the nodes and edges. Mining only exact patterns yields limited insights, since it may be hard to find exact matches. However, in many domains it is relatively easy to define a cost (or distance) between different labels. Using this information, it becomes possible to mine a much richer set of approximate subgraph patterns, which preserve the topology but allow bounded label mismatches. We present novel and scalable methods to efficiently solve the approximate isomorphism problem. We show that approximate mining yields interesting patterns in several real-world graphs ranging from IT and protein interaction networks to protein structures.
DOI: https://doi.org/10.1145/2487575.2487602
Published: 2013-08-11
Citations: 30
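The idea in the abstract above (keep the topology, allow bounded label mismatches) can be illustrated with a small brute-force check. This is only a sketch under stated assumptions, not the scalable method the paper presents: the graph encoding, the `cost` function, the `LABEL_COST` table and the `budget` parameter are all hypothetical names chosen for illustration.

```python
from itertools import permutations

def approx_matches(pattern_labels, pattern_edges, graph_labels, graph_edges, cost, budget):
    """Yield (mapping, total_cost) for every placement of the pattern in the
    graph that preserves all pattern edges and keeps the summed label
    mismatch cost within `budget`. Brute force, for illustration only."""
    p_nodes = list(pattern_labels)
    for image in permutations(graph_labels, len(p_nodes)):
        mapping = dict(zip(p_nodes, image))
        # Topology is matched exactly: each pattern edge must land on a graph edge.
        if any(frozenset((mapping[u], mapping[v])) not in graph_edges
               for u, v in pattern_edges):
            continue
        total = sum(cost(pattern_labels[u], graph_labels[mapping[u]]) for u in p_nodes)
        if total <= budget:
            yield mapping, total

# Hypothetical label costs: a router and a switch are "close", everything else is far.
LABEL_COST = {frozenset(("router", "switch")): 0.5}

def cost(x, y):
    return 0.0 if x == y else LABEL_COST.get(frozenset((x, y)), 2.0)

pattern_labels = {"a": "router", "b": "server"}
pattern_edges = [("a", "b")]
graph_labels = {1: "router", 2: "switch", 3: "server"}
graph_edges = {frozenset((1, 3)), frozenset((2, 3))}

matches = list(approx_matches(pattern_labels, pattern_edges,
                              graph_labels, graph_edges, cost, budget=0.5))
# One exact match (cost 0.0) and one approximate match (cost 0.5).
```

With `budget=0` only the exact match survives, which is the "limited insights" case the abstract contrasts against.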
Mining discriminative subgraphs from global-state networks
Sayan Ranu, Minh X. Hoang, Ambuj K. Singh
Global-state networks provide a powerful mechanism to model the increasing heterogeneity in data generated by current systems. Such a network comprises a series of network snapshots with dynamic local states at nodes, and a global network state indicating the occurrence of an event. Mining discriminative subgraphs from global-state networks allows us to identify the influential sub-networks that have maximum impact on the global state and unearth the complex relationships between the local entities of a network and their collective behavior. In this paper, we explore this problem and design a technique called MINDS to mine minimally discriminative subgraphs from large global-state networks. To combat the exponential subgraph search space, we derive the concept of an edit map and perform Metropolis-Hastings sampling on it to compute the answer set. Furthermore, we formulate the idea of network-constrained decision trees to learn prediction models that adhere to the underlying network structure. Extensive experiments on real datasets demonstrate excellent accuracy in terms of prediction quality. Additionally, MINDS achieves a speed-up of at least four orders of magnitude over baseline techniques.
DOI: https://doi.org/10.1145/2487575.2487692
Published: 2013-08-11
Citations: 26
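The abstract above mentions Metropolis-Hastings sampling over a structure it calls an edit map. The sketch below shows only the generic sampler on an arbitrary undirected state graph; the `neighbors` and `weight` callables and the toy path graph are assumptions for illustration and do not reproduce the edit map or the scoring used by MINDS.

```python
import random
from collections import Counter

def metropolis_hastings(start, neighbors, weight, steps, rng):
    """Random walk whose stationary distribution is proportional to `weight`.
    Assumes `weight` is strictly positive and `neighbors` is symmetric."""
    current = start
    samples = []
    for _ in range(steps):
        options = neighbors(current)
        proposal = rng.choice(options)
        # Proposals are uniform over neighbours, so q(x -> y) = 1 / deg(x);
        # the Hastings correction multiplies the weight ratio by deg(x) / deg(y).
        ratio = (weight(proposal) * len(options)) / (
            weight(current) * len(neighbors(proposal)))
        if rng.random() < min(1.0, ratio):
            current = proposal
        samples.append(current)
    return samples

# Toy state space: a path 0 - 1 - 2 - 3 where state 3 is ten times more "discriminative".
path = {0: [1], 1: [0, 2], 2: [1, 3], 3: [2]}
weights = {0: 1.0, 1: 1.0, 2: 1.0, 3: 10.0}

samples = metropolis_hastings(0, path.__getitem__, weights.__getitem__,
                              steps=5000, rng=random.Random(0))
visits = Counter(samples)
```

The walk spends most of its time in the high-weight state without ever enumerating the whole space, which is the property that makes sampling attractive when the space of subgraphs is exponential.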
Financing lead triggers: empowering sales reps through knowledge discovery and fusion
K. Aggour, Bethany Hoogs
Sales representatives must have access to meaningful and actionable intelligence about potential customers to be effective in their roles. Historically, GE Capital Americas sales reps identified leads by manually searching through news reports and financial statements either in print or online. Here we describe a system built to automate the collection and aggregation of information on companies, which is then mined to identify actionable sales leads. The Financing Lead Triggers system is comprised of three core components that perform information fusion, knowledge discovery and information visualization. Together these components extract raw data from disparate sources, fuse that data into information, and then automatically mine that information for actionable sales leads driven by a combination of expert-defined and statistically derived triggers. A web-based interface provides sales reps access to the company information and sales leads in a single location. The use of the Lead Triggers system has significantly improved the performance of the sales reps, providing them with actionable intelligence that has improved their productivity by 30-50%. In 2010, Lead Triggers provided leads on opportunities that represented over $44B in new deal commitments for GE Capital.
DOI: https://doi.org/10.1145/2487575.2488190
Published: 2013-08-11
Citations: 5
STRIP: stream learning of influence probabilities
Konstantin Kutzkov, A. Bifet, F. Bonchi, A. Gionis
Influence-driven diffusion of information is a fundamental process in social networks. Learning the latent variables of such a process, i.e., the influence strength along each link, is a central question towards understanding the structure and function of complex networks, modeling information cascades, and developing applications such as viral marketing. Motivated by modern microblogging platforms, such as Twitter, in this paper we study the problem of learning influence probabilities in a data-stream scenario, in which the network topology is relatively stable and the challenge of a learning algorithm is to keep up with a continuous stream of tweets using a small amount of time and memory. Our contribution is a number of randomized approximation algorithms, categorized according to the available space (superlinear, linear, and sublinear in the number of nodes n) and according to different models (landmark and sliding window). Among several results, we show that we can learn influence probabilities with one pass over the data, using O(n log n) space, in both the landmark model and the sliding-window model, and we further show that our algorithm is within a logarithmic factor of optimal. For truly large graphs, when one needs to operate with sublinear space, we show that we can still learn influence probabilities in one pass, assuming that we restrict our attention to the most active users. Our thorough experimental evaluation on a large social graph demonstrates that the empirical performance of our algorithms agrees with that predicted by the theory.
DOI: https://doi.org/10.1145/2487575.2487657
Published: 2013-08-11
Citations: 50
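To make the quantity being learned in the STRIP abstract concrete, here is an exact one-pass counting baseline in the landmark model (all events since the start of the stream). It is a sketch under assumptions: the estimator "fraction of u's items that follower v later repeated" is one common definition of influence probability and is not taken from the abstract, and this version stores every (item, user) pair, so it has none of the space guarantees the paper's randomized algorithms provide. The names `stream` and `followers` are hypothetical.

```python
from collections import defaultdict

def influence_estimates(stream, followers):
    """stream: iterable of (user, item) events in time order.
    followers: dict mapping a user to the set of users who follow them.
    Returns {(u, v): fraction of u's items that v posted after u}."""
    posted = defaultdict(set)      # item -> users who have already posted it
    actions = defaultdict(int)     # u -> number of distinct items u posted
    propagated = defaultdict(int)  # (u, v) -> items v posted after u did
    for user, item in stream:
        if user in posted[item]:
            continue  # ignore repeated posts of the same item by the same user
        for earlier in posted[item]:
            if user in followers.get(earlier, ()):
                propagated[(earlier, user)] += 1
        posted[item].add(user)
        actions[user] += 1
    return {edge: count / actions[edge[0]] for edge, count in propagated.items()}

followers = {"a": {"b", "c"}}
stream = [("a", "x"), ("b", "x"), ("a", "y")]
estimates = influence_estimates(stream, followers)
# "b" repeated one of the two items posted by "a".
```

A sliding-window variant would additionally expire events older than the window before counting; the paper's contribution is doing this kind of estimation approximately in far less memory.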
A phrase mining framework for recursive construction of a topical hierarchy
Chi Wang, Marina Danilevsky, Nihit Desai, Yinan Zhang, Phuong Nguyen, T. Taula, Jiawei Han
A high quality hierarchical organization of the concepts in a dataset at different levels of granularity has many valuable applications such as search, summarization, and content browsing. In this paper we propose an algorithm for recursively constructing a hierarchy of topics from a collection of content-representative documents. We characterize each topic in the hierarchy by an integrated ranked list of mixed-length phrases. Our mining framework is based on a phrase-centric view for clustering, extracting, and ranking topical phrases. Experiments with datasets from three different domains illustrate our ability to generate hierarchies of high quality topics represented by meaningful phrases.
DOI: https://doi.org/10.1145/2487575.2487631
Published: 2013-08-11
Citations: 91
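The "ranked list of mixed-length phrases" in the abstract above can be pictured with a very small frequency-based sketch. It only counts contiguous word sequences and ranks them by raw support; the clustering and topical ranking steps of the paper's framework are not shown, and `max_len` and `min_support` are hypothetical parameters.

```python
from collections import Counter

def frequent_phrases(docs, max_len=3, min_support=2):
    """Count every contiguous phrase of 1..max_len words and return the
    phrases with at least `min_support` occurrences, most frequent first."""
    counts = Counter()
    for doc in docs:
        tokens = doc.lower().split()
        for n in range(1, max_len + 1):
            for i in range(len(tokens) - n + 1):
                counts[" ".join(tokens[i:i + n])] += 1
    return [(phrase, c) for phrase, c in counts.most_common() if c >= min_support]

docs = [
    "support vector machine",
    "support vector machine learning",
    "vector space",
]
ranked = frequent_phrases(docs)
# "vector" occurs three times; "support vector machine" survives with support 2.
```

Ranking by raw count favours short phrases, which is one reason a topical ranking function over mixed lengths is needed in practice.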
AMETHYST: a system for mining and exploring topical hierarchies of heterogeneous data
Marina Danilevsky, Chi Wang, Fangbo Tao, Son Nguyen, Gong Chen, Nihit Desai, Lidan Wang, Jiawei Han
In this demo we present AMETHYST, a system for exploring and analyzing a topical hierarchy constructed from a heterogeneous information network (HIN). HINs, composed of multiple types of entities and links, are very common in the real world. Many have a text component, and thus can benefit from a high quality hierarchical organization of the topics in the network dataset. By organizing the topics into a hierarchy, AMETHYST helps understand search results in the context of an ontology, and explain entity relatedness at different granularities. The automatically constructed topical hierarchy reflects a domain-specific ontology, interacts with multiple types of linked entities, and can be tailored for both free text and OLAP queries.
DOI: https://doi.org/10.1145/2487575.2487716
Published: 2013-08-11
Citations: 9
Network discovery via constrained tensor analysis of fMRI data
I. Davidson, Sean Gilpin, Owen Carmichael, Peter B. Walker
We pose the problem of network discovery which involves simplifying spatio-temporal data into cohesive regions (nodes) and relationships between those regions (edges). Such problems naturally exist in fMRI scans of human subjects. These scans consist of activations of thousands of voxels over time with the aim to simplify them into the underlying cognitive network being used. We propose supervised and semi-supervised variations of this problem and postulate a constrained tensor decomposition formulation and a corresponding alternating least squares solver that is easy to implement. We show this formulation works well in controlled experiments where supervision is incomplete, superfluous and noisy and is able to recover the underlying ground truth network. We then show that for real fMRI data our approach can reproduce well-known results in neurology regarding the default mode network in resting-state healthy and Alzheimer's-affected individuals. Finally, we show that the reconstruction error of the decomposition provides a useful measure of the network strength and is useful at predicting key cognitive scores both by itself and with clinical information.
DOI: https://doi.org/10.1145/2487575.2487619
Published: 2013-08-11
Citations: 91
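For readers unfamiliar with alternating least squares on tensors, the following shows the standard unconstrained form. It assumes a rank-R CP (PARAFAC) model of a three-way tensor, which the abstract above does not specify, and it omits whatever constraints the paper derives from supervision; the symbols A, B, C and R are introduced here for illustration only.

```latex
% Assumed model: rank-R CP decomposition of a three-way tensor \mathcal{X}
\min_{A,B,C}\; \Bigl\| \mathcal{X} - \sum_{r=1}^{R} a_r \circ b_r \circ c_r \Bigr\|_F^2

% ALS update for A with B and C held fixed
% (\odot: Khatri-Rao product, \ast: elementwise product, \dagger: pseudo-inverse)
A \leftarrow X_{(1)} \,(C \odot B)\, \bigl[(C^\top C) \ast (B^\top B)\bigr]^{\dagger}
```

The updates for B and C follow by symmetry, and the value of the objective after convergence is the reconstruction error that the abstract relates to network strength.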
Accurate intelligible models with pairwise interactions
Yin Lou, R. Caruana, J. Gehrke, G. Hooker
Standard generalized additive models (GAMs) usually model the dependent variable as a sum of univariate models. Although previous studies have shown that standard GAMs can be interpreted by users, their accuracy is significantly lower than that of more complex models that permit interactions. In this paper, we suggest adding selected terms of interacting pairs of features to standard GAMs. The resulting models, which we call GA2M-models, for Generalized Additive Models plus Interactions, consist of univariate terms and a small number of pairwise interaction terms. Since these models only include one- and two-dimensional components, the components of GA2M-models can be visualized and interpreted by users. To explore the huge (quadratic) number of pairs of features, we develop a novel, computationally efficient method called FAST for ranking all possible pairs of features as candidates for inclusion into the model. In a large-scale empirical study, we show the effectiveness of FAST in ranking candidate pairs of features. In addition, we show the surprising result that GA2M-models have almost the same performance as the best full-complexity models on a number of real datasets. Thus this paper postulates that for many problems, GA2M-models can yield models that are both intelligible and accurate.
DOI: https://doi.org/10.1145/2487575.2487579
Published: 2013-08-11
Citations: 406
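The model family described in the abstract above can be written out as follows. The link function g, the intercept \beta_0, and the symbol S for the selected set of feature pairs are notation assumed here for illustration; the abstract itself states only that the model is a sum of univariate terms plus a small number of pairwise terms.

```latex
% Standard GAM: univariate shape functions only
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + \sum_{i} f_i(x_i)

% GA2M: add pairwise terms for a small selected set S of feature pairs
g\bigl(\mathbb{E}[y]\bigr) = \beta_0 + \sum_{i} f_i(x_i) + \sum_{(i,j) \in S} f_{ij}(x_i, x_j)
```

Each f_i can be plotted as a curve and each f_{ij} as a heat map, which is why the model stays interpretable as long as S is small.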
Analysis of advanced meter infrastructure data of water consumption in apartment buildings
Einat Kermany, Hanna Mazzawi, Dorit Baras, Y. Naveh, Hagai Michaelis
We present our experience of using machine learning techniques over data originating from advanced meter infrastructure (AMI) systems for water consumption in a medium-size city. We focus on two new use cases that are of special importance to city authorities. One use case is the automatic identification of malfunctioning meters, with a focus on distinguishing them from legitimate non-consumption such as during periods when the household residents are on vacation. The other use case is the identification of leaks or theft in the unmetered common areas of apartment buildings. These two use cases are highly important to city authorities both because of the lost revenue they imply and because of the hassle to the residents in cases of delayed identification. Both cases are inherently complex to analyze and require advanced data mining techniques in order to achieve high levels of correct identification. Our results provide for faster and more accurate detection of malfunctioning meters as well as leaks in the common areas. This results in significant tangible value to the authorities in terms of increase in technician efficiency and a decrease in the amount of wasted, non-revenue, water.
我们介绍了我们使用机器学习技术的经验,这些数据来自于一个中型城市的先进计量基础设施(AMI)系统。我们重点关注两个对市政当局特别重要的新用例。一个用例是自动识别故障仪表,重点是将它们与合法的非消费(例如在家庭居民度假期间)区分开来。另一个用例是在公寓楼的非计量公共区域识别泄漏或盗窃。这两个用例对市政当局来说非常重要,因为它们意味着收入损失,也因为在身份识别延迟的情况下给居民带来麻烦。这两种情况分析起来都很复杂,需要先进的数据挖掘技术才能实现高水平的正确识别。我们的结果提供了更快,更准确的检测故障仪表以及泄漏在公共区域。这给当局带来了显著的有形价值,提高了技术人员的效率,减少了浪费的、无收益的水。
{"title":"Analysis of advanced meter infrastructure data of water consumption in apartment buildings","authors":"Einat Kermany, Hanna Mazzawi, Dorit Baras, Y. Naveh, Hagai Michaelis","doi":"10.1145/2487575.2488193","DOIUrl":"https://doi.org/10.1145/2487575.2488193","url":null,"abstract":"We present our experience of using machine learning techniques over data originating from advanced meter infrastructure (AMI) systems for water consumption in a medium-size city. We focus on two new use cases that are of special importance to city authorities. One use case is the automatic identification of malfunctioning meters, with a focus on distinguishing them from legitimate non-consumption such as during periods when the household residents are on vacation. The other use case is the identification of leaks or theft in the unmetered common areas of apartment buildings. These two use cases are highly important to city authorities both because of the lost revenue they imply and because of the hassle to the residents in cases of delayed identification. Both cases are inherently complex to analyze and require advanced data mining techniques in order to achieve high levels of correct identification. Our results provide for faster and more accurate detection of malfunctioning meters as well as leaks in the common areas. 
This results in significant tangible value to the authorities in terms of increase in technician efficiency and a decrease in the amount of wasted, non-revenue, water.","PeriodicalId":20472,"journal":{"name":"Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining","volume":"114 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2013-08-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"86780449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 12
Measuring spontaneous devaluations in user preferences
Komal Kapoor, Nisheeth Srivastava, J. Srivastava, P. Schrater
Spontaneous devaluation in preferences is ubiquitous: yesterday's hit is today's affliction. Despite technological advances facilitating access to a wide range of media commodities, finding engaging content remains a major undertaking with few principled solutions. Systems that track spontaneous devaluation in user preferences can predict the onset of boredom in users and potentially cater to their changed needs. In this work, we study the music listening histories of Last.fm users, focusing on how their preferences change based on their choices of different artists at different points in time. A hazard function, commonly used in statistics for survival analysis, captures the rate at which a user returns to an artist as a function of exposure to the artist. The analysis provides the first evidence of spontaneous devaluation in the preferences of music listeners. A better understanding of the temporal dynamics of this phenomenon can inform solutions to the similarity-diversity dilemma of recommender systems.
DOI: https://doi.org/10.1145/2487575.2487679 (published 2013-08-11)
Citations: 15
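The abstract of "Measuring spontaneous devaluations in user preferences" refers to a hazard function from survival analysis. As a rough illustration of that concept only, the sketch below computes an empirical discrete-time hazard, h(k) = P(T = k | T >= k), from a list of durations. It is not the paper's method: the function name `empirical_hazard`, the data, and the reading of T as "number of plays before a user stops returning to an artist" are all assumptions made for this example, and censoring is ignored.

```python
from collections import Counter


def empirical_hazard(durations):
    """Estimate the discrete-time hazard h(k) = P(T = k | T >= k).

    `durations` holds one observed value of T per subject.
    Censoring is ignored (every subject is assumed fully observed).
    """
    counts = Counter(durations)
    at_risk = len(durations)  # subjects with T >= current k
    hazard = {}
    for k in sorted(counts):
        hazard[k] = counts[k] / at_risk  # fraction of at-risk subjects ending at k
        at_risk -= counts[k]             # those subjects leave the risk set
    return hazard


# Hypothetical data: plays of one artist before each user stopped returning.
plays_before_dropout = [1, 1, 2, 3, 3, 3, 5, 8]
hazard = empirical_hazard(plays_before_dropout)

for k, h in hazard.items():
    print(f"exposure={k}: hazard={h:.3f}")
```

A hazard that rises with exposure in such data would be consistent with users tiring of an artist the more they listen; real listening histories would also need censoring handled, for example with a Kaplan-Meier style estimator.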
Journal
Proceedings of the 19th ACM SIGKDD international conference on Knowledge discovery and data mining