2011 IEEE 11th International Conference on Data Mining: Latest Papers

Detection of Cross-Channel Anomalies from Multiple Data Channels
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.51
Duc-Son Pham, Budhaditya Saha, Dinh Q. Phung, S. Venkatesh
We identify and formulate a novel problem: cross-channel anomaly detection from multiple data channels. Cross-channel anomalies are common amongst the individual channel anomalies, and are often a portent of significant events. Using spectral approaches, we propose a two-stage detection method: anomaly detection at a single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single-channel anomalies. Our mathematical analysis shows that our method is likely to reduce the false alarm rate. We demonstrate our method in two applications: document understanding with multiple text corpora, and detection of repeated anomalies in video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-the-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself to large-scale data stream analysis.
Citations: 7
LinkBoost: A Novel Cost-Sensitive Boosting Framework for Community-Level Network Link Prediction
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.93
Prakash Mandayam Comar, P. Tan, Anil K. Jain
Link prediction is a challenging task due to the inherent skewness of network data. Typical link prediction methods can be categorized as either local or global. Local methods consider the link structure in the immediate neighborhood of a node pair to determine the presence or absence of a link, whereas global methods utilize information from the whole network. This paper presents a community (cluster) level link prediction method without the need to explicitly identify the communities in a network. Specifically, a variable-cost loss function is defined to address the data skewness problem. We provide a theoretical proof that shows the equivalence between maximizing the well-known modularity measure used in community detection and minimizing a special case of the proposed loss function. As a result, any link prediction method designed to optimize the loss function would result in more links being predicted within a community than between communities. We design a boosting algorithm to minimize the loss function and present an approach to scale up the algorithm by decomposing the network into smaller partitions and aggregating the weak learners constructed from each partition. Experimental results show that our proposed LinkBoost algorithm consistently performs as well as or better than many existing methods when evaluated on 4 real-world network datasets.
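The modularity measure mentioned in this abstract can be illustrated with a short sketch. The code below computes the standard Newman modularity of a given partition of an undirected, unweighted graph; it is not the LinkBoost loss function, and the function name `modularity` and its input layout are assumptions made for this example.

```python
from collections import defaultdict


def modularity(edges, community):
    """Newman modularity Q of an undirected, unweighted graph.

    edges: iterable of (u, v) pairs, each undirected edge listed once.
    community: dict mapping every node to a community label.
    (Both argument layouts are illustrative choices, not from the paper.)
    """
    m = 0
    intra = defaultdict(int)   # edges with both endpoints in community c
    degree = defaultdict(int)  # summed degree of the nodes in community c
    for u, v in edges:
        m += 1
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    if m == 0:
        raise ValueError("graph has no edges")
    # Q = sum over communities of (fraction of edges inside c)
    #     minus (expected fraction under random rewiring with fixed degrees)
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)
```

A partition that keeps most edges inside communities scores higher than one that ignores the structure, which matches the abstract's remark that optimizing the loss favors within-community links.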
Citations: 13
Incremental Elliptical Boundary Estimation for Anomaly Detection in Wireless Sensor Networks
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.80
Masud Moshtaghi, C. Leckie, S. Karunasekera, J. Bezdek, S. Rajasegarar, M. Palaniswami
Wireless Sensor Networks (WSNs) provide a low-cost option for gathering spatially dense data from different environments. However, WSNs have limited energy resources that hinder the dissemination of the raw data over the network to a central location. This has stimulated research into efficient data mining approaches, which can exploit the restricted computational capabilities of the sensors to model their normal behavior. Having a normal model of the network, sensors can then forward anomalous measurements to the base station. Most of the current data modeling approaches proposed for WSNs require a fixed offline training period and use batch training, in contrast to the real streaming nature of data in these networks. In addition, they usually work in stationary environments. In this paper we present an efficient online model construction algorithm that captures the normal behavior of the system. Our model is capable of tracking changes in the data distribution in the monitored environment. We illustrate the proposed algorithm with numerical results on both real-life and simulated data sets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.
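As a rough illustration of what an online elliptical boundary could look like, the sketch below keeps a running mean and covariance of two-dimensional readings and flags a point whose squared Mahalanobis distance exceeds a chi-square threshold. This is a generic running estimate under stated assumptions, not the algorithm from the paper: it has no mechanism for tracking distribution changes, it is restricted to two dimensions, and the class name `EllipticalBoundary2D` and the `coverage` parameter are hypothetical.

```python
import math


class EllipticalBoundary2D:
    """Running elliptical boundary for 2-D readings (illustrative only)."""

    def __init__(self, coverage=0.99):
        self.n = 0
        self.mean = [0.0, 0.0]
        self.m2 = [[0.0, 0.0], [0.0, 0.0]]  # running sum of outer products
        # Chi-square quantile with 2 degrees of freedom: -2 * ln(1 - p)
        self.threshold = -2.0 * math.log(1.0 - coverage)

    def update(self, x):
        """Fold one reading into the mean and covariance (Welford update)."""
        self.n += 1
        delta = [x[i] - self.mean[i] for i in range(2)]
        self.mean = [self.mean[i] + delta[i] / self.n for i in range(2)]
        delta_new = [x[i] - self.mean[i] for i in range(2)]
        for i in range(2):
            for j in range(2):
                self.m2[i][j] += delta[i] * delta_new[j]

    def distance_sq(self, x):
        """Squared Mahalanobis distance of x from the current model."""
        if self.n < 3:
            raise ValueError("need at least 3 readings")
        (a, b), (c, d) = [[v / (self.n - 1) for v in row] for row in self.m2]
        det = a * d - b * c
        if det <= 0:
            raise ValueError("covariance is singular")
        dx, dy = x[0] - self.mean[0], x[1] - self.mean[1]
        return (d * dx * dx - (b + c) * dx * dy + a * dy * dy) / det

    def is_anomaly(self, x):
        return self.distance_sq(x) > self.threshold
```

Each update costs constant time and memory, which is the property that makes this style of model plausible on a resource-limited sensor node.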
Citations: 41
Minimizing Seed Set for Viral Marketing
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.99
Cheng Long, R. C. Wong
Viral marketing has attracted considerable attention in recent years due to its novel idea of leveraging the social network to propagate the awareness of products. Specifically, viral marketing first targets a limited number of users (seeds) in the social network by providing incentives, and these targeted users then initiate the process of awareness spread by propagating the information to their friends via their social relationships. Extensive studies have been conducted on maximizing the awareness spread given the number of seeds. However, all of them fail to consider the common scenario of viral marketing where companies hope to use as few seeds as possible while still influencing at least a certain number of users. In this paper, we propose a new problem, called J-MIN-Seed, whose objective is to minimize the number of seeds while at least J users are influenced. Unfortunately, we prove that J-MIN-Seed is NP-hard. In this case, we develop a greedy algorithm that can provide error guarantees for J-MIN-Seed. Furthermore, for the problem setting where J is equal to the number of all users in the social network, denoted by Full-Coverage, we design other efficient algorithms. Extensive experiments were conducted on real datasets to verify our algorithm.
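The greedy idea for J-MIN-Seed can be sketched as follows: keep adding the seed that influences the most users until at least J users are covered. The abstract does not specify the diffusion model, so this example assumes, purely for illustration, that influence spreads deterministically along every directed edge; the function names `influenced` and `greedy_min_seeds` are hypothetical, and the paper's error guarantees are not claimed for this simplified version.

```python
from collections import deque


def influenced(graph, seeds):
    """Users reachable from the seeds.

    Assumption: every edge always propagates, a deterministic stand-in
    for whatever diffusion model is actually used.
    """
    seen = set(seeds)
    queue = deque(seeds)
    while queue:
        u = queue.popleft()
        for v in graph.get(u, ()):
            if v not in seen:
                seen.add(v)
                queue.append(v)
    return seen


def greedy_min_seeds(graph, nodes, J):
    """Greedily pick seeds until at least J users are influenced."""
    if J > len(nodes):
        raise ValueError("J exceeds the number of users")
    seeds, covered = [], set()
    while len(covered) < J:
        candidates = [n for n in nodes if n not in covered]
        # Choose the candidate that yields the largest influenced set;
        # ties go to the earliest candidate in `nodes`.
        best = max(candidates, key=lambda n: len(influenced(graph, seeds + [n])))
        seeds.append(best)
        covered = influenced(graph, seeds)
    return seeds, covered
```

Setting J to the number of users gives the Full-Coverage setting described in the abstract, although the paper designs separate, more efficient algorithms for that case.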
Citations: 81
Flexible Fault Tolerant Subspace Clustering for Data with Missing Values
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.70
Stephan Günnemann, Emmanuel Müller, S. Raubach, T. Seidl
In today's applications, data analysis tasks are hindered by many attributes per object as well as by faulty data with missing values. Subspace clustering tackles the challenge of many attributes by cluster detection in any subspace projection of the data. However, it poses novel challenges for handling missing values of objects, which are part of multiple subspace clusters in different projections of the data. In this work, we propose a general fault tolerance definition enhancing subspace clustering models to handle missing values. We introduce a flexible notion of fault tolerance that adapts to the individual characteristics of subspace clusters and ensures a robust parameterization. Allowing missing values in our model increases the computational complexity of subspace clustering. Thus, we prove novel monotonicity properties for an efficient computation of fault tolerant subspace clusters. Experiments on real and synthetic data show that our fault tolerance model yields high quality results even in the presence of many missing values. For repeatability, we provide all datasets and executables on our website.
Citations: 22
The Joint Inference of Topic Diffusion and Evolution in Social Communities
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.144
C. Lin, Q. Mei, Jiawei Han, Yunliang Jiang, Marina Danilevsky
The prevalence of Web 2.0 techniques has led to the boom of various online communities, where topics spread ubiquitously among user-generated documents. Working together with this diffusion process is the evolution of topic content, where novel contents are introduced by documents which adopt the topic. Unlike explicit user behavior (e.g., buying a DVD), both the diffusion paths and the evolutionary process of a topic are implicit, making their discovery challenging. In this paper, we track the evolution of an arbitrary topic and reveal the latent diffusion paths of that topic in a social community. A novel and principled probabilistic model is proposed which casts our task as a joint inference problem that considers textual documents, social influences, and topic evolution in a unified way. Specifically, a mixture model is introduced to model the generation of text according to the diffusion and the evolution of the topic, while the whole diffusion process is regularized with user-level social influences through a Gaussian Markov Random Field. Experiments on both synthetic data and real-world data show that the discovery of topic diffusion and evolution benefits from this joint inference, and the probabilistic model we propose performs significantly better than existing methods.
Citations: 74
How Does Research Evolve? Pattern Mining for Research Meme Cycles
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.76
Dan He, Xingquan Zhu, D. S. Parker
Recent years have witnessed a great deal of attention to tracking news memes over the web and modeling shifts in the ebb and flow of their popularity. One of the most important features of news memes is that they seldom occur repeatedly; instead, they tend to shift to different but similar memes. In this work, we consider patterns in research memes, which differ significantly from news memes and have received very little attention. One significant difference between research memes and news memes lies in that research memes have cyclic development, motivating the need for models of cycles of research memes. Furthermore, these cycles may reveal important patterns of evolving research, shedding light on how research progresses. In this paper, we formulate the modeling of the cycles of research memes, and propose solutions to the problem of identifying cycles and discovering patterns among these cycles. Experiments on two different domain applications indicate that our model does find meaningful patterns and our algorithms for pattern discovery are efficient for large-scale data analysis.
Citations: 5
Class Imbalance, Redux
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.33
Byron C. Wallace, Kevin Small, C. Brodley, T. Trikalinos
Class imbalance (i.e., scenarios in which classes are unequally represented in the training data) occurs in many real-world learning tasks. Yet despite its practical importance, there is no established theory of class imbalance, and existing methods for handling it are therefore not well motivated. In this work, we approach the problem of imbalance from a probabilistic perspective, and from this vantage identify dataset characteristics (such as dimensionality, sparsity, etc.) that exacerbate the problem. Motivated by this theory, we advocate the approach of bagging an ensemble of classifiers induced over balanced bootstrap training samples, arguing that this strategy will often succeed where others fail. Thus in addition to providing a theoretical understanding of class imbalance, corroborated by our experiments on both simulated and real datasets, we provide practical guidance for the data mining practitioner working with imbalanced data.
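The strategy advocated in this abstract, bagging classifiers trained on balanced bootstrap samples, can be sketched in a few lines. The base learner here is a nearest-centroid classifier chosen only to keep the example dependency-free; the abstract does not name a base learner, and all function names and the default of 11 estimators are assumptions for illustration.

```python
import random
from collections import Counter


def balanced_bootstrap(X, y, rng):
    """Sample with replacement so every class contributes as many
    rows as the rarest class has."""
    by_class = {}
    for xi, yi in zip(X, y):
        by_class.setdefault(yi, []).append(xi)
    n = min(len(rows) for rows in by_class.values())
    Xb, yb = [], []
    for label, rows in by_class.items():
        for _ in range(n):
            Xb.append(rng.choice(rows))
            yb.append(label)
    return Xb, yb


def fit_centroids(X, y):
    """Nearest-centroid base learner: one mean vector per class."""
    counts = Counter(y)
    sums = {}
    for xi, yi in zip(X, y):
        acc = sums.setdefault(yi, [0.0] * len(xi))
        for j, v in enumerate(xi):
            acc[j] += v
    return {c: [s / counts[c] for s in acc] for c, acc in sums.items()}


def predict_centroid(model, x):
    """Label of the centroid closest to x (squared Euclidean distance)."""
    return min(model, key=lambda c: sum((a - b) ** 2 for a, b in zip(model[c], x)))


def fit_balanced_bagging(X, y, n_estimators=11, seed=0):
    """Train one base learner per balanced bootstrap sample."""
    rng = random.Random(seed)
    return [fit_centroids(*balanced_bootstrap(X, y, rng)) for _ in range(n_estimators)]


def predict_bagging(models, x):
    """Majority vote over the ensemble."""
    votes = Counter(predict_centroid(m, x) for m in models)
    return votes.most_common(1)[0][0]
```

Because each base learner sees equal class counts, no single model is dominated by the majority class, while the ensemble as a whole still draws on many different majority-class examples.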
Citations: 193
Patent Maintenance Recommendation with Patent Information Network Model
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.116
Xin Jin, W. Spangler, Ying Chen, Keke Cai, Rui Ma, Li Zhang, X. Wu, Jiawei Han
Patents are of crucial importance for businesses, because they provide legal protection for the invented techniques, processes or products. A patent can be held for up to 20 years. However, large maintenance fees need to be paid to keep it enforceable. If the patent is deemed not valuable, the owner may decide to abandon it by ceasing to pay the maintenance fees, thereby reducing cost. For large companies or organizations, making such decisions is difficult because too many patents need to be investigated. In this paper, we introduce the new patent mining problem of automatic patent maintenance prediction, and propose a systematic solution for analyzing patents and recommending patent maintenance decisions. We model the patents as a heterogeneous time-evolving information network and propose new patent features to build a model for ranked prediction of whether to maintain or abandon a patent. In addition, a network-based refinement approach is proposed to further improve the performance. We have conducted experiments on the large-scale United States Patent and Trademark Office (USPTO) database, which contains over four million granted patents. The results show that our technique can achieve high performance.
Citations: 33
Maximum Entropy Modelling for Assessing Results on Real-Valued Data
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.98
Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie
Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL-based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation.
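The empirical hypothesis testing step mentioned in the abstract can be illustrated with a short sketch. Note the substitution: instead of the paper's Maximum Entropy background model, this example uses a much cruder stand-in null model that shuffles each column independently, which preserves column marginals exactly but ignores row marginals. The function names, the tile-sum statistic and the toy matrix are assumptions made for illustration only.

```python
import random


def tile_sum(data, rows, cols):
    """Test statistic: sum of the values in the sub-matrix rows x cols."""
    return sum(data[r][c] for r in rows for c in cols)


def sample_background(data, rng):
    """Stand-in null model (NOT MaxEnt): shuffle every column independently."""
    n_rows, n_cols = len(data), len(data[0])
    columns = []
    for c in range(n_cols):
        col = [data[r][c] for r in range(n_rows)]
        rng.shuffle(col)
        columns.append(col)
    # Reassemble the shuffled columns into a row-major matrix.
    return [[columns[c][r] for c in range(n_cols)] for r in range(n_rows)]


def empirical_p_value(data, rows, cols, n_samples=999, seed=0):
    """Fraction of background samples whose tile sum reaches the observed one.

    The +1 in numerator and denominator counts the observed data as one sample,
    so the estimate is never exactly zero.
    """
    rng = random.Random(seed)
    observed = tile_sum(data, rows, cols)
    hits = sum(
        tile_sum(sample_background(data, rng), rows, cols) >= observed
        for _ in range(n_samples)
    )
    return (1 + hits) / (1 + n_samples)


# Toy real-valued matrix with an unusually large 2x2 block in the top-left corner.
data = [
    [9.0, 8.5, 1.0],
    [8.75, 9.25, 0.5],
    [1.0, 0.75, 1.25],
    [0.75, 1.25, 0.75],
    [1.0, 1.0, 1.0],
    [0.5, 0.75, 1.0],
]

p = empirical_p_value(data, rows=[0, 1], cols=[0, 1])
```

A small `p` suggests that the tile is unlikely to arise from the column marginals alone. A background model that also respects row marginals, as the abstract describes, would be a stricter test, since a tile explained by a few large rows would no longer look surprising.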
Citations: 20