Duc-Son Pham, Budhaditya Saha, Dinh Q. Phung, S. Venkatesh
We identify and formulate a novel problem: cross-channel anomaly detection from multiple data channels. Cross-channel anomalies are common amongst the individual channel anomalies, and often portend significant events. Using spectral approaches, we propose a two-stage detection method: anomaly detection at a single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single-channel anomalies. Our mathematical analysis shows that our method is likely to reduce the false alarm rate. We demonstrate our method in two applications: document understanding with multiple text corpora, and detection of repeated anomalies in video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-the-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself to large-scale data stream analysis.
{"title":"Detection of Cross-Channel Anomalies from Multiple Data Channels","authors":"Duc-Son Pham, Budhaditya Saha, Dinh Q. Phung, S. Venkatesh","doi":"10.1109/ICDM.2011.51","DOIUrl":"https://doi.org/10.1109/ICDM.2011.51","url":null,"abstract":"We identify and formulate a novel problem: cross channel anomaly detection from multiple data channels. Cross channel anomalies are common amongst the individual channel anomalies, and are often portent of significant events. Using spectral approaches, we propose a two-stage detection method: anomaly detection at a single-channel level, followed by the detection of cross-channel anomalies from the amalgamation of single channel anomalies. Our mathematical analysis shows that our method is likely to reduce the false alarm rate. We demonstrate our method in two applications: document understanding with multiple text corpora, and detection of repeated anomalies in video surveillance. The experimental results consistently demonstrate the superior performance of our method compared with related state-of-art methods, including the one-class SVM and principal component pursuit. In addition, our framework can be deployed in a decentralized manner, lending itself for large scale data stream analysis.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"123 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115706038","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Link prediction is a challenging task due to the inherent skewness of network data. Typical link prediction methods can be categorized as either local or global. Local methods consider the link structure in the immediate neighborhood of a node pair to determine the presence or absence of a link, whereas global methods utilize information from the whole network. This paper presents a community (cluster) level link prediction method without the need to explicitly identify the communities in a network. Specifically, a variable-cost loss function is defined to address the data skewness problem. We provide a theoretical proof that shows the equivalence between maximizing the well-known modularity measure used in community detection and minimizing a special case of the proposed loss function. As a result, any link prediction method designed to optimize the loss function would result in more links being predicted within a community than between communities. We design a boosting algorithm to minimize the loss function and present an approach to scale up the algorithm by decomposing the network into smaller partitions and aggregating the weak learners constructed from each partition. Experimental results show that our proposed LinkBoost algorithm consistently performs as well as or better than many existing methods when evaluated on 4 real-world network datasets.
{"title":"LinkBoost: A Novel Cost-Sensitive Boosting Framework for Community-Level Network Link Prediction","authors":"Prakash Mandayam Comar, P. Tan, Anil K. Jain","doi":"10.1109/ICDM.2011.93","DOIUrl":"https://doi.org/10.1109/ICDM.2011.93","url":null,"abstract":"Link prediction is a challenging task due to the inherent skew ness of network data. Typical link prediction methods can be categorized as either local or global. Local methods consider the link structure in the immediate neighborhood of a node pair to determine the presence or absence of a link, whereas global methods utilize information from the whole network. This paper presents a community (cluster) level link prediction method without the need to explicitly identify the communities in a network. Specifically, a variable-cost loss function is defined to address the data skew ness problem. We provide theoretical proof that shows the equivalence between maximizing the well-known modularity measure used in community detection and minimizing a special case of the proposed loss function. As a result, any link prediction method designed to optimize the loss function would result in more links being predicted within a community than between communities. We design a boosting algorithm to minimize the loss function and present an approach to scale-up the algorithm by decomposing the network into smaller partitions and aggregating the weak learners constructed from each partition. Experimental results show that our proposed Link Boost algorithm consistently performs as good as or better than many existing methods when evaluated on 4 real-world network datasets.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"13 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115803593","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Masud Moshtaghi, C. Leckie, S. Karunasekera, J. Bezdek, S. Rajasegarar, M. Palaniswami
Wireless Sensor Networks (WSNs) provide a low-cost option for gathering spatially dense data from different environments. However, WSNs have limited energy resources that hinder the dissemination of raw data over the network to a central location. This has stimulated research into efficient data mining approaches that can exploit the restricted computational capabilities of the sensors to model their normal behavior. Given a model of normal behavior, sensors can then forward anomalous measurements to the base station. Most current data modeling approaches proposed for WSNs require a fixed offline training period and use batch training, in contrast to the streaming nature of data in these networks. In addition, they usually assume stationary environments. In this paper we present an efficient online model construction algorithm that captures the normal behavior of the system. Our model is capable of tracking changes in the data distribution of the monitored environment. We illustrate the proposed algorithm with numerical results on both real-life and simulated data sets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.
{"title":"Incremental Elliptical Boundary Estimation for Anomaly Detection in Wireless Sensor Networks","authors":"Masud Moshtaghi, C. Leckie, S. Karunasekera, J. Bezdek, S. Rajasegarar, M. Palaniswami","doi":"10.1109/ICDM.2011.80","DOIUrl":"https://doi.org/10.1109/ICDM.2011.80","url":null,"abstract":"Wireless Sensor Networks (WSNs) provide a low cost option for gathering spatially dense data from different environments. However, WSNs have limited energy resources that hinder the dissemination of the raw data over the network to a central location. This has stimulated research into efficient data mining approaches, which can exploit the restricted computational capabilities of the sensors to model their normal behavior. Having a normal model of the network, sensors can then forward anomalous measurements to the base station. Most of the current data modeling approaches proposed for WSNs require a fixed offline training period and use batch training in contrast to the real streaming nature of data in these networks. In addition they usually work in stationary environments. In this paper we present an efficient online model construction algorithm that captures the normal behavior of the system. Our model is capable of tracking changes in the data distribution in the monitored environment. We illustrate the proposed algorithm with numerical results on both real-life and simulated data sets, which demonstrate the efficiency and accuracy of our approach compared to existing methods.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"276 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114529237","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Viral marketing has attracted considerable attention in recent years due to its novel idea of leveraging the social network to propagate awareness of products. Specifically, viral marketing first targets a limited number of users (seeds) in the social network by providing incentives, and these targeted users then initiate the process of awareness spread by propagating the information to their friends via their social relationships. Extensive studies have been conducted on maximizing the awareness spread given the number of seeds. However, all of them fail to consider the common scenario of viral marketing where companies hope to use as few seeds as possible while influencing at least a certain number of users. In this paper, we propose a new problem, called J-MIN-Seed, whose objective is to minimize the number of seeds while ensuring that at least J users are influenced. Unfortunately, J-MIN-Seed is proved to be NP-hard in this work. We therefore develop a greedy algorithm that can provide error guarantees for J-MIN-Seed. Furthermore, for the problem setting where J equals the number of all users in the social network, denoted Full-Coverage, we design other efficient algorithms. Extensive experiments were conducted on real datasets to verify our algorithms.
{"title":"Minimizing Seed Set for Viral Marketing","authors":"Cheng Long, R. C. Wong","doi":"10.1109/ICDM.2011.99","DOIUrl":"https://doi.org/10.1109/ICDM.2011.99","url":null,"abstract":"Viral marketing has attracted considerable concerns in recent years due to its novel idea of leveraging the social network to propagate the awareness of products. Specifically, viral marketing is to first target a limited number of users (seeds) in the social network by providing incentives, and these targeted users would then initiate the process of awareness spread by propagating the information to their friends via their social relationships. Extensive studies have been conducted for maximizing the awareness spread given the number of seeds. However, all of them fail to consider the common scenario of viral marketing where companies hope to use as few seeds as possible yet influencing at least a certain number of users. In this paper, we propose a new problem, called J-MIN-Seed, whose objective is to minimize the number of seeds while at least J users are influenced. J-MIN-Seed, unfortunately, is proved to be NP-hard in this work. In such case, we develop a greedy algorithm that can provide error guarantees for J-MIN-Seed. Furthermore, for the problem setting where J is equal to the number of all users in the social network, denoted by Full-Coverage, we design other efficient algorithms. Extensive experiments were conducted on real datasets to verify our algorithm.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128352871","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Stephan Günnemann, Emmanuel Müller, S. Raubach, T. Seidl
In today's applications, data analysis tasks are hindered by many attributes per object as well as by faulty data with missing values. Subspace clustering tackles the challenge of many attributes by detecting clusters in any subspace projection of the data. However, missing values pose novel challenges, since objects may be part of multiple subspace clusters in different projections of the data. In this work, we propose a general fault tolerance definition enhancing subspace clustering models to handle missing values. We introduce a flexible notion of fault tolerance that adapts to the individual characteristics of subspace clusters and ensures a robust parameterization. Allowing missing values in our model increases the computational complexity of subspace clustering. Thus, we prove novel monotonicity properties for an efficient computation of fault-tolerant subspace clusters. Experiments on real and synthetic data show that our fault tolerance model yields high-quality results even in the presence of many missing values. For repeatability, we provide all datasets and executables on our website.
{"title":"Flexible Fault Tolerant Subspace Clustering for Data with Missing Values","authors":"Stephan Günnemann, Emmanuel Müller, S. Raubach, T. Seidl","doi":"10.1109/ICDM.2011.70","DOIUrl":"https://doi.org/10.1109/ICDM.2011.70","url":null,"abstract":"In today's applications, data analysis tasks are hindered by many attributes per object as well as by faulty data with missing values. Subspace clustering tackles the challenge of many attributes by cluster detection in any subspace projection of the data. However, it poses novel challenges for handling missing values of objects, which are part of multiple subspace clusters in different projections of the data. In this work, we propose a general fault tolerance definition enhancing subspace clustering models to handle missing values. We introduce a flexible notion of fault tolerance that adapts to the individual characteristics of subspace clusters and ensures a robust parameterization. Allowing missing values in our model increases the computational complexity of subspace clustering. Thus, we prove novel monotonicity properties for an efficient computation of fault tolerant subspace clusters. Experiments on real and synthetic data show that our fault tolerance model yields high quality results even in the presence of many missing values. For repeatability, we provide all datasets and executables on our website.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"74 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128821673","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
C. Lin, Q. Mei, Jiawei Han, Yunliang Jiang, Marina Danilevsky
The prevalence of Web 2.0 techniques has led to the boom of various online communities, where topics spread ubiquitously among user-generated documents. Working together with this diffusion process is the evolution of topic content, where novel content is introduced by documents that adopt the topic. Unlike explicit user behavior (e.g., buying a DVD), both the diffusion paths and the evolutionary process of a topic are implicit, making their discovery challenging. In this paper, we track the evolution of an arbitrary topic and reveal the latent diffusion paths of that topic in a social community. A novel and principled probabilistic model is proposed which casts our task as a joint inference problem that considers textual documents, social influences, and topic evolution in a unified way. Specifically, a mixture model is introduced to model the generation of text according to the diffusion and the evolution of the topic, while the whole diffusion process is regularized with user-level social influences through a Gaussian Markov Random Field. Experiments on both synthetic and real-world data show that the discovery of topic diffusion and evolution benefits from this joint inference, and the probabilistic model we propose performs significantly better than existing methods.
{"title":"The Joint Inference of Topic Diffusion and Evolution in Social Communities","authors":"C. Lin, Q. Mei, Jiawei Han, Yunliang Jiang, Marina Danilevsky","doi":"10.1109/ICDM.2011.144","DOIUrl":"https://doi.org/10.1109/ICDM.2011.144","url":null,"abstract":"The prevalence of Web 2.0 techniques has led to the boom of various online communities, where topics spread ubiquitously among user-generated documents. Working together with this diffusion process is the evolution of topic content, where novel contents are introduced by documents which adopt the topic. Unlike explicit user behavior (e.g., buying a DVD), both the diffusion paths and the evolutionary process of a topic are implicit, making their discovery challenging. In this paper, we track the evolution of an arbitrary topic and reveal the latent diffusion paths of that topic in a social community. A novel and principled probabilistic model is proposed which casts our task as an joint inference problem, which considers textual documents, social influences, and topic evolution in a unified way. Specifically, a mixture model is introduced to model the generation of text according to the diffusion and the evolution of the topic, while the whole diffusion process is regularized with user-level social influences through a Gaussian Markov Random Field. Experiments on both synthetic data and real world data show that the discovery of topic diffusion and evolution benefits from this joint inference, and the probabilistic model we propose performs significantly better than existing methods.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125616894","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recent years have witnessed a great deal of attention to tracking news memes over the web, modeling shifts in the ebb and flow of their popularity. One of the most important features of news memes is that they seldom occur repeatedly; instead, they tend to shift to different but similar memes. In this work, we consider patterns in research memes, which differ significantly from news memes and have received very little attention. One significant difference between research memes and news memes is that research memes develop cyclically, motivating the need for models of cycles of research memes. Furthermore, these cycles may reveal important patterns of evolving research, shedding light on how research progresses. In this paper, we formulate the modeling of the cycles of research memes, and propose solutions to the problem of identifying cycles and discovering patterns among these cycles. Experiments on two different domain applications indicate that our model does find meaningful patterns and our algorithms for pattern discovery are efficient for large-scale data analysis.
{"title":"How Does Research Evolve? Pattern Mining for Research Meme Cycles","authors":"Dan He, Xingquan Zhu, D. S. Parker","doi":"10.1109/ICDM.2011.76","DOIUrl":"https://doi.org/10.1109/ICDM.2011.76","url":null,"abstract":"Recent years have witnessed a great deal of attention in tracking news memes over the web, modeling shifts in the ebb and flow of their popularity. One of the most important features of news memes is that they seldom occur repeatedly, instead, they tend to shift to different but similar memes. In this work, we consider patterns in research memes, which differ significantly from news memes and have received very little attention. One significant difference between research memes and news memes lies in that research memes have cyclic development, motivating the need for models of cycles of research memes. Furthermore, these cycles may reveal important patterns of evolving research, shedding lights on how research progresses. In this paper, we formulate the modeling of the cycles of research memes, and propose solutions to the problem of identifying cycles and discovering patterns among these cycles. Experiments on two different domain applications indicate that our model does find meaningful patterns and our algorithms for pattern discovery are efficient for large scale data analysis.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"51 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122786846","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Byron C. Wallace, Kevin Small, C. Brodley, T. Trikalinos
Class imbalance (i.e., scenarios in which classes are unequally represented in the training data) occurs in many real-world learning tasks. Yet despite its practical importance, there is no established theory of class imbalance, and existing methods for handling it are therefore not well motivated. In this work, we approach the problem of imbalance from a probabilistic perspective, and from this vantage point identify dataset characteristics (such as dimensionality, sparsity, etc.) that exacerbate the problem. Motivated by this theory, we advocate the approach of bagging an ensemble of classifiers induced over balanced bootstrap training samples, arguing that this strategy will often succeed where others fail. Thus, in addition to providing a theoretical understanding of class imbalance, corroborated by our experiments on both simulated and real datasets, we provide practical guidance for the data mining practitioner working with imbalanced data.
{"title":"Class Imbalance, Redux","authors":"Byron C. Wallace, Kevin Small, C. Brodley, T. Trikalinos","doi":"10.1109/ICDM.2011.33","DOIUrl":"https://doi.org/10.1109/ICDM.2011.33","url":null,"abstract":"Class imbalance (i.e., scenarios in which classes are unequally represented in the training data) occurs in many real-world learning tasks. Yet despite its practical importance, there is no established theory of class imbalance, and existing methods for handling it are therefore not well motivated. In this work, we approach the problem of imbalance from a probabilistic perspective, and from this vantage identify dataset characteristics (such as dimensionality, sparsity, etc.) that exacerbate the problem. Motivated by this theory, we advocate the approach of bagging an ensemble of classifiers induced over balanced bootstrap training samples, arguing that this strategy will often succeed where others fail. Thus in addition to providing a theoretical understanding of class imbalance, corroborated by our experiments on both simulated and real datasets, we provide practical guidance for the data mining practitioner working with imbalanced data.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"40 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123953674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Xin Jin, W. Spangler, Ying Chen, Keke Cai, Rui Ma, Li Zhang, X. Wu, Jiawei Han
Patents are of crucial importance for businesses, because they provide legal protection for invented techniques, processes, or products. A patent can be held for up to 20 years, but substantial maintenance fees must be paid to keep it enforceable. If a patent is deemed not valuable, the owner may decide to abandon it by ceasing to pay the maintenance fees, thereby reducing cost. For large companies or organizations, making such decisions is difficult because too many patents need to be investigated. In this paper, we introduce the new patent mining problem of automatic patent maintenance prediction, and propose a systematic solution for analyzing patents to recommend patent maintenance decisions. We model the patents as a heterogeneous time-evolving information network and propose new patent features to build a model for ranked prediction of whether to maintain or abandon a patent. In addition, a network-based refinement approach is proposed to further improve the performance. We have conducted experiments on the large-scale United States Patent and Trademark Office (USPTO) database, which contains over four million granted patents. The results show that our technique achieves high performance.
{"title":"Patent Maintenance Recommendation with Patent Information Network Model","authors":"Xin Jin, W. Spangler, Ying Chen, Keke Cai, Rui Ma, Li Zhang, X. Wu, Jiawei Han","doi":"10.1109/ICDM.2011.116","DOIUrl":"https://doi.org/10.1109/ICDM.2011.116","url":null,"abstract":"Patents are of crucial importance for businesses, because they provide legal protection for the invented techniques, processes or products. A patent can be held for up to 20 years. However, large maintenance fees need to be paid to keep it enforceable. If the patent is deemed not valuable, the owner may decide to abandon it by stopping paying the maintenance fees to reduce the cost. For large companies or organizations, making such decisions is difficult because too many patents need to be investigated. In this paper, we introduce the new patent mining problem of automatic patent maintenance prediction, and propose a systematic solution to analyze patents for recommending patent maintenance decision. We model the patents as a heterogeneous time-evolving information network and propose new patent features to build model for a ranked prediction on whether to maintain or abandon a patent. In addition, a network-based refinement approach is proposed to further improve the performance. We have conducted experiments on the large scale United States Patent and Trademark Office (USPTO) database which contains over four million granted patents. The results show that our technique can achieve high performance.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"44 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115109674","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie
Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL-based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation.
{"title":"Maximum Entropy Modelling for Assessing Results on Real-Valued Data","authors":"Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie","doi":"10.1109/ICDM.2011.98","DOIUrl":"https://doi.org/10.1109/ICDM.2011.98","url":null,"abstract":"Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126330185","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}