"Characterizing Inverse Time Dependency in Multi-class Learning" (Danqi Chen, Weizhu Chen, Qiang Yang). DOI: https://doi.org/10.1109/ICDM.2011.32

The training time of most learning algorithms increases with the size of the training data. Yet recent advances in linear binary SVM and logistic regression challenge this conventional wisdom by establishing an inverse dependency property, in which the training time decreases as the size of the training data increases. In this paper, we study the inverse dependency property for the multi-class classification problem. We describe a general framework for multi-class classification with a single objective designed to achieve inverse dependency, and extend it to three popular multi-class algorithms. We present theoretical results establishing its convergence and inverse dependency guarantees. We conduct experiments on large-scale datasets to empirically verify the inverse dependency of all three algorithms while confirming their accuracy.
{"title":"Characterizing Inverse Time Dependency in Multi-class Learning","authors":"Danqi Chen, Weizhu Chen, Qiang Yang","doi":"10.1109/ICDM.2011.32","DOIUrl":"https://doi.org/10.1109/ICDM.2011.32","url":null,"abstract":"The training time of most learning algorithms increases as the size of training data increases. Yet, recent advances in linear binary SVM and LR challenge this commonsense by proposing an inverse dependency property, where the training time decreases as the size of training data increases. In this paper, we study the inverse dependency property of multi-class classification problem. We describe a general framework for multi-class classification problem with a single objective to achieve inverse dependency and extend it to three popular multi-class algorithms. We present theoretical results demonstrating its convergence and inverse dependency guarantee. We conduct experiments to empirically verify the inverse dependency of all the three algorithms on large-scale datasets as well as to ensure the accuracy.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131112283","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Nonnegative Matrix Tri-factorization Based High-Order Co-clustering and Its Fast Implementation" (Hua Wang, F. Nie, Heng Huang, C. Ding). DOI: https://doi.org/10.1109/ICDM.2011.109

The fast growth of the Internet and modern technologies has brought data involving objects of multiple types that are related to each other, called multi-type relational data. Traditional clustering methods for single-type data rarely work well on such data, which calls for new clustering techniques, known as high-order co-clustering (HOCC), that handle all the data types at the same time. A major challenge in developing HOCC methods is how to effectively use all available information in a multi-type relational data set, including both inter-type and intra-type relationships. Meanwhile, because many real-world data sets are large, clustering methods with computationally efficient solution algorithms are of great practical interest. In this paper, we first present a general HOCC framework, named Orthogonal Nonnegative Matrix Tri-factorization (O-NMTF), for simultaneous clustering of multi-type relational data. The proposed O-NMTF approach employs Nonnegative Matrix Tri-Factorization (NMTF) to simultaneously cluster the different data types using the inter-type relationships, and incorporates intra-type information through manifold regularization; unlike existing works, we emphasize the importance of the orthogonality of the factor matrices of NMTF. Based on O-NMTF, we further develop a novel Fast Nonnegative Matrix Tri-Factorization (F-NMTF) approach for large-scale data. Instead of constraining the factor matrices of NMTF to be nonnegative as in existing methods, F-NMTF constrains them to be cluster indicator matrices, a special type of nonnegative matrix. As a result, the optimization problem decouples into subproblems of much smaller size that require far fewer matrix multiplications, so the new algorithm scales well to large real-world data. Extensive experimental evaluations demonstrate the effectiveness of our new approaches.
{"title":"Nonnegative Matrix Tri-factorization Based High-Order Co-clustering and Its Fast Implementation","authors":"Hua Wang, F. Nie, Heng Huang, C. Ding","doi":"10.1109/ICDM.2011.109","DOIUrl":"https://doi.org/10.1109/ICDM.2011.109","url":null,"abstract":"The fast growth of Internet and modern technologies has brought data involving objects of multiple types that are related to each other, called as Multi-Type Relational data. Traditional clustering methods for single-type data rarely work well on them, which calls for new clustering techniques, called as high-order co-clustering (HOCC), to deal with the multiple types of data at the same time. A major challenge in developing HOCC methods is how to effectively make use of all available information contained in a multi-type relational data set, including both inter-type and intra-type relationships. Meanwhile, because many real world data sets are often of large sizes, clustering methods with computationally efficient solution algorithms are of great practical interest. In this paper, we first present a general HOCC framework, named as Orthogonal Nonnegative Matrix Tri-factorization (O-NMTF), for simultaneous clustering of multi-type relational data. The proposed O-NMTF approach employs Nonnegative Matrix Tri-Factorization (NMTF) to simultaneously cluster different types of data using the inter-type relationships, and incorporate intra-type information through manifold regularization, where, different from existing works, we emphasize the importance of the orthogonal ties of the factor matrices of NMTF. Based on O-NMTF, we further develop a novel Fast Nonnegative Matrix Tri-Factorization (F-NMTF) approach to deal with large-scale data. Instead of constraining the factor matrices of NMTF to be nonnegative as in existing methods, F-NMTF constrains them to be cluster indicator matrices, a special type of nonnegative matrices. As a result, the optimization problem of the proposed method can be decoupled, which results in sub problems of much smaller sizes requiring much less matrix multiplications, such that our new algorithm scales well to real world data of large sizes. Extensive experimental evaluations have demonstrated the effectiveness of our new approaches.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"126 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114061627","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Helix: Unsupervised Grammar Induction for Structured Activity Recognition" (Huan-Kai Peng, Pang Wu, Jiang Zhu, J. Zhang). DOI: https://doi.org/10.1109/ICDM.2011.74

The omnipresence of mobile sensors has brought tremendous opportunities to ubiquitous computing systems. In many natural settings, however, their broader application is hindered by three main challenges: the rarity of labels, uncertainty about activity granularities, and the difficulty of multi-dimensional sensor fusion. In this paper, we propose building a grammar to address all of these challenges using a language-based approach. The proposed algorithm, called Helix, first generates an initial vocabulary from unlabeled sensor readings, then iteratively combines statistically collocated sub-activities across sensor dimensions and groups similar activities together to discover higher-level activities. Experiments on a 20-minute ping-pong game demonstrate favorable results compared to a Hierarchical Hidden Markov Model (HHMM) baseline. Closer investigation of the learned grammar also shows that it captures the natural structure of the underlying activities.
{"title":"Helix: Unsupervised Grammar Induction for Structured Activity Recognition","authors":"Huan-Kai Peng, Pang Wu, Jiang Zhu, J. Zhang","doi":"10.1109/ICDM.2011.74","DOIUrl":"https://doi.org/10.1109/ICDM.2011.74","url":null,"abstract":"The omnipresence of mobile sensors has brought tremendous opportunities to ubiquitous computing systems. In many natural settings, however, their broader applications are hindered by three main challenges: rarity of labels, uncertainty of activity granularities, and the difficulty of multi-dimensional sensor fusion. In this paper, we propose building a grammar to address all these challenges using a language-based approach. The proposed algorithm, called Helix, first generates an initial vocabulary using unlabeled sensor readings, followed by iteratively combining statistically collocated sub-activities across sensor dimensions and grouping similar activities together to discover higher level activities. The experiments using a 20-minute ping-pong game demonstrate favorable results compared to a Hierarchical Hidden Markov Model (HHMM) baseline. Closer investigations to the learned grammar also shows that the learned grammar captures the natural structure of the underlying activities.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114090014","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Mining Heavy Subgraphs in Time-Evolving Networks" (Petko Bogdanov, M. Mongiovì, Ambuj K. Singh). DOI: https://doi.org/10.1109/ICDM.2011.101

Networks from different genres are not static entities, but exhibit dynamic behavior. The congestion level of links in a transportation network varies in time with the traffic. Similarly, social and communication links are used at varying rates as information cascades unfold. In recent years there has been increasing interest in modeling and mining dynamic networks; however, limited attention has been paid to high-scoring subgraph discovery in time-evolving networks. We define the problem of finding the highest-scoring temporal subgraph in a dynamic network, termed the Heaviest Dynamic Subgraph (HDS) problem. We show that HDS is NP-hard even with edge weights in {-1,1}, and devise an efficient approach for large graph instances that evolve over long time periods. While a naive approach would enumerate all O(t^2) sub-intervals, our solution prunes the sub-interval space effectively by considering O(t*log(t)) groups of sub-intervals and computing an aggregate of each group in logarithmic time. We also define a fast heuristic and a tight upper bound for approximating the static version of HDS, and use them to further prune the sub-interval space and quickly verify candidate sub-intervals. We perform an extensive experimental evaluation of our algorithm on transportation, communication, and social media networks, discovering subgraphs that correspond to traffic congestion, communication overflow, and localized social discussions. Our method is two orders of magnitude faster than the naive approach and scales well with network size and time length.
{"title":"Mining Heavy Subgraphs in Time-Evolving Networks","authors":"Petko Bogdanov, M. Mongiovì, Ambuj K. Singh","doi":"10.1109/ICDM.2011.101","DOIUrl":"https://doi.org/10.1109/ICDM.2011.101","url":null,"abstract":"Networks from different genres are not static entities, but exhibit dynamic behavior. The congestion level of links in transportation networks varies in time depending on the traffic. Similarly, social and communication links are employed at varying rates as information cascades unfold. In recent years there has been an increase of interest in modeling and mining dynamic networks. However, limited attention has been placed in high-scoring sub graph discovery in time-evolving networks. We define the problem of finding the highest-scoring temporal sub graph in a dynamic network, termed Heaviest Dynamic Sub graph (HDS). We show that HDS is NP-hard even with edge weights in {-1,1} and devise an efficient approach for large graph instances that evolve over long time periods. While a naive approach would enumerate all O(t^2) sub-intervals, our solution performs an effective pruning of the sub-interval space by considering O(t*log(t)) groups of sub-intervals and computing an aggregate of each group in logarithmic time. We also define a fast heuristic and a tight upper bound for approximating the static version of HDS, and use them for further pruning the sub-interval space and quickly verifying candidate sub-intervals. We perform an extensive experimental evaluation of our algorithm on transportation, communication and social media networks for discovering sub graphs that correspond to traffic congestions, communication overflow and localized social discussions. Our method is two orders of magnitude faster than a naive approach and scales well with network size and time length.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"258 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116203964","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Tensor Fold-in Algorithms for Social Tagging Prediction" (Miao Zhang, C. Ding, Zhifang Liao). DOI: https://doi.org/10.1109/ICDM.2011.142

Social tagging prediction involves the co-occurrence of users, items, and tags. The tremendous growth in users requires the recommender system to produce tag recommendations for millions of users and items at any moment. Triplets of users, items, and tags are most naturally described by a 3D tensor, and tensor decomposition-based algorithms can produce high-quality recommendations. However, thousands of new users are added to the system each day, and the decompositions must be updated daily in an online fashion. In this paper, we analyze the new-user problem and present fold-in algorithms for Tucker, PARAFAC, and low-order tensor decompositions. We show that these algorithms compute the needed decompositions very efficiently. We evaluate the fold-in algorithms experimentally on several datasets, and the results demonstrate their effectiveness.
"LPTA: A Probabilistic Model for Latent Periodic Topic Analysis" (Zhijun Yin, Liangliang Cao, Jiawei Han, ChengXiang Zhai, Thomas S. Huang). DOI: https://doi.org/10.1109/ICDM.2011.96

This paper studies the problem of latent periodic topic analysis in time-stamped documents. Examples of time-stamped documents include news articles, sales records, financial reports, TV programs, and, more recently, posts on social media websites such as Flickr, Twitter, and Facebook. Unlike the detection of periodic patterns in traditional time-series databases, we discover topics with coherent semantics and periodic characteristics, where a topic is represented by a distribution over words. We propose a model called LPTA (Latent Periodic Topic Analysis) that exploits the periodicity of terms as well as term co-occurrences. To show the effectiveness of our model, we collect several representative datasets, including Seminar, DBLP, and Flickr. The results show that our model discovers latent periodic topics effectively and leverages information from both text and time.
{"title":"LPTA: A Probabilistic Model for Latent Periodic Topic Analysis","authors":"Zhijun Yin, Liangliang Cao, Jiawei Han, ChengXiang Zhai, Thomas S. Huang","doi":"10.1109/ICDM.2011.96","DOIUrl":"https://doi.org/10.1109/ICDM.2011.96","url":null,"abstract":"This paper studies the problem of latent periodic topic analysis from time stamped documents. The examples of time stamped documents include news articles, sales records, financial reports, TV programs, and more recently, posts from social media websites such as Flickr, Twitter, and Face book. Different from detecting periodic patterns in traditional time series database, we discover the topics of coherent semantics and periodic characteristics where a topic is represented by a distribution of words. We propose a model called LPTA (Latent Periodic Topic Analysis) that exploits the periodicity of the terms as well as term co-occurrences. To show the effectiveness of our model, we collect several representative datasets including Seminar, DBLP and Flickr. The results show that our model can discover the latent periodic topics effectively and leverage the information from both text and time well.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127422783","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Discovering the Intrinsic Cardinality and Dimensionality of Time Series Using MDL" (Bing Hu, T. Rakthanmanon, Yuan Hao, Scott Evans, S. Lonardi, Eamonn J. Keogh). DOI: https://doi.org/10.1007/978-3-642-44958-1_14
"Multi-task Learning for Bayesian Matrix Factorization" (Chao Yuan). DOI: https://doi.org/10.1109/ICDM.2011.107

Data sparsity is a major challenge for collaborative filtering, and the problem becomes more serious when a dataset is newly created and has even fewer ratings. By sharing knowledge among different datasets, multi-task learning is a promising technique for addressing this issue. Most prior methods directly share objects (users or items) across datasets; however, object identities and correspondences may not be known in many cases. We extend previous work on Bayesian matrix factorization with a Dirichlet process mixture into a multi-task learning approach by sharing latent parameters among tasks. Our method does not require object identities and is therefore more widely applicable. The proposed model is fully nonparametric in that the dimension of the latent feature vectors is determined automatically. Inference is performed with a variational Bayesian algorithm, which is much faster than the Gibbs sampling used by most related Bayesian methods.
{"title":"Multi-task Learning for Bayesian Matrix Factorization","authors":"Chao Yuan","doi":"10.1109/ICDM.2011.107","DOIUrl":"https://doi.org/10.1109/ICDM.2011.107","url":null,"abstract":"Data sparsity is a big challenge for collaborative filtering. This problem becomes more serious if the dataset is newly created and has even fewer ratings. By sharing knowledge among different datasets, multi-task learning is a promising technique to address this issue. Most prior work methods directly share objects (users or items) across different datasets. However, object identities and correspondences may not be known in many cases. We extend the previous work of Bayesian matrix factorization with Dirichlet process mixture into a multi-task learning approach by sharing latent parameters among different tasks. Our method does not require object identities and thus is more widely applicable. The proposed model is fully non-parametric in that the dimension of latent feature vectors is automatically determined. Inference is performed using the variational Bayesian algorithm, which is much faster than Gibbs sampling used by most other related Bayesian methods.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114233728","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Clustering with Attribute-Level Constraints" (J. Schmidt, Elisabeth Maria Brändle, Stefan Kramer). DOI: https://doi.org/10.1109/ICDM.2011.36

In many clustering applications, incorporating background knowledge in the form of constraints is desirable. In this paper, we introduce a new constraint type and the corresponding clustering problem: attribute-constrained clustering. The goal is to induce clusters of binary instances that satisfy constraints at the attribute level. These constraints specify whether instances may or may not be grouped into a cluster, depending on specific attribute values. We show how the well-established instance-level constraints, must-link and cannot-link, can be adapted to the attribute level. A variant of the k-Medoids algorithm that takes attribute-level constraints into account is evaluated on synthetic and real-world data. Experimental results show that such constraints can provide better clustering results at lower specification cost when the constraints can be expressed at the attribute level.
{"title":"Clustering with Attribute-Level Constraints","authors":"J. Schmidt, Elisabeth Maria Brändle, Stefan Kramer","doi":"10.1109/ICDM.2011.36","DOIUrl":"https://doi.org/10.1109/ICDM.2011.36","url":null,"abstract":"In many clustering applications the incorporation of background knowledge in the form of constraints is desirable. In this paper, we introduce a new constraint type and the corresponding clustering problem: attribute constrained clustering. The goal is to induce clusters of binary instances that satisfy constraints on the attribute level. These constraints specify whether instances may or may not be grouped to a cluster, depending on specific attribute values. We show how the well-established instance-level constraints, must-link and cannot-link, can be adapted to the attribute level. A variant of the k-Medoids algorithm taking into account attribute level constraints is evaluated on synthetic and real-world data. Experimental results show that such constraints may provide better clustering results at lower specification costs if constraints can be expressed on the attribute level.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128984989","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
"Finding Robust Itemsets under Subsampling" (Nikolaj Tatti, Fabian Moerchen). DOI: https://doi.org/10.1145/2656261

Mining frequent patterns is plagued by pattern explosion, making pattern reduction techniques a key challenge in pattern mining. In this paper we propose a novel theoretical framework for pattern reduction based on measuring the robustness of a property of an itemset, such as closedness or non-derivability. The robustness of a property is the probability that the property holds on random subsets of the original data. We study four properties: closed, free, non-derivable, and totally shattered itemsets, demonstrating how the robustness can be computed analytically without actually sampling the data. Our concept of robustness has many advantages: unlike statistical approaches to pattern reduction, we do not assume a null hypothesis or any noise model, and the reported patterns are simply a subset of all patterns with the given property, as opposed to approximate patterns for which the property does not truly hold. If the underlying property is monotonic, the robustness measure is also monotonic, allowing us to mine robust itemsets efficiently. We further derive a parameter-free technique for ranking itemsets that can be used in top-k approaches. Our experiments demonstrate that the robustness measure successfully reduces the number of patterns and that the ranking yields interesting itemsets.
{"title":"Finding Robust Itemsets under Subsampling","authors":"Nikolaj Tatti, Fabian Moerchen","doi":"10.1145/2656261","DOIUrl":"https://doi.org/10.1145/2656261","url":null,"abstract":"Mining frequent patterns is plagued by the problem of pattern explosion making pattern reduction techniques a key challenge in pattern mining. In this paper we propose a novel theoretical framework for pattern reduction. We do this by measuring the robustness of a property of an item set such as closed ness or non-derivability. The robustness of a property is the probability that this property holds on random subsets of the original data. We study four properties: closed, free, non-derivable and totally shattered item sets, demonstrating how we can compute the robustness analytically without actually sampling the data. Our concept of robustness has many advantages: Unlike statistical approaches for reducing patterns, we do not assume a null hypothesis or any noise model and the patterns reported are simply a subset of all patterns with this property as opposed to approximate patterns for which the property does not really hold. If the underlying property is monotonic, then the measure is also monotonic, allowing us to efficiently mine robust item sets. We further derive a parameter-free technique for ranking item sets that can be used for top-k approaches. Our experiments demonstrate that we can successfully use the robustness measure to reduce the number of patterns and that ranking yields interesting itemsets.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"36 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124922907","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}