Jilin Chen, Jun Yan, Benyu Zhang, Qiang Yang, Zheng Chen
We propose a novel algorithm for extracting diverse topic phrases in order to summarize large corpora. Previous work often ignores the importance of diversity and thus extracts phrases crowded around a few hot topics while failing to cover other less obvious but important topics. We solve this problem through document re-weighting and phrase diversification using latent semantic analysis (LSA). Experiments on various datasets show that our new algorithm improves both relevance and diversity across topics for topic phrase extraction.
{"title":"Diverse Topic Phrase Extraction through Latent Semantic Analysis","authors":"Jilin Chen, Jun Yan, Benyu Zhang, Qiang Yang, Zheng Chen","doi":"10.1109/ICDM.2006.61","DOIUrl":"https://doi.org/10.1109/ICDM.2006.61","url":null,"abstract":"We propose a novel algorithm for extracting diverse topic phrases in order to provide summary for large corpora. Previous works often ignore the importance of diversity and thus extract phrases crowded on some hot topics while failing to cover other less obvious but important topics. We solve this problem through document re-weighting and phrase diversification by using latent semantic analysis (LSA). Experiments on various datasets show that our new algorithm can improve relevance as well as diversity over different topics for topic phrase extraction problems.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"23 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116645719","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
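As a rough illustration of the diversification idea this abstract describes (not the paper's exact algorithm), phrases can be projected into an LSA concept space and then selected greedily with an MMR-style trade-off between salience and similarity to already-chosen phrases. The toy matrix, phrase names, and the lambda weight below are all invented for the sketch:

```python
import numpy as np

# Toy phrase-document co-occurrence matrix: rows = candidate phrases,
# columns = documents (counts are illustrative only).
X = np.array([
    [3., 2., 0., 0.],   # "data mining"
    [2., 3., 0., 0.],   # "pattern discovery" (near-duplicate topic)
    [0., 0., 3., 2.],   # "gene expression"
    [1., 0., 0., 3.],   # "text clustering"
])
phrases = ["data mining", "pattern discovery", "gene expression", "text clustering"]

# Latent semantic analysis: project phrases into a low-rank concept space via SVD.
U, s, Vt = np.linalg.svd(X, full_matrices=False)
k = 2
Z = U[:, :k] * s[:k]            # phrase vectors in the 2-d latent space

def cos(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

# Greedy diversified selection: trade off relevance (latent-space norm,
# a crude salience proxy) against similarity to phrases already picked.
relevance = np.linalg.norm(Z, axis=1)
lam = 0.5
selected = []
while len(selected) < 3:
    best, best_score = None, -np.inf
    for i in range(len(phrases)):
        if i in selected:
            continue
        max_sim = max((cos(Z[i], Z[j]) for j in selected), default=0.0)
        score = lam * relevance[i] - (1 - lam) * max_sim
        if score > best_score:
            best, best_score = i, score
    selected.append(best)

diverse_phrases = [phrases[i] for i in selected]
print(diverse_phrases)
```

With the duplicate-penalty term active, the two near-synonymous phrases are never both chosen early, which is the behavior the abstract contrasts with diversity-unaware extraction.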
Sequential pattern mining discovers temporal relationships between items in a database; the patterns can then be used to generate association rules. When databases are very large, the execution speed and memory usage of the mining algorithm become critical. Previous research has focused on one of these two parameters or the other. In this paper, we present bitSPADE, a novel algorithm that combines the best features of SPAM, one of the fastest algorithms, and SPADE, one of the most memory-efficient algorithms. Moreover, we introduce a new pruning strategy that enables bitSPADE to reach high performance. Experimental evaluations show that bitSPADE achieves an efficient trade-off between speed and memory usage, outperforming SPADE in both speed and memory usage by factors of more than 3.4, and SPAM in memory consumption by up to more than an order of magnitude.
{"title":"bitSPADE: A Lattice-based Sequential Pattern Mining Algorithm Using Bitmap Representation","authors":"S. Aseervatham, A. Osmani, E. Viennet","doi":"10.1109/ICDM.2006.28","DOIUrl":"https://doi.org/10.1109/ICDM.2006.28","url":null,"abstract":"Sequential pattern mining allows to discover temporal relationship between items within a database. The patterns can then be used to generate association rules. When the databases are very large, the execution speed and the memory usage of the mining algorithm become critical parameters. Previous research has focused on either one of the two parameters. In this paper, we present bitSPADE, a novel algorithm that combines the best features of SPAM, one of the fastest algorithm, and SPADE, one of the most memory efficient algorithm. Moreover, we introduce a new pruning strategy that enables bitSPADE to reach high performances. Experimental evaluations showed that bitSPADE ensures an efficient tradeoff between speed and memory usage by outperforming SPADE by both speed and memory usage factors more than 3.4 and SPAM by a memory consumption factor up to more than an order of magnitude.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121847359","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
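The bitmap representation inherited from SPAM can be sketched in a few lines: each item gets one bitmap per customer sequence, and counting the support of a two-step pattern is a bit-transform plus an AND. The sequences below are invented, and this shows only the representation, not bitSPADE's lattice traversal or pruning:

```python
# Each customer sequence is a list of itemsets; for every item we build
# one bitmap per sequence, bit i set when the item occurs in itemset i.
sequences = [
    [{"a"}, {"b"}, {"a", "c"}],
    [{"b"}, {"a"}],
    [{"a"}, {"c"}, {"b"}],
]

def bitmap(item):
    return [sum(1 << i for i, itemset in enumerate(seq) if item in itemset)
            for seq in sequences]

def s_step(bm):
    # Sequence-extension transform: per sequence, set every bit strictly
    # after the first occurrence (b & -b isolates the lowest set bit).
    out = []
    for b in bm:
        if b == 0:
            out.append(0)
        else:
            first = b & -b
            out.append(~(2 * first - 1) & ((1 << 32) - 1))
    return out

def support(bm):
    # A sequence supports the pattern iff its bitmap is nonzero.
    return sum(1 for b in bm if b != 0)

bm_a, bm_b = bitmap("a"), bitmap("b")
# Support of the sequential pattern <a -> b>: 'b' must occur after 'a'.
bm_ab = [x & y for x, y in zip(s_step(bm_a), bm_b)]
print(support(bm_ab))
```

Here sequences 1 and 3 contain "b" after an "a", while sequence 2 has "b" only before "a", so the pattern's support is 2.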
We describe a part-based object-recognition framework, specialized to mining complex 3D objects from detailed 3D images. Objects are modeled as a collection of parts together with a pairwise potential function. An efficient inference algorithm, based on belief propagation (BP), finds the optimal layout of parts given an input image. We introduce AggBP, a message-aggregation scheme for BP in which groups of messages are approximated by a single message. For objects consisting of N parts, we reduce CPU time and memory requirements from O(N²) to O(N). We apply AggBP to synthetic data as well as a real-world task: identifying protein fragments in three-dimensional images. These experiments show that our improvements cause minimal loss in accuracy while requiring significantly less time.
{"title":"Belief Propagation in Large, Highly Connected Graphs for 3D Part-Based Object Recognition","authors":"F. DiMaio, J. Shavlik","doi":"10.1109/ICDM.2006.26","DOIUrl":"https://doi.org/10.1109/ICDM.2006.26","url":null,"abstract":"We describe a part-based object-recognition framework, specialized to mining complex 3D objects from detailed 3D images. Objects are modeled as a collection of parts together with a pairwise potential function. An efficient inference algorithm - based on belief propagation (BP) -finds the optimal layout of parts, given some input image. We introduce AggBP, a message aggregation scheme for BP, in which groups of messages are approximated as a single message. For objects consisting of N parts, we reduce CPU time and memory requirements from O(N2) to O(N). We apply AggBP on synthetic data as well as a real-world task identifying protein fragments in three-dimensional images. These experiments show that our improvements result in minimal loss in accuracy in significantly less time.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124130184","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
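One way to see how aggregating messages removes the O(N²) factor: in a fully connected part graph, the message part i sends to part j is a product over all incoming messages except j's. Computing all N leave-one-out products naively is quadratic, but prefix/suffix products make it linear. This is a generic BP bookkeeping trick consistent with the abstract's complexity claim, not AggBP itself:

```python
import numpy as np

rng = np.random.default_rng(0)
N, K = 6, 4                                  # 6 parts, 4-state messages
msgs = rng.uniform(0.5, 1.5, size=(N, K))    # incoming message from each part

# O(N^2) reference: explicit leave-one-out products.
naive = np.array([np.prod(np.delete(msgs, j, axis=0), axis=0) for j in range(N)])

# O(N) aggregation: prefix * suffix products (avoids division, so a
# zero-valued message entry cannot poison the result).
prefix = np.ones((N + 1, K))
suffix = np.ones((N + 1, K))
for i in range(N):
    prefix[i + 1] = prefix[i] * msgs[i]
    suffix[N - 1 - i] = suffix[N - i] * msgs[N - 1 - i]
agg = np.array([prefix[j] * suffix[j + 1] for j in range(N)])

print(np.allclose(naive, agg))
```

The aggregated version touches each message a constant number of times, which is the source of the O(N²) → O(N) reduction in both time and memory.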
We investigate four previously unexplored aspects of ensemble selection, a procedure for building ensembles of classifiers. First, we test whether adjusting model predictions to put them on a canonical scale makes the ensembles more effective. Second, we explore the performance of ensemble selection when different amounts of data are available for ensemble hillclimbing. Third, we quantify the benefit of ensemble selection's ability to optimize to arbitrary metrics. Fourth, we study the performance impact of pruning the number of models available for ensemble selection. Based on our results, we present improved ensemble selection methods that double the benefit of the original method.
{"title":"Getting the Most Out of Ensemble Selection","authors":"R. Caruana, Art Munson, Alexandru Niculescu-Mizil","doi":"10.1109/ICDM.2006.76","DOIUrl":"https://doi.org/10.1109/ICDM.2006.76","url":null,"abstract":"We investigate four previously unexplored aspects of ensemble selection, a procedure for building ensembles of classifiers. First we test whether adjusting model predictions to put them on a canonical scale makes the ensembles more effective. Second, we explore the performance of ensemble selection when different amounts of data are available for ensemble hillclimbing. Third, we quantify the benefit of ensemble selection's ability to optimize to arbitrary metrics. Fourth, we study the performance impact of pruning the number of models available for ensemble selection. Based on our results we present improved ensemble selection methods that double the benefit of the original method.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"122377498","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
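For readers unfamiliar with the base procedure being tuned here, greedy forward ensemble selection can be sketched as follows: repeatedly add (with replacement) the model whose inclusion most improves the chosen metric on a held-out hillclimbing set. The simulated models and the fixed step count are assumptions of the sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
n_models, n_points = 8, 200
y = rng.integers(0, 2, n_points)             # hillclimbing-set labels
# Simulated model probabilities: label plus noise of varying magnitude.
preds = np.clip(y + rng.normal(0, np.linspace(0.2, 1.0, n_models)[:, None],
                               (n_models, n_points)), 0, 1)

def accuracy(p):
    # The metric being hillclimbed; ensemble selection can plug in any metric.
    return float(np.mean((p > 0.5) == y))

ensemble = []                 # indices; duplicates allowed (selection w/ replacement)
running = np.zeros(n_points)  # sum of selected models' predictions
for _ in range(10):           # fixed number of greedy steps for the sketch
    scores = [accuracy((running + preds[m]) / (len(ensemble) + 1))
              for m in range(n_models)]
    best = int(np.argmax(scores))
    ensemble.append(best)
    running += preds[best]

final_acc = accuracy(running / len(ensemble))
print(len(ensemble), final_acc)
```

The paper's four questions (calibrated predictions, hillclimb-set size, metric choice, model-library pruning) are all knobs on this loop: the scale of `preds`, the size of `y`, the body of `accuracy`, and the range of `m` respectively.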
Object identification aims to identify different representations of the same object based on noisy attributes, such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for this task, almost all based on similarity functions over pairs of objects. Although such similarity functions are now typically learned from labeled training data, the structural information in the labeled data goes unused. By formulating a generic model for object identification, we show how almost any existing identification model can easily be extended to satisfy structural constraints. We therefore propose a model that, in addition to a learned similarity measure, uses structural information given as pairwise constraints to guide collective identification decisions. Empirical experiments on public and real-life data show that combining structural information with attribute-based similarity substantially increases overall performance on object identification tasks.
{"title":"Object Identification with Constraints","authors":"Steffen Rendle, L. Schmidt-Thieme","doi":"10.1109/ICDM.2006.117","DOIUrl":"https://doi.org/10.1109/ICDM.2006.117","url":null,"abstract":"Object identification aims at identifying different representations of the same object based on noisy attributes such as descriptions of the same product in different online shops or references to the same paper in different publications. Numerous solutions have been proposed for solving this task, almost all of them based on similarity functions of a pair of objects. Although today the similarity functions are learned from a set of labeled training data, the structural information given by the labeled data is not used. By formulating a generic model for object identification we show how almost any proposed identification model can easily be extended for satisfying structural constraints. Therefore we propose a model that uses structural information given as pairwise constraints to guide collective decisions about object identification in addition to a learned similarity measure. We show with empirical experiments on public and on real-life data that combining both structural information and attribute-based similarity enormously increases the overall performance for object identification tasks.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"32 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129233534","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
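The simplest structural constraint in this setting is transitivity: if record A matches B and B matches C, then A must match C. A minimal sketch of enforcing it on top of pairwise scores uses union-find; the records, scores, and threshold below are invented, and the paper's model is considerably richer:

```python
records = ["iPod nano 4GB", "Apple iPod Nano 4 GB", "iPod shuffle", "Canon EOS 350D"]
pair_scores = {                      # "learned" similarity, here hand-set
    (0, 1): 0.92, (0, 2): 0.55, (1, 2): 0.48,
    (0, 3): 0.05, (1, 3): 0.04, (2, 3): 0.03,
}
THRESHOLD = 0.8

parent = list(range(len(records)))
def find(x):
    while parent[x] != x:
        parent[x] = parent[parent[x]]    # path halving
        x = parent[x]
    return x
def union(a, b):
    parent[find(a)] = find(b)

# Merge pairs above threshold, highest score first; union-find makes the
# decision collective: once 0 and 1 are merged, every later decision
# sees them as a single object.
for (i, j), s in sorted(pair_scores.items(), key=lambda kv: -kv[1]):
    if s >= THRESHOLD:
        union(i, j)

clusters = {}
for i in range(len(records)):
    clusters.setdefault(find(i), []).append(i)
print(sorted(map(sorted, clusters.values())))
```

Here only the two iPod nano variants are merged; transitive closure guarantees the output is a proper partition rather than an inconsistent set of pairwise match decisions.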
Peng Wang, Haixun Wang, Wei Wang, Baile Shi, Philip S. Yu
An avalanche of data arriving in stream form is overstretching our ability to analyze it. In this paper, we propose a novel load shedding method that enables fast and accurate stream data classification. We transform the input data so that its class information is concentrated in a few features, and we introduce a progressive classifier that makes predictions from partial input. We exploit stream data's temporal locality for load shedding; for example, readings from a temperature sensor usually do not change dramatically over a short period of time. We first show that temporal locality in the original data is preserved by our transform; we then utilize positive and negative knowledge about the data (which is much smaller than the data itself) for classification. Both analytical and empirical analyses demonstrate the advantage of our approach.
{"title":"LOCI: Load Shedding through Class-Preserving Data Acquisition","authors":"Peng Wang, Haixun Wang, Wei Wang, Baile Shi, Philip S. Yu","doi":"10.1109/ICDM.2006.100","DOIUrl":"https://doi.org/10.1109/ICDM.2006.100","url":null,"abstract":"An avalanche of data available in the stream form is overstretching our data analyzing ability. In this paper, we propose a novel load shedding method that enables fast and accurate stream data classification. We transform input data so that its class information concentrates on a few features, and we introduce a progressive classifier that makes prediction with partial input. We take advantage of stream data's temporal locality -for example, readings from a temperature sensor usually do not change dramatically over a short period of time -for load shedding. We first show that temporal locality of the original data is preserved by our transform, then we utilize positive and negative knowledge about the data (which is of much smaller size than the data itself) for classification. We employ both analytical and empirical analysis to demonstrate the advantage of our approach.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128656062","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
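The core idea, concentrating class information in the leading features so a classifier can predict from however many features the time budget allowed, can be illustrated with a crude stand-in transform: rank features by correlation with the label and classify with nearest centroids over a prefix of that ranking. The paper's actual transform and classifier differ; everything below is an assumption of the sketch:

```python
import numpy as np

rng = np.random.default_rng(2)
n, d = 300, 10
X = rng.normal(size=(n, d))
y = (X[:, 7] + 0.1 * rng.normal(size=n) > 0).astype(int)  # only feature 7 informative

# Crude "transform": rank features by |correlation with the label|.
corr = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(d)])
order = np.argsort(-corr)

def progressive_predict(x, budget):
    """Nearest-centroid prediction using only the top-`budget` ranked features."""
    feats = order[:budget]
    c0 = X[y == 0][:, feats].mean(axis=0)
    c1 = X[y == 1][:, feats].mean(axis=0)
    return int(np.linalg.norm(x[feats] - c1) < np.linalg.norm(x[feats] - c0))

# Under heavy load we read only 1 feature; with spare capacity, all 10.
acc_1 = np.mean([progressive_predict(X[i], 1) == y[i] for i in range(n)])
acc_all = np.mean([progressive_predict(X[i], d) == y[i] for i in range(n)])
print(acc_1, acc_all)
```

Because the single informative feature is ranked first, shedding nine tenths of the input costs almost no accuracy, which is the behavior the abstract's progressive classifier is designed to achieve.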
The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge; for example, most clustering algorithms output clusters simply as sets of points. In this paper, we perform a systematic study of cluster description, which generates interpretable patterns from clusters. We introduce and analyze novel description formats that offer more expressive power, and we motivate and define novel description problems that specify different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.
{"title":"Turning Clusters into Patterns: Rectangle-Based Discriminative Data Description","authors":"Byron J. Gao, M. Ester","doi":"10.1109/ICDM.2006.163","DOIUrl":"https://doi.org/10.1109/ICDM.2006.163","url":null,"abstract":"The ultimate goal of data mining is to extract knowledge from massive data. Knowledge is ideally represented as human-comprehensible patterns from which end-users can gain intuitions and insights. Yet not all data mining methods produce such readily understandable knowledge, e.g., most clustering algorithms output sets of points as clusters. In this paper, we perform a systematic study of cluster description that generates interpretable patterns from clusters. We introduce and analyze novel description formats leading to more expressive power, motivate and define novel description problems specifying different trade-offs between interpretability and accuracy. We also present effective heuristic algorithms together with their empirical evaluations.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128742619","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
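In the simplest rectangle-based format, a cluster is described by the tightest axis-parallel bounding box over its points, rendered as a conjunctive rule. The paper studies far richer multi-rectangle formats and their interpretability/accuracy trade-offs; this sketch, with invented data and feature names, shows only the baseline idea:

```python
import numpy as np

cluster = np.array([[1.0, 4.0], [1.5, 5.0], [2.0, 4.2]])
features = ["age_scaled", "income_scaled"]

def describe(points, names):
    # Tightest single bounding rectangle, rendered as an interpretable rule.
    lo, hi = points.min(axis=0), points.max(axis=0)
    return " AND ".join(f"{lo[j]:g} <= {names[j]} <= {hi[j]:g}"
                        for j in range(points.shape[1]))

rule = describe(cluster, features)
print(rule)
```

The output is a pattern an end-user can read directly, in contrast to the raw point set; unions of several such rectangles trade rule length for tightness of fit.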
A document is often full of class-independent "general" words and short of class-specific "core" words, which makes document clustering difficult. We argue that both problems are alleviated by suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, thereby improving clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probabilities, while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of "general" words. Inspired by statistical translation language models for text retrieval, we propose a novel smoothing method, referred to as context-sensitive semantic smoothing, for document clustering. Comparative experiments on three datasets show that model-based clustering with semantic smoothing is effective in improving cluster quality.
{"title":"Semantic Smoothing for Model-based Document Clustering","authors":"Xiaodan Zhang, Xiaohua Zhou, Xiaohua Hu","doi":"10.1109/ICDM.2006.142","DOIUrl":"https://doi.org/10.1109/ICDM.2006.142","url":null,"abstract":"A document is often full of class-independent \"general\" words and short of class-specific \"core \" words, which leads to the difficulty of document clustering. We argue that both problems will be relieved after suitable smoothing of document models in agglomerative approaches and of cluster models in partitional approaches, and hence improve clustering quality. To the best of our knowledge, most model-based clustering approaches use Laplacian smoothing to prevent zero probability while most similarity-based approaches employ the heuristic TF*IDF scheme to discount the effect of \"general\" words. Inspired by a series of statistical translation language model for text retrieval, we propose in this paper a novel smoothing method referred to as context-sensitive semantic smoothing for document clustering purpose. The comparative experiment on three datasets shows that model-based clustering approaches with semantic smoothing is effective in improving cluster quality.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"373 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129081431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
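The smoothing skeleton underlying model-based clustering can be sketched with plain Jelinek-Mercer interpolation of a cluster language model with a corpus background model. The paper's context-sensitive semantic smoothing additionally translates multiword topic signatures; the toy corpus and the interpolation weight here are assumptions:

```python
import math
from collections import Counter

docs = [["stock", "market", "trade"], ["market", "price", "stock"],
        ["gene", "protein", "cell"]]
cluster_docs = docs[:2]                       # an assumed "finance" cluster

corpus = Counter(w for d in docs for w in d)
cluster = Counter(w for d in cluster_docs for w in d)
n_corpus, n_cluster = sum(corpus.values()), sum(cluster.values())
lam = 0.5

def p_smoothed(w):
    p_ml = cluster[w] / n_cluster             # maximum-likelihood cluster model
    p_bg = corpus[w] / n_corpus               # corpus background model
    return (1 - lam) * p_ml + lam * p_bg      # Jelinek-Mercer interpolation

# "gene" never occurs in the cluster, yet it gets nonzero probability, so
# the log-likelihood of any corpus document under this cluster stays finite.
loglik = sum(math.log(p_smoothed(w)) for w in ["stock", "gene"])
print(p_smoothed("gene") > 0, loglik)
```

Without smoothing, assigning a document containing "gene" to the finance cluster would score negative infinity; with it, cluster assignment degrades gracefully, and semantic smoothing sharpens this further by discounting "general" words in favor of "core" words.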
Choosing an appropriate kernel is one of the key problems in kernel-based methods. Most existing kernel selection methods require that the class labels of the training examples are known. In this paper, we propose an adaptive kernel selection method for kernel principal component analysis, which can effectively learn the kernels when the class labels of the training examples are not available. By iteratively optimizing a novel criterion, the proposed method achieves nonlinear feature extraction and unsupervised kernel learning simultaneously. Moreover, a non-iterative approximate algorithm is developed. The effectiveness of the proposed algorithms is validated on UCI datasets and the COIL-20 object recognition database.
{"title":"Adaptive Kernel Principal Component Analysis with Unsupervised Learning of Kernels","authors":"Daoqiang Zhang, Zhi-Hua Zhou, Songcan Chen","doi":"10.1109/ICDM.2006.14","DOIUrl":"https://doi.org/10.1109/ICDM.2006.14","url":null,"abstract":"Choosing an appropriate kernel is one of the key problems in kernel-based methods. Most existing kernel selection methods require that the class labels of the training examples are known. In this paper, we propose an adaptive kernel selection method for kernel principal component analysis, which can effectively learn the kernels when the class labels of the training examples are not available. By iteratively optimizing a novel criterion, the proposed method can achieve nonlinear feature extraction and unsupervised kernel learning simultaneously. Moreover, a non-iterative approximate algorithm is developed. The effectiveness of the proposed algorithms are validated on UCI datasets and the COIL-20 object recognition database.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116845047","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
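The building block the adaptive method repeatedly re-optimizes is an ordinary kernel PCA step, which can be sketched as below. The kernel-learning criterion itself is omitted; the RBF kernel and its fixed bandwidth are just assumptions standing in for whatever the learned kernel would be:

```python
import numpy as np

rng = np.random.default_rng(3)
X = rng.normal(size=(30, 2))
gamma = 0.5

# RBF kernel matrix K_ij = exp(-gamma * ||x_i - x_j||^2).
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(-1)
K = np.exp(-gamma * sq)

# Center the kernel matrix in feature space: K' = K - 1K - K1 + 1K1.
n = K.shape[0]
one = np.full((n, n), 1.0 / n)
Kc = K - one @ K - K @ one + one @ K @ one

# Nonlinear features = top eigenvectors scaled by sqrt(eigenvalue).
w, V = np.linalg.eigh(Kc)
idx = np.argsort(w)[::-1][:2]
features = V[:, idx] * np.sqrt(np.maximum(w[idx], 0))
print(features.shape)
```

Iterating between a step like this and an update of the kernel parameters against an unsupervised criterion is, in outline, how feature extraction and kernel learning can proceed simultaneously without class labels.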
Manual debugging is expensive, and its high cost has motivated extensive research on automated fault localization in both the software engineering and data mining communities. Fault localization automatically locates likely fault positions and hence assists manual debugging. A number of fault localization algorithms have been developed in recent years, and they prove effective when multiple failing and passing cases are available. However, we observe that what is more commonly encountered in practice is the two-sample debugging problem, where only one failing case and one passing case are available. This problem has been either overlooked or insufficiently tackled in previous studies. In this paper, we develop a new fault localization algorithm, named BayesDebug, which simulates manual debugging principles through a Bayesian approach. Unlike existing approaches that base fault analysis on multiple passing and failing cases, BayesDebug requires only one passing case and one failing case. We reason about why BayesDebug fits the two-sample debugging problem and why other approaches do not. Finally, an experiment with the real-world program grep-2.2 exemplifies the effectiveness of BayesDebug.
{"title":"How Bayesians Debug","authors":"Chao Liu, Zeng Lian, Jiawei Han","doi":"10.1109/ICDM.2006.83","DOIUrl":"https://doi.org/10.1109/ICDM.2006.83","url":null,"abstract":"Manual debugging is expensive. And the high cost has motivated extensive research on automated fault localization in both software engineering and data mining communities. Fault localization aims at automatically locating likely fault locations, and hence assists manual debugging. A number of fault localization algorithms have been developed in recent years, which prove effective when multiple failing and passing cases are available. However, we notice what is more commonly encountered in practice is the two-sample debugging problem, where only one failing and one passing cases are available. This problem has been either overlooked or insufficiently tackled in previous studies. In this paper, we develop a new fault localization algorithm, named BayesDebug, which simulates some manual debugging principles through a Bayesian approach. Different from existing approaches that base fault analysis on multiple passing and failing cases, BayesDebug only requires one passing and one failing cases. We reason about why BayesDebug fits the two-sample debugging problem and why other approaches do not. Finally, an experiment with a real-world program grep-2.2 is conducted, which exemplifies the effectiveness of BayesDebug.","PeriodicalId":356443,"journal":{"name":"Sixth International Conference on Data Mining (ICDM'06)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2006-12-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115666617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
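To make the two-sample setting concrete, here is a deliberately simple stand-in (not the paper's Bayesian model): given coverage from exactly one passing and one failing run, statements executed only by the failing run are the first suspects. The trace contents are invented:

```python
# Statement coverage from a single passing and a single failing run
# (identifiers are hypothetical "function:line" labels).
passing_trace = {"main:1", "parse:4", "parse:5", "emit:9"}
failing_trace = {"main:1", "parse:4", "parse:6", "parse:7", "emit:9"}

# With only two samples, set difference is the crudest possible ranking:
# failing-only statements first, shared statements lower.
suspicious = sorted(failing_trace - passing_trace)
shared = sorted(failing_trace & passing_trace)
print(suspicious)
```

Statistical fault localizers need many runs to estimate per-statement suspiciousness; with one sample of each kind those estimates degenerate, which is the gap the abstract argues a Bayesian treatment of debugging principles can fill.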