"Online and Batch Learning of Generalized Cosine Similarities" (A. M. Qamar, Éric Gaussier; doi:10.1109/ICDM.2009.114)

In this paper, we define an online algorithm that learns generalized cosine similarity measures for kNN classification, and hence a similarity matrix A corresponding to a bilinear form. Contrary to the standard cosine measure, the normalization is itself dependent on the similarity matrix, which makes it impossible to directly apply the algorithms developed for learning Mahalanobis distances, which rely on positive semi-definite (PSD) matrices. We follow the approach of first finding an appropriate matrix and then projecting it onto the cone of PSD matrices, adapted here to the particular form of generalized cosine similarities, and more specifically to the fact that such measures are normalized. The resulting online algorithm, as well as its batch version, is fast and achieves better accuracy than state-of-the-art methods on standard data sets.
"Dirichlet Mixture Allocation for Multiclass Document Collections Modeling" (Wei Bian, D. Tao; doi:10.1109/ICDM.2009.102)

The topic model Latent Dirichlet Allocation (LDA) is an effective tool for the statistical analysis of large collections of documents. In LDA, each document is modeled as a mixture of topics, and the topic proportions are generated from a unimodal Dirichlet prior. When a collection of documents is drawn from multiple classes, this unimodal prior is insufficient for fitting the data. To solve this problem, we exploit a multimodal Dirichlet mixture prior and propose Dirichlet mixture allocation (DMA). We report experiments on the popular TDT2 corpus demonstrating that DMA models a collection of documents more precisely than LDA when the documents come from multiple classes.
"Cross-Guided Clustering: Transfer of Relevant Supervision across Domains for Improved Clustering" (Indrajit Bhattacharya, S. Godbole, Sachindra Joshi, Ashish Verma; doi:10.1109/ICDM.2009.33)

Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate whether supervision can be automatically transferred to a clustering task in a target domain, given a relevant supervised partitioning of a dataset from a different source domain. The target clustering is made more meaningful for the human user by trading off intrinsic clustering goodness on the target dataset against alignment with relevant supervised partitions in the source dataset, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with the source partitions. The alignment process uses a cross-domain similarity measure that discovers hidden relationships across domains with potentially different vocabularies. On multiple real-world datasets, we show that our approach improves clustering accuracy significantly over traditional k-means.
"Probabilistic Similarity Query on Dimension Incomplete Data" (Wei-min Cheng, Xiaoming Jin, Jian-Tao Sun; doi:10.1109/ICDM.2009.72)

Retrieving similar data has drawn many research efforts due to its importance in data mining, databases, and information retrieval. The problem is challenging when the data is incomplete. In previous research, incompleteness means that the data values of some dimensions are unknown. In many practical applications, however (e.g., data collected by a sensor network in a harsh environment), not only data values but even the dimension information itself may be missing, which renders most similarity query algorithms infeasible. In this work, we propose the novel similarity query problem on dimension incomplete data and model it within a probabilistic framework. Users specify their retrieval requirements with a distance threshold and a probability threshold: the distance threshold bounds the allowed distance between the query and a data object, and the probability threshold requires that the distance condition hold at least with the given probability. Instead of enumerating all possible ways of recovering the missing dimensions, we propose an efficient approach that speeds up retrieval by exploiting the inherent relations between the query and dimension incomplete data objects. During query processing, we estimate lower/upper bounds on the probability that a data object satisfies the query, and use these bounds to efficiently filter irrelevant objects. Furthermore, a probability triangle inequality is proposed to further speed up query processing. Experiments on real data sets verify that the proposed similarity query method is effective and efficient on dimension incomplete data.
"Efficient Discovery of Confounders in Large Data Sets" (Wenjun Zhou, Hui Xiong; doi:10.1109/ICDM.2009.77)

Given a large transaction database, association analysis is concerned with efficiently finding strongly related objects. Unlike traditional association analysis, where relationships among variables are sought at a global level, we examine confounding factors at a local level. Indeed, many real-world phenomena are localized to specific regions and times, and these relationships may not be visible when the entire data set is analyzed. In particular, confounding effects that reverse the direction of a correlation are the most significant. Along this line, we propose to efficiently find confounding effects attributable to local associations. Specifically, we derive an upper bound from a necessary condition on confounders, which helps prune the search space and identify confounders efficiently. Experimental results show that the proposed CONFOUND algorithm can effectively identify confounders, and that its computational performance is an order of magnitude faster than benchmark methods.
"Peculiarity Analysis for Classifications" (Jian Yang, Ning Zhong, Yiyu Yao, Jue Wang; doi:10.1109/ICDM.2009.31)

Peculiarity-oriented mining (POM) is a new data mining method consisting of peculiar data identification and peculiar data analysis. The peculiarity factor (PF) and the local peculiarity factor (LPF) are important concepts used to describe the peculiarity of points in the identification step; both can be studied at the attribute and record levels. In this paper, a new record LPF, the distance-based record LPF (D-record LPF), is proposed, defined as the sum of distances between a point and its nearest neighbors. We prove mathematically that the D-record LPF accurately characterizes the probability density function of a continuous m-dimensional distribution. This provides a theoretical basis for several existing distance-based anomaly detection techniques. More importantly, it also provides an effective way to describe the class-conditional probabilities in a Bayesian classifier, which enables us to apply peculiarity analysis to classification problems. A novel algorithm called the LPF-Bayes classifier and its kernelized implementation are presented, both of which are connected to the Bayesian classifier. Experimental results on several benchmark data sets demonstrate that the proposed classifiers are effective.
"flowNet: Flow-Based Approach for Efficient Analysis of Complex Biological Networks" (Young-Rae Cho, Lei Shi, A. Zhang; doi:10.1109/ICDM.2009.39)

Biological networks with complex connectivity have been studied widely in recent years. By characterizing their inherent structural behaviors from a topological perspective, these studies have attempted to discover knowledge hidden in the systems. However, even though various graph-theoretical algorithms have provided the fundamentals of network analysis, practical approaches that handle this complexity efficiently have remained limited. In this paper, we present a novel flow-based approach, called flowNet, for efficiently analyzing large, complex networks. Our approach is based on a functional influence model that quantifies the influence of one biological component on another. We introduce a dynamic flow simulation algorithm that generates a flow pattern, a characteristic unique to each component; the set of patterns can then be used to identify functional modules (i.e., for clustering). The proposed flow simulation algorithm runs very efficiently on sparse networks. Since our approach takes a weighted network as input, we also discuss supervised and unsupervised weighting schemes for unweighted biological networks. In experiments on the yeast protein interaction network, we demonstrate that our approach outperforms previous graph clustering methods in accuracy.
"Non-sparse Multiple Kernel Learning for Fisher Discriminant Analysis" (F. Yan, J. Kittler, K. Mikolajczyk, M. Tahir; doi:10.1109/ICDM.2009.84)

We consider the problem of learning a linear combination of pre-specified kernel matrices in the Fisher discriminant analysis setting. Existing methods for this task impose an $\ell_1$ norm regularisation on the kernel weights, which produces sparse solutions but may lead to a loss of information. In this paper, we propose to use $\ell_2$ norm regularisation instead. The resulting learning problem is formulated as a semi-infinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, we demonstrate the relative advantages of the proposed method and its $\ell_1$ counterpart, and gain insight into how the choice of regularisation norm should be made.
"Unsupervised Class Separation of Multivariate Data through Cumulative Variance-Based Ranking" (Andrew Foss, Osmar R Zaiane, Sandra Zilles; doi:10.1109/ICDM.2009.17)

This paper introduces a new extension of outlier detection approaches and a new concept: class separation through variance. We show that accumulating information about the outlierness of points in multiple subspaces leads to a ranking in which classes with differing variance naturally tend to separate. Exploiting this yields a highly effective and efficient unsupervised class separation approach, especially useful in the difficult case of heavily overlapping distributions. Unlike typical outlier detection algorithms, this method can be applied well beyond the 'rare classes' case with great success. Two novel algorithms implementing the approach are provided. Additionally, experiments show that on high-dimensional data the novel methods typically outperform other state-of-the-art outlier detection methods, such as Feature Bagging, SOE1, LOF, ORCA, and Robust Mahalanobis Distance, and compete even with leading supervised classification methods.
"Active Learning with Generalized Queries" (Jun Du, C. Ling; doi:10.1109/ICDM.2009.71)

Active learning can actively select or construct examples to label, reducing the number of labeled examples needed to build accurate classifiers. However, previous work on active learning can only ask specific queries. For example, to predict osteoarthritis from a patient dataset with 30 attributes, a specific query must contain values for all 30 attributes, many of which may be irrelevant. A more natural way is to ask "generalized queries" with don't-care attributes, such as "are people over 50 with knee pain likely to have osteoarthritis?" (with only two attributes: age and type of pain). We assume that the oracle (a human expert) can readily answer such generalized queries by returning probabilistic labels. The power of generalized queries is that one generalized query may be equivalent to many specific ones; however, overly general queries may receive highly uncertain labels from the oracle, which makes learning difficult. In this paper, we propose a novel active learning algorithm that asks generalized queries. We demonstrate experimentally that our method asks significantly fewer queries than previous active learning approaches, and it can be readily deployed in real-world tasks where obtaining labeled examples is costly.