
2009 Ninth IEEE International Conference on Data Mining — Latest Publications

Online and Batch Learning of Generalized Cosine Similarities
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.114
A. M. Qamar, Éric Gaussier
In this paper, we define an online algorithm for learning generalized cosine similarity measures for kNN classification, and hence a similarity matrix A corresponding to a bilinear form. In contrast to the standard cosine measure, the normalization itself depends on the similarity matrix, which makes it impossible to directly use the algorithms developed for learning Mahalanobis distances, which are based on positive semi-definite (PSD) matrices. We follow the approach of first finding an appropriate matrix and then projecting it onto the cone of PSD matrices, adapted to the particular form of generalized cosine similarities and, more particularly, to the fact that such measures are normalized. The resulting online algorithm, as well as its batch version, is fast and more accurate than state-of-the-art methods on standard data sets.
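A minimal sketch of the two ingredients the abstract names, the generalized cosine similarity induced by a matrix A and the projection onto the PSD cone (the example matrix and vectors are illustrative, not from the paper):

```python
import numpy as np

def generalized_cosine(x, y, A):
    """Bilinear similarity normalized by the A-induced norms:
    s_A(x, y) = x^T A y / sqrt((x^T A x)(y^T A y))."""
    return (x @ A @ y) / np.sqrt((x @ A @ x) * (y @ A @ y))

def project_to_psd(A):
    """Project a symmetric matrix onto the PSD cone by clipping
    negative eigenvalues to zero."""
    A = (A + A.T) / 2                      # symmetrize first
    w, V = np.linalg.eigh(A)
    return V @ np.diag(np.clip(w, 0.0, None)) @ V.T

x = np.array([1.0, 2.0])
y = np.array([2.0, 1.0])
A = project_to_psd(np.array([[2.0, -3.0], [-3.0, 1.0]]))
s = generalized_cosine(x, y, A)
# With PSD A, Cauchy-Schwarz in the A-inner product bounds |s| by 1.
assert abs(s) <= 1 + 1e-9
```

The key point the abstract makes is visible here: A appears in the normalization too, so PSD-based Mahalanobis machinery cannot be reused directly.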
Citations: 23
Dirichlet Mixture Allocation for Multiclass Document Collections Modeling
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.102
Wei Bian, D. Tao
The topic model Latent Dirichlet Allocation (LDA) is an effective tool for the statistical analysis of large collections of documents. In LDA, each document is modeled as a mixture of topics, and the topic proportions are generated from a unimodal Dirichlet prior. When a collection of documents is drawn from multiple classes, this unimodal prior is insufficient for data fitting. To solve this problem, we exploit a multimodal Dirichlet mixture prior and propose Dirichlet mixture allocation (DMA). We report experiments on the popular TDT2 corpus demonstrating that DMA models a collection of documents more precisely than LDA when the documents are drawn from multiple classes.
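A toy sketch of the prior being contrasted: topic proportions drawn from a mixture of Dirichlets (pick a component, then draw), which is multimodal where a single Dirichlet is not. The mixture weights and concentration parameters below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical mixture of two Dirichlet components over 3 topics:
# one component favors topic 0, the other favors topic 2.
weights = np.array([0.5, 0.5])
alphas = np.array([[8.0, 1.0, 1.0],
                   [1.0, 1.0, 8.0]])

def sample_theta_dma(n):
    """Draw topic proportions from the Dirichlet mixture prior:
    pick a mixture component, then draw from its Dirichlet."""
    comps = rng.choice(len(weights), size=n, p=weights)
    return np.array([rng.dirichlet(alphas[c]) for c in comps])

theta = sample_theta_dma(1000)
# Draws cluster near two different corners of the simplex, a shape a
# single (unimodal) Dirichlet prior cannot capture for multiclass data.
```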
Citations: 5
Cross-Guided Clustering: Transfer of Relevant Supervision across Domains for Improved Clustering
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.33
Indrajit Bhattacharya, S. Godbole, Sachindra Joshi, Ashish Verma
Lack of supervision in clustering algorithms often leads to clusters that are not useful or interesting to human reviewers. We investigate whether supervision can be automatically transferred to a clustering task in a target domain by providing a relevant supervised partitioning of a dataset from a different source domain. The target clustering is made more meaningful for the human user by trading off intrinsic clustering goodness on the target dataset for alignment with relevant supervised partitions in the source dataset, wherever possible. We propose a cross-guided clustering algorithm that builds on traditional k-means by aligning the target clusters with the source partitions. The alignment process makes use of a cross-domain similarity measure that discovers hidden relationships across domains with potentially different vocabularies. Using multiple real-world datasets, we show that our approach improves clustering accuracy significantly over traditional k-means.
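A toy sketch of the alignment step only: greedily matching each target cluster centroid to its most similar source partition centroid. Plain cosine over a shared vocabulary stands in for the paper's learned cross-domain similarity measure, and the centroids are invented:

```python
import numpy as np

def cosine(u, v):
    return (u @ v) / (np.linalg.norm(u) * np.linalg.norm(v) + 1e-12)

def align_clusters(target_centroids, source_centroids):
    """Greedily match each target cluster to the most similar unused
    source partition (a stand-in for the paper's alignment process)."""
    assignment, used = {}, set()
    for i, tc in enumerate(target_centroids):
        sims = [(cosine(tc, sc), j) for j, sc in enumerate(source_centroids)
                if j not in used]
        _, best_j = max(sims)
        assignment[i] = best_j
        used.add(best_j)
    return assignment

T = np.array([[0.9, 0.1, 0.0], [0.0, 0.2, 0.8]])   # target centroids
S = np.array([[0.1, 0.1, 0.9], [1.0, 0.0, 0.1]])   # source partition centroids
alignment = align_clusters(T, S)   # {0: 1, 1: 0}
```

In the paper this alignment enters the k-means objective as a penalty term rather than a one-shot matching; the sketch only shows the matching direction.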
Citations: 16
Probabilistic Similarity Query on Dimension Incomplete Data
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.72
Wei-min Cheng, Xiaoming Jin, Jian-Tao Sun
Retrieving similar data has attracted considerable research effort in the literature owing to its importance in data mining, databases, and information retrieval. The problem is challenging when the data is incomplete. In previous research, data incompleteness refers to data values being unknown for some dimensions. However, in many practical applications (e.g., data collected by a sensor network in a harsh environment), not only data values but even data dimension information may be missing, which makes most similarity query algorithms infeasible. In this work, we propose the novel problem of similarity queries on dimension-incomplete data and adopt a probabilistic framework to model it. Users can give a distance threshold and a probability threshold to specify their retrieval requirements: the distance threshold specifies the allowed distance between query and data objects, and the probability threshold requires that the retrieval results satisfy the distance condition with at least the given probability. Instead of enumerating all possible cases to recover the missing dimensions, we propose an efficient approach that speeds up the retrieval process by leveraging the inherent relations between the query and dimension-incomplete data objects. During query processing, we estimate lower and upper bounds on the probability that the query is satisfied by a given data object, and use these bounds to filter irrelevant data objects efficiently. Furthermore, a probabilistic triangle inequality is proposed to further speed up query processing. Experiments on real data sets verify that the proposed similarity query method is effective and efficient on dimension-incomplete data.
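A toy illustration of the query semantics via the brute-force enumeration the paper's bounds are designed to avoid. It assumes each placement of the observed values onto the query's dimensions is equally likely and compares only on the observed dimensions; both assumptions are mine, for illustration:

```python
from itertools import combinations
import math

def prob_within_distance(query, observed, tau):
    """An object observed in fewer dimensions than the query could have
    come from any subset of the query's dimensions. Enumerate all
    placements (exponential cost) and return the fraction whose
    Euclidean distance to the query is within tau."""
    m, k = len(query), len(observed)
    hits = total = 0
    for dims in combinations(range(m), k):
        d = math.sqrt(sum((query[i] - v) ** 2 for i, v in zip(dims, observed)))
        total += 1
        hits += d <= tau
    return hits / total

q = [1.0, 5.0, 9.0]
x = [1.2, 5.1]                       # two observed values, unknown dimensions
p = prob_within_distance(q, x, tau=0.5)   # 1 of 3 placements is within tau
```

A retrieval with probability threshold 0.5 would discard this object, since only one of the three placements satisfies the distance condition.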
Citations: 4
Efficient Discovery of Confounders in Large Data Sets
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.77
Wenjun Zhou, Hui Xiong
Given a large transaction database, association analysis is concerned with efficiently finding strongly related objects. Unlike traditional association analysis, where relationships among variables are sought at a global level, we examine confounding factors at a local level. Indeed, many real-world phenomena are localized to specific regions and times, and these relationships may not be visible when the entire data set is analyzed. In particular, confounding effects that reverse the direction of a correlation are the most significant. Along this line, we propose to efficiently find confounding effects attributable to local associations. Specifically, we derive an upper bound from a necessary condition for confounders, which helps us prune the search space and identify confounders efficiently. Experimental results show that the proposed CONFOUND algorithm can effectively identify confounders, with computational performance an order of magnitude faster than benchmark methods.
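The direction-reversing effect the abstract highlights is Simpson's paradox. A minimal sketch with invented data: the pooled correlation is positive, yet within each group (the confounder) it is negative:

```python
import numpy as np

def pearson(x, y):
    return np.corrcoef(np.asarray(x, float), np.asarray(y, float))[0, 1]

# Toy data: within each group x and y are perfectly negatively
# correlated, but the group-level offsets make the pooled trend positive.
g0_x, g0_y = [1, 2, 3], [5, 4, 3]
g1_x, g1_y = [6, 7, 8], [10, 9, 8]

pooled  = pearson(g0_x + g1_x, g0_y + g1_y)   # > 0
within0 = pearson(g0_x, g0_y)                 # -1
within1 = pearson(g1_x, g1_y)                 # -1
# The grouping variable reverses the correlation's direction, i.e. it
# is exactly the kind of confounder the paper's search targets.
```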
Citations: 5
Peculiarity Analysis for Classifications
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.31
Jian Yang, Ning Zhong, Yiyu Yao, Jue Wang
Peculiarity-oriented mining (POM) is a new data mining method consisting of peculiar data identification and peculiar data analysis. The peculiarity factor (PF) and the local peculiarity factor (LPF) are important concepts employed to describe the peculiarity of points in the identification step; these notions can be studied at both the attribute and record levels. In this paper, a new record LPF called the distance-based record LPF (D-record LPF) is proposed, defined as the sum of distances between a point and its nearest neighbors. We prove mathematically that the D-record LPF can accurately characterize the probability density function of a continuous m-dimensional distribution. This provides a theoretical basis for several existing distance-based anomaly detection techniques and, more importantly, an effective method for describing the class-conditional probabilities in a Bayesian classifier. The result enables us to apply peculiarity analysis to classification problems. A novel algorithm called the LPF-Bayes classifier and its kernelized implementation are presented, which have close connections to the Bayesian classifier. Experimental results on several benchmark data sets demonstrate that the proposed classifiers are effective.
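The D-record LPF definition quoted above (sum of distances to a point's nearest neighbors) can be sketched directly; the choice of k and the toy points are mine:

```python
import numpy as np

def d_record_lpf(X, k=3):
    """Distance-based record LPF: for each point, the sum of Euclidean
    distances to its k nearest neighbors. Larger values mark more
    peculiar (lower-density) records."""
    X = np.asarray(X, float)
    D = np.sqrt(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1))
    D_sorted = np.sort(D, axis=1)          # column 0 is the self-distance 0
    return D_sorted[:, 1:k + 1].sum(axis=1)

X = [[0, 0], [0, 1], [1, 0], [1, 1], [10, 10]]
scores = d_record_lpf(X, k=2)
# The isolated point (10, 10) receives by far the largest LPF,
# consistent with the density-characterization claim in the abstract.
assert scores.argmax() == 4
```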
Citations: 7
flowNet: Flow-Based Approach for Efficient Analysis of Complex Biological Networks
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.39
Young-Rae Cho, Lei Shi, A. Zhang
Biological networks with complex connectivity have been widely studied recently. By characterizing their inherent and structural behaviors from a topological perspective, these studies have attempted to discover hidden knowledge in the systems. However, even though various algorithms based on graph-theoretic modeling have provided the fundamentals of network analysis, practical approaches that efficiently handle this complexity have remained limited. In this paper, we present a novel flow-based approach, called flowNet, to efficiently analyze large, complex networks. Our approach is based on a functional influence model that quantifies the influence of one biological component on another. We introduce a dynamic flow simulation algorithm that generates a flow pattern uniquely characterizing each component; the set of patterns can be used to identify functional modules (i.e., clustering). The proposed flow simulation algorithm runs very efficiently on sparse networks. Since our approach takes a weighted network as input, we also discuss supervised and unsupervised weighting schemes for unweighted biological networks. In experiments on the yeast protein interaction network, we demonstrate that our approach outperforms previous graph clustering methods with respect to accuracy.
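A loose sketch of the flow-simulation idea: flow injected at one node splits across weighted edges, decays at each hop, and stops below a threshold, producing a per-node pattern. The decay factor, threshold, and graph are illustrative, not the paper's model:

```python
from collections import defaultdict

def flow_pattern(graph, source, init_flow=1.0, min_flow=0.05):
    """Propagate flow from `source` over a weighted directed graph:
    at each step flow splits across outgoing edges in proportion to
    edge weight and decays by half, stopping once below min_flow.
    Returns the total flow that reached each node (the 'pattern')."""
    reached = defaultdict(float)
    frontier = [(source, init_flow)]
    while frontier:
        nxt = []
        for node, flow in frontier:
            reached[node] += flow
            nbrs = graph.get(node, {})
            total_w = sum(nbrs.values())
            for nbr, w in nbrs.items():
                f = flow * w / total_w * 0.5   # split and decay
                if f >= min_flow:
                    nxt.append((nbr, f))
        frontier = nxt
    return dict(reached)

g = {"a": {"b": 2.0, "c": 1.0}, "b": {"c": 1.0}, "c": {}}
pat = flow_pattern(g, "a")
# c accumulates flow via two paths (a->c and a->b->c), 1/6 each.
```

The decay plus cutoff is what keeps such a simulation cheap on sparse networks, since each wave touches only the neighborhood of the previous one.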
Citations: 10
Non-sparse Multiple Kernel Learning for Fisher Discriminant Analysis
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.84
F. Yan, J. Kittler, K. Mikolajczyk, M. Tahir
We consider the problem of learning a linear combination of pre-specified kernel matrices in the Fisher discriminant analysis setting. Existing methods for this task impose an $\ell_1$ norm regularisation on the kernel weights, which produces sparse solutions but may lead to a loss of information. In this paper, we propose to use $\ell_2$ norm regularisation instead. The resulting learning problem is formulated as a semi-infinite program and can be solved efficiently. Through experiments on both synthetic data and a very challenging object recognition benchmark, the relative advantages of the proposed method and its $\ell_1$ counterpart are demonstrated, and insights are gained into how the choice of regularisation norm should be made.
Citations: 24
Unsupervised Class Separation of Multivariate Data through Cumulative Variance-Based Ranking
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.17
Andrew Foss, Osmar R Zaiane, Sandra Zilles
This paper introduces a new extension of outlier detection approaches and a new concept: class separation through variance. We show that accumulating information about the outlierness of points in multiple subspaces leads to a ranking in which classes with differing variance naturally tend to separate. Exploiting this yields a highly effective and efficient unsupervised class separation approach, especially useful in the difficult case of heavily overlapping distributions. Unlike typical outlier detection algorithms, this method can be applied beyond the `rare classes' case with great success. Two novel algorithms that implement this approach are provided. Additionally, experiments show that the novel methods typically outperform other state-of-the-art outlier detection methods on high-dimensional data, such as Feature Bagging, SOE1, LOF, ORCA, and Robust Mahalanobis Distance, and even compete with leading supervised classification methods.
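A toy stand-in for the accumulation idea: score each point's outlierness in every one-dimensional subspace (here a plain |z-score|, which is my simplification) and rank by the accumulated total, so high-variance points rise to the top:

```python
import numpy as np

def cumulative_outlier_rank(X):
    """Accumulate a per-subspace outlierness score (absolute z-score in
    each 1-D subspace) and rank points by the total, most outlying
    first. A sketch of the accumulation idea, not the paper's scorers."""
    X = np.asarray(X, float)
    z = np.abs((X - X.mean(axis=0)) / (X.std(axis=0) + 1e-12))
    total = z.sum(axis=1)                 # accumulate over subspaces
    return np.argsort(-total)             # indices, most outlying first

X = [[0.0, 0.0], [0.1, -0.1], [-0.1, 0.1], [0.2, 0.0], [5.0, 5.0]]
ranking = cumulative_outlier_rank(X)
# The point drawn from a much higher-variance regime ranks first, so
# ranking by accumulated score separates the two "classes".
assert ranking[0] == 4
```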
Citations: 9
Active Learning with Generalized Queries
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.71
Jun Du, C. Ling
Active learning can actively select or construct examples to label, reducing the number of labeled examples needed to build accurate classifiers. However, previous work on active learning can only ask specific queries. For example, to predict osteoarthritis from a patient dataset with 30 attributes, specific queries always contain values for all 30 attributes, many of which may be irrelevant. A more natural way is to ask "generalized queries" with don't-care attributes, such as "are people over 50 with knee pain likely to have osteoarthritis?" (with only two attributes: age and type of pain). We assume that the oracle (and human experts) can readily answer such generalized queries by returning probabilistic labels. The power of generalized queries is that one generalized query may be equivalent to many specific ones. However, overly general queries may receive highly uncertain labels from the oracle, which makes learning difficult. In this paper, we propose a novel active learning algorithm that asks generalized queries. We demonstrate experimentally that our new method asks significantly fewer queries than previous active learning approaches, and it can be readily deployed in real-world tasks where obtaining labeled examples is costly.
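A minimal sketch of the oracle side of the abstract's example: a generalized query leaves most attributes as don't-cares, and the answer is a probabilistic label. The records and attribute names below are hypothetical, echoing the osteoarthritis example:

```python
def answer_generalized_query(data, query):
    """Toy oracle for a generalized query: attributes absent from
    `query` are don't-cares. Returns the fraction of matching records
    that are positive as a probabilistic label (None if no match)."""
    matches = [r for r in data
               if all(r[attr] == v for attr, v in query.items())]
    if not matches:
        return None
    return sum(r["label"] for r in matches) / len(matches)

# Hypothetical patient records (the other 28 attributes elided).
data = [
    {"age_over_50": True,  "knee_pain": True,  "label": 1},
    {"age_over_50": True,  "knee_pain": True,  "label": 1},
    {"age_over_50": True,  "knee_pain": True,  "label": 0},
    {"age_over_50": False, "knee_pain": True,  "label": 0},
]
# "Are people over 50 with knee pain likely to have osteoarthritis?"
p = answer_generalized_query(data, {"age_over_50": True, "knee_pain": True})
# One 2-attribute query covers every specific 30-attribute completion,
# at the cost of an uncertain (probabilistic) label.
```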
{"title":"Active Learning with Generalized Queries","authors":"Jun Du, C. Ling","doi":"10.1109/ICDM.2009.71","DOIUrl":"https://doi.org/10.1109/ICDM.2009.71","url":null,"abstract":"Active learning can actively select or construct examples to label to reduce the number of labeled examples needed for building accurate classifiers. However, previous works of active learning can only ask specific queries. For example, to predict osteoarthritis from a patient dataset with 30 attributes, specific queries always contain values of all these 30 attributes, many of which may be irrelevant. A more natural way is to ask \"generalized queries\" with don't-care attributes, such as \"are people over 50 with knee pain likely to have osteoarthritis?\" (with only two attributes: age and type of pain). We assume that the oracle (and human experts) can readily answer those generalized queries by returning probabilistic labels. The power of such generalized queries is that one generalized query may be equivalent to many specific ones. However, overly general queries may receive highly uncertain labels from the oracle, and this makes learning difficult. In this paper, we propose a novel active learning algorithm that asks generalized queries. We demonstrate experimentally that our new method asks significantly fewer queries compared with the previous works of active learning. Our method can be readily deployed in real-world tasks where obtaining labeled examples is costly.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"26 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125948532","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cited by: 12
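The core move in the abstract above is turning a specific query into a generalized one by marking low-influence attributes as "don't care" before asking the oracle. The sketch below is a hypothetical illustration of that step, not the paper's algorithm: it keeps only the attributes whose removal (crudely simulated by zeroing) shifts the model's predicted probability by more than a threshold. The model, threshold, and attribute encoding are all assumptions for the example.

```python
import numpy as np

def generalize_query(model_proba, x, threshold=0.05):
    """Return indices of attributes to keep in a generalized query:
    those whose removal shifts the prediction by more than `threshold`.
    All other attributes become don't-cares."""
    base = model_proba(x)
    keep = []
    for j in range(len(x)):
        x2 = x.copy()
        x2[j] = 0.0  # crude stand-in for "don't care"
        if abs(model_proba(x2) - base) > threshold:
            keep.append(j)
    return keep

# Toy linear "model": P(y=1|x) = sigmoid(w . x). Most weights are (near)
# zero, so most attributes should be generalized away.
w = np.array([2.0, 0.0, 0.01, -1.5, 0.0])
model_proba = lambda x: 1.0 / (1.0 + np.exp(-(w @ x)))

x = np.array([1.0, 0.7, -0.2, 0.5, 1.3])
keep = generalize_query(model_proba, x)
print(keep)  # prints [0, 2] or similar; here only attributes 0 and 3 matter
```

In the paper's setting the oracle would then answer the generalized query (e.g., "age and type of pain only") with a probabilistic label covering all specific examples it subsumes; the threshold controls the trade-off the abstract mentions between query generality and label uncertainty.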