2011 IEEE 11th International Conference on Data Mining: Latest Papers


Positive and Unlabeled Learning for Graph Classification

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.119

Yuchen Zhao, Xiangnan Kong, Philip S. Yu

The problem of graph classification has drawn much attention in the last decade. Conventional approaches to graph classification focus on mining discriminative subgraph features under supervised settings. The feature selection strategies strictly follow the assumption that both positive and negative graphs exist. However, in many real-world applications, negative graph examples are not available. In this paper we study the problem of how to select useful subgraph features and perform graph classification based upon only positive and unlabeled graphs. This problem is challenging and different from previous work on PU learning, because there are no predefined features in graph data. Moreover, the subgraph enumeration problem is NP-hard. We need to identify a subset of unlabeled graphs that are most likely to be negative graphs. However, the negative graph selection problem and the subgraph feature selection problem are correlated. Before the reliable negative graphs can be resolved, we need to have a set of useful subgraph features. In order to address this problem, we first derive an evaluation criterion to estimate the dependency between subgraph features and class labels based on a set of estimated negative graphs. In order to build accurate models for the PU learning problem on graph data, we propose an integrated approach to concurrently select the discriminative features and the negative graphs in an iterative manner. Experimental results illustrate the effectiveness and efficiency of the proposed method.


Citations: 43

Enabling Fast Lazy Learning for Data Streams

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.63

Peng Zhang, Byron J. Gao, Xingquan Zhu, Li Guo

Lazy learning, such as k-nearest neighbor learning, has been widely applied in many applications. Known for capturing data locality well, lazy learning can be advantageous for highly dynamic and complex learning environments such as data streams. Yet its high memory consumption and low prediction efficiency have made it less favorable for stream-oriented applications. Specifically, traditional lazy learning stores all the training data and the inductive process is deferred until a query appears, whereas in stream applications, data records flow continuously in large volumes and the prediction of class labels needs to be made in a timely manner. In this paper, we provide a systematic solution that overcomes the memory and efficiency limitations and enables fast lazy learning for concept-drifting data streams. In particular, we propose a novel Lazy-tree (L-tree for short) indexing structure that dynamically maintains compact high-level summaries of historical stream records. L-trees are M-Tree-like [5], height-balanced, and can help achieve a large reduction in memory consumption and sub-linear time complexity for prediction. Moreover, L-trees continuously absorb new stream records and discard outdated ones, so they can naturally adapt to the dynamically changing concepts in data streams for accurate prediction. Extensive experiments on real-world and synthetic data streams demonstrate the performance of our approach.
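
As a point of reference for the lazy-learning behaviour described in this abstract, here is a minimal sketch of a plain k-nearest-neighbor learner over a bounded sliding window. It is not the L-tree structure from the paper: the class name `SlidingWindowKNN`, the `window` parameter, and the toy records are illustrative assumptions, and every query still scans all stored records in linear time.

```python
from collections import Counter, deque


class SlidingWindowKNN:
    """Hypothetical baseline: k-NN over the most recent `window` stream records."""

    def __init__(self, k, window):
        self.k = k
        # A deque with maxlen silently drops the oldest record when full.
        self.records = deque(maxlen=window)

    def absorb(self, features, label):
        """Store a labelled record; no model is built at this point (lazy learning)."""
        self.records.append((features, label))

    def predict(self, query):
        """Defer all work to query time: rank stored records by squared distance."""
        nearest = sorted(
            self.records,
            key=lambda record: sum((a - b) ** 2 for a, b in zip(record[0], query)),
        )[: self.k]
        votes = Counter(label for _, label in nearest)
        return votes.most_common(1)[0][0]


model = SlidingWindowKNN(k=3, window=4)
for point in [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0)]:
    model.absorb(point, "a")
before = model.predict((0.0, 0.0))  # only "a" records are stored so far

# The concept drifts: newer records carry label "b" and push old ones out.
for point in [(5.0, 5.0), (5.0, 6.0), (6.0, 5.0)]:
    model.absorb(point, "b")
after = model.predict((0.0, 0.0))
```

The window bounds memory and lets the prediction follow the newest records, but the per-query scan is exactly the cost that a summarizing index is meant to avoid.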


Citations: 49

Semi-supervised Discriminant Hashing

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.128

Saehoon Kim, Seungjin Choi

Hashing refers to methods for embedding high-dimensional data into a similarity-preserving low-dimensional Hamming space such that similar objects are indexed by binary codes whose Hamming distances are small. Learning hash functions from data has recently been recognized as a promising approach to approximate nearest neighbor search for high-dimensional data. Most 'learning to hash' methods resort to either unsupervised or supervised learning to determine hash functions. Recently, a semi-supervised learning approach was introduced in hashing, where pairwise constraints (must-link and cannot-link) derived from labeled data are leveraged while unlabeled data are used for regularization to avoid over-fitting. In this paper we base our semi-supervised hashing on linear discriminant analysis, where hash functions are learned such that labeled data are used to maximize the separability between binary codes associated with different classes while unlabeled data are used for regularization as well as for the balancing condition and pairwise decorrelation of bits. The resulting method is referred to as semi-supervised discriminant hashing (SSDH). Numerical experiments on the MNIST and CIFAR-10 datasets demonstrate that our method outperforms existing methods, especially in the case of short binary codes.
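
To make the idea of similarity-preserving binary codes concrete, the following sketch hashes 2-D points with the sign of a few linear projections and compares codes by Hamming distance. The projection vectors are hand-picked for illustration and are an assumption of this example; SSDH learns its hash functions from labeled and unlabeled data, which is not reproduced here.

```python
def hash_code(point, projections):
    """One bit per projection: 1 when the dot product is non-negative, else 0."""
    return tuple(
        1 if sum(w * x for w, x in zip(weights, point)) >= 0 else 0
        for weights in projections
    )


def hamming(code_a, code_b):
    """Number of bit positions in which two equal-length codes differ."""
    return sum(bit_a != bit_b for bit_a, bit_b in zip(code_a, code_b))


# Hypothetical, fixed projection directions (a learned method would fit these).
projections = [(1.0, 0.0), (0.0, 1.0), (1.0, -1.0)]

a = hash_code((2.0, 1.0), projections)
b = hash_code((1.9, 1.2), projections)   # close to a in the input space
c = hash_code((-2.0, 1.0), projections)  # far from a in the input space

near_distance = hamming(a, b)
far_distance = hamming(a, c)
```

Nearby points share a code here while the distant point differs in two bits, which is the property a learned hash function tries to preserve with as few bits as possible.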


Citations: 26

Sparse Domain Adaptation in Projection Spaces Based on Good Similarity Functions

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.136

Emilie Morvant, Amaury Habrard, S. Ayache

We address the problem of domain adaptation for binary classification, which arises when the distributions generating the source learning data and target test data are somewhat different. We consider the challenging case where no labeled target data is available. From a theoretical standpoint, a classifier has better generalization guarantees when the marginal distributions of the two domains are close. We study a new direction based on a recent framework of Balcan et al. that allows learning linear classifiers in an explicit projection space based on similarity functions that need not be symmetric or positive semi-definite. We propose a general method for learning a good classifier on target data with generalization guarantees, and we improve its efficiency with an iterative procedure that reweights the similarity function, in a way compatible with the framework of Balcan et al., to move the two distributions closer in a new projection space. Hyperparameters and reweighting quality are controlled by a reverse validation procedure. Our approach is based on a linear programming formulation and shows good adaptation performance with very sparse models. We evaluate it on a synthetic problem and on a real image annotation task.


Citations: 8

ADANA: Active Name Disambiguation

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.19

Xuezhi Wang, Jie Tang, Hong Cheng, Philip S. Yu

Name ambiguity has long been viewed as a challenging problem in many applications, such as scientific literature management, people search, and social network analysis. When we search for a person's name in these systems, many documents (e.g., papers, web pages) containing that name may be returned. It is hard to determine which documents are about the person we care about. Although much research has been conducted, the problem remains largely unsolved, especially with the rapid growth of information about people available on the Web. In this paper, we study this problem from a new perspective and propose ADANA, a method for disambiguating person names via active user interactions. In ADANA, we first introduce a pairwise factor graph (PFG) model for person name disambiguation. The model is flexible and can be easily extended by incorporating various features. Based on the PFG model, we propose an active name disambiguation algorithm, aiming to improve the disambiguation performance by maximizing the utility of the user's corrections. Experimental results on three different genres of data sets show that with only a few user corrections, the error rate of name disambiguation can be reduced to 3.1%. A real system has been developed based on the proposed method and is available online.


Citations: 91

An Analysis of Performance Measures for Binary Classifiers

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.21

Charles Parker

If one is given two binary classifiers and a set of test data, it should be straightforward to determine which of the two classifiers is the superior. Recent work, however, has called into question many of the methods heretofore accepted as standard for this task. In this paper, we analyze seven ways of determining if one classifier is better than another, given the same test data. Five of these are long established and two are relative newcomers. We review and extend work showing that one of these methods is clearly inappropriate, and then conduct an empirical analysis with a large number of datasets to evaluate the real-world implications of our theoretical analysis. Both our empirical and theoretical results converge strongly towards one of the newer methods.


Citations: 53

Discovering Thematic Patterns in Videos via Cohesive Sub-graph Mining

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.55

Gangqiang Zhao, Junsong Yuan

Videos in one category usually contain the same thematic pattern, e.g., the spin action in skating videos. The discovery of the thematic pattern is essential to understanding and summarizing the video contents. This paper addresses two critical issues in mining thematic video patterns: (1) automatic discovery of thematic patterns without any training or supervision information, and (2) accurate localization of the occurrences of all thematic patterns in videos. The major contributions are two-fold. First, we formulate thematic video pattern discovery as a cohesive sub-graph selection problem by finding a subset of visual words that are spatio-temporally collocated. Then spatio-temporal branch-and-bound search can locate all instances accurately. Second, a novel method is proposed to efficiently find the cohesive sub-graph of maximum overall mutual information scores. Our experimental results on challenging commercial and action videos show that our approach can discover different types of thematic patterns despite variations in scale, viewpoint, color and lighting conditions, or partial occlusions. Our approach is also robust to videos with cluttered and dynamic backgrounds.


Citations: 19

Exploiting False Discoveries -- Statistical Validation of Patterns and Quality Measures in Subgroup Discovery

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.65

W. Duivesteijn, A. Knobbe

Subgroup discovery suffers from the multiple comparisons problem: we search through a large space, hence whenever we report a set of discoveries, this set will generally contain false discoveries. We propose a method to compare subgroups found through subgroup discovery with a statistical model we build for these false discoveries. We determine how much the subgroups we find deviate from the model, and hence statistically validate the found subgroups. Furthermore we propose to use this subgroup validation to objectively compare quality measures used in subgroup discovery, by determining how much the top subgroups we find with each measure deviate from the statistical model generated with that measure. We thus aim to determine how good individual measures are in selecting significant findings. We invoke our method to experimentally compare popular quality measures in several subgroup discovery settings.
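
The idea of modelling false discoveries can be sketched with a simple permutation baseline: shuffle the target so that any subgroup found is false by construction, record the best quality reached in each shuffled run, and compare the real result against that distribution. This is an illustrative sketch, not the authors' procedure. The use of weighted relative accuracy (`wracc`) as the quality measure, the restriction to single-attribute subgroups, and the toy data are all assumptions of this example.

```python
import random


def wracc(covered, target):
    """Weighted relative accuracy of a subgroup given as a list of booleans."""
    n = len(target)
    size = sum(covered)
    if size == 0:
        return 0.0
    positives_in_subgroup = sum(t for c, t in zip(covered, target) if c)
    return (size / n) * (positives_in_subgroup / size - sum(target) / n)


def best_quality(attributes, target):
    """Quality of the best single-attribute subgroup (hypothetical search space)."""
    return max(wracc(column, target) for column in attributes)


def null_qualities(attributes, target, runs, seed=0):
    """Best quality found in each of `runs` datasets with a shuffled target."""
    rng = random.Random(seed)
    shuffled = list(target)
    scores = []
    for _ in range(runs):
        rng.shuffle(shuffled)  # breaks any real link between attributes and target
        scores.append(best_quality(attributes, shuffled))
    return scores


target = [1, 1, 1, 1, 0, 0, 0, 0]
attributes = [
    [True, True, True, True, False, False, False, False],
    [True, False, True, False, True, False, True, False],
    [True, True, False, False, False, False, False, False],
]

observed = best_quality(attributes, target)
null = null_qualities(attributes, target, runs=200)
# Share of shuffled runs whose best subgroup is at least as good as the real one.
p_value = sum(q >= observed for q in null) / len(null)
```

A small `p_value` suggests the observed subgroup is unlikely to be an artifact of searching a large space. Repeating the comparison per quality measure is one way to contrast measures, under the assumptions stated above.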


Citations: 55

Co-clustering for Binary and Categorical Data with Maximum Modularity

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.37

Lazhar Labiod, M. Nadif

To tackle the co-clustering problem for binary and categorical data, we propose a generalized modularity measure and a spectral approximation of the modularity matrix. A spectral algorithm maximizing the modularity measure is then presented. Experiments on a variety of simulated and real-world data sets confirm the value of modularity both for co-clustering and for assessing the number of clusters.


Citations: 42

An Efficient Greedy Method for Unsupervised Feature Selection

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.22

Ahmed K. Farahat, A. Ghodsi, M. Kamel

In data mining applications, data instances are typically described by a huge number of features. Most of these features are irrelevant or redundant, which negatively affects the efficiency and effectiveness of different learning algorithms. The selection of relevant features is a crucial task which can be used to allow a better understanding of data or improve the performance of other learning tasks. Although the selection of relevant features has been extensively studied in supervised learning, feature selection in the absence of class labels is still a challenging task. This paper proposes a novel method for unsupervised feature selection, which efficiently selects features in a greedy manner. The paper first defines an effective criterion for unsupervised feature selection which measures the reconstruction error of the data matrix based on the selected subset of features. The paper then presents a novel algorithm for greedily minimizing the reconstruction error based on the features selected so far. The greedy algorithm is based on an efficient recursive formula for calculating the reconstruction error. Experiments on real data sets demonstrate the effectiveness of the proposed algorithm in comparison to state-of-the-art methods for unsupervised feature selection.
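
The reconstruction-error criterion can be illustrated with a small greedy routine: at each step pick the feature whose span explains the most remaining variation in all features, then remove that direction from every feature. This is a plain deflation sketch under stated assumptions, not the paper's recursive formula. The function name `greedy_select`, the tolerance `tol`, and the toy data are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def greedy_select(columns, k, tol=1e-12):
    """Greedily pick up to k feature columns that best reconstruct all columns.

    Returns the selected indices and the squared Frobenius reconstruction error.
    """
    residual = [list(column) for column in columns]
    selected = []
    for _ in range(k):
        best, best_gain = None, tol
        for i, e_i in enumerate(residual):
            norm = dot(e_i, e_i)
            if i in selected or norm <= tol:
                continue  # already chosen, or fully explained by chosen features
            # Error reduction obtained by projecting every residual onto e_i.
            gain = sum(dot(e_i, e_j) ** 2 for e_j in residual) / norm
            if gain > best_gain:
                best, best_gain = i, gain
        if best is None:
            break
        pivot = list(residual[best])
        norm = dot(pivot, pivot)
        # Deflate: remove the pivot direction from every residual column.
        residual = [
            [x - dot(pivot, e_j) / norm * p for x, p in zip(e_j, pivot)]
            for e_j in residual
        ]
        selected.append(best)
    error = sum(dot(e, e) for e in residual)
    return selected, error


# Toy features over three instances: feature 1 is a rescaled copy of feature 0.
features = [
    [1.0, 0.0, 0.0],
    [2.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.0, 0.0, 0.1],
]
selected, error = greedy_select(features, k=2)
```

On this toy input the redundant copy is never chosen, because once feature 0 is selected its rescaled duplicate has zero residual. Recomputing every gain from scratch at each step is the naive cost that a recursive update is designed to avoid.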


Citations: 87
