
Latest publications from the 2011 IEEE 11th International Conference on Data Mining

Document Clustering via Matrix Representation
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.59
Xufei Wang, Jiliang Tang, Huan Liu
The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and computationally convenient, but it also oversimplifies a document into a single vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. This paper proposes a matrix representation of documents: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics, which can be mapped to clustering structures. Latent topic extraction based on the matrix representation is formulated as a constrained optimization problem in which each matrix (i.e., a document) A_i is factorized into a common basis, determined by non-negative matrices L and R^top, and a non-negative weight matrix M_i, such that the total reconstruction error over all documents is minimized. Empirical evaluation demonstrates that the matrix model is feasible for document clustering: (1) compared with the vector representation, the matrix representation consistently improves clustering quality, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets; and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods like LDA and PLSI. Furthermore, the matrix model allows more refined information retrieval at the segment level instead of the document level, enabling the return of more relevant documents in information retrieval tasks.
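The shared-basis factorization described in the abstract can be sketched numerically. In the toy example below, the data, dimensions, and the projected-gradient update over the per-document weights are all illustrative assumptions, not the paper's actual optimization algorithm; the code only shows the objective — the summed squared reconstruction error of A_i against L M_i R^T with everything non-negative — decreasing:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy data: three "documents", each a 6x4 term-by-segment matrix.
docs = [np.abs(rng.normal(size=(6, 4))) for _ in range(3)]
k = 2  # assumed number of latent topics

# Shared non-negative bases L (terms x topics) and R (segments x topics);
# each document gets its own non-negative weight matrix M_i, A_i ~ L @ M_i @ R.T.
L = np.abs(rng.normal(size=(6, k)))
R = np.abs(rng.normal(size=(4, k)))
Ms = [np.abs(rng.normal(size=(k, k))) for _ in docs]

def total_error(Ms):
    """Sum of squared Frobenius reconstruction errors over all documents."""
    return sum(np.linalg.norm(A - L @ M @ R.T, "fro") ** 2
               for A, M in zip(docs, Ms))

err_before = total_error(Ms)

# Projected gradient on the weights only, with a step bounded by the
# Lipschitz constant of the gradient so the objective cannot increase;
# clipping at zero enforces non-negativity.
lr = 1.0 / (np.linalg.norm(L.T @ L, 2) * np.linalg.norm(R.T @ R, 2))
for _ in range(200):
    for i, A in enumerate(docs):
        grad = L.T @ (L @ Ms[i] @ R.T - A) @ R
        Ms[i] = np.clip(Ms[i] - lr * grad, 0.0, None)

err_after = total_error(Ms)
print(err_before, err_after)
```

A full implementation would also update the shared L and R across documents; only the weights are refined here to keep the sketch short.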
Citations: 21
ASAP: A Self-Adaptive Prediction System for Instant Cloud Resource Demand Provisioning
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.25
Yexi Jiang, Chang-Shing Perng, Tao Li, Rong N. Chang
The promise of cloud computing is to provide computing resources instantly whenever they are needed. State-of-the-art virtual machine (VM) provisioning technology can provision a VM in tens of minutes. This latency is unacceptable for jobs that need to scale out during computation. To truly enable on-the-fly scaling, a new VM needs to be ready within seconds of the request. In this paper, we present an online temporal data mining system called ASAP to model and predict cloud VM demand. ASAP aims to extract high-level characteristics from the VM provisioning request stream and notify the provisioning system to prepare VMs in advance. To quantify this, we propose the Cloud Prediction Cost, which encodes the costs and constraints of the cloud and guides the training of the prediction algorithms. Moreover, we utilize a two-level ensemble method to capture the characteristics of the highly transient demand time series. Experimental results using historical data from an IBM cloud in operation demonstrate that ASAP significantly improves cloud service quality and makes on-the-fly provisioning possible.
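An error-weighted forecasting ensemble of the general kind mentioned above can be illustrated in miniature. The base predictors, weighting scheme, and demand numbers below are hypothetical stand-ins, not ASAP's actual components:

```python
def moving_average(xs, w):
    """Mean of the last w values (or all of them, if fewer)."""
    return sum(xs[-w:]) / min(w, len(xs))

def ensemble_forecast(history, window=3):
    """Weight two base predictors — last value and short moving average —
    by their inverse cumulative absolute error on the observed history.
    A minimal stand-in for a two-level ensemble, not ASAP's method."""
    preds = [lambda h: h[-1], lambda h: moving_average(h, window)]
    errs = [1e-9] * len(preds)  # tiny floor avoids division by zero
    for t in range(window, len(history)):
        past, actual = history[:t], history[t]
        for j, p in enumerate(preds):
            errs[j] += abs(p(past) - actual)
    weights = [1.0 / e for e in errs]
    total = sum(weights)
    return sum(w / total * p(history) for w, p in zip(weights, preds))

# Hypothetical VM-request counts per interval; demand jumps at t = 4.
demand = [10, 12, 11, 13, 30, 31, 33]
print(ensemble_forecast(demand))
```

After the jump, the last-value predictor has accumulated less error than the moving average, so the combined forecast leans toward the most recent observation.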
Citations: 83
Maximum Entropy Modelling for Assessing Results on Real-Valued Data
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.98
Kleanthis-Nikolaos Kontonasios, Jilles Vreeken, T. D. Bie
Statistical assessment of the results of data mining is increasingly recognised as a core task in the knowledge discovery process. It is of key importance in practice, as results that might seem interesting at first glance can often be explained by well-known basic properties of the data. In pattern mining, for instance, such trivial results can be so overwhelming in number that filtering them out is a necessity in order to identify the truly interesting patterns. In this paper, we propose an approach for assessing results on real-valued rectangular databases. More specifically, using our analytical model we are able to statistically assess whether or not a discovered structure may be the trivial result of the row and column marginal distributions in the database. Our main approach is to use the Maximum Entropy principle to fit a background model to the data while respecting its marginal distributions. To find these distributions, we employ an MDL-based histogram estimator, and we fit these in our model using efficient convex optimization techniques. Subsequently, our model can be used to calculate probabilities directly, as well as to efficiently sample data with the purpose of assessing results by means of empirical hypothesis testing. Notably, our approach is efficient, parameter-free, and naturally deals with missing values. As such, it represents a well-founded alternative to swap randomisation.
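Empirical hypothesis testing against a sampled background model can be sketched as follows. The permutation null model used here is a deliberate simplification (it preserves each column's values while destroying dependence), not the paper's MaxEnt background model:

```python
import numpy as np

rng = np.random.default_rng(1)

def empirical_p_value(data, statistic, sampler, n_samples=500):
    """One-sided empirical p-value: fraction of null-model samples whose
    statistic is at least as extreme as the observed one (with the usual
    +1 correction so the p-value is never exactly zero)."""
    observed = statistic(data)
    hits = sum(statistic(sampler()) >= observed for _ in range(n_samples))
    return (hits + 1) / (n_samples + 1)

# Toy data: two strongly correlated real-valued columns.
x = rng.normal(size=50)
data = np.column_stack([x, x + 0.1 * rng.normal(size=50)])

stat = lambda d: abs(np.corrcoef(d[:, 0], d[:, 1])[0, 1])
# Null model (a simplification, not the paper's MaxEnt model): permute one
# column, breaking the dependence while keeping the marginal values.
sampler = lambda: np.column_stack([data[:, 0], rng.permutation(data[:, 1])])

p = empirical_p_value(data, stat, sampler)
print(p)
```

The strong observed correlation is never matched by the permuted surrogates, so the discovered structure is judged non-trivial under this null model.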
Citations: 20
Finding Robust Itemsets under Subsampling
Pub Date : 2011-12-11 DOI: 10.1145/2656261
Nikolaj Tatti, Fabian Moerchen
Mining frequent patterns is plagued by the problem of pattern explosion, making pattern reduction techniques a key challenge in pattern mining. In this paper we propose a novel theoretical framework for pattern reduction. We do this by measuring the robustness of a property of an itemset, such as closedness or non-derivability. The robustness of a property is the probability that the property holds on random subsets of the original data. We study four properties — closed, free, non-derivable, and totally shattered itemsets — demonstrating how the robustness can be computed analytically without actually sampling the data. Our concept of robustness has many advantages: unlike statistical approaches for reducing patterns, we do not assume a null hypothesis or any noise model, and the patterns reported are simply a subset of all patterns with this property, as opposed to approximate patterns for which the property does not really hold. If the underlying property is monotonic, then the measure is also monotonic, allowing us to efficiently mine robust itemsets. We further derive a parameter-free technique for ranking itemsets that can be used for top-k approaches. Our experiments demonstrate that we can successfully use the robustness measure to reduce the number of patterns and that the ranking yields interesting itemsets.
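Robustness as defined above — the probability that a property survives on random subsets of the data — can be estimated by brute force. The paper computes such probabilities analytically; the Monte Carlo sketch below, over hypothetical transactions and with frequency as the property, only conveys the definition:

```python
import random

# Hypothetical transaction database.
transactions = [
    {"a", "b"}, {"a", "b", "c"}, {"a"}, {"b", "c"}, {"a", "b"},
]

def is_frequent(itemset, data, minsup):
    """True if at least `minsup` transactions contain the itemset."""
    return sum(itemset <= t for t in data) >= minsup

def robustness(itemset, data, minsup, keep=0.7, trials=2000, seed=0):
    """Monte Carlo estimate of the probability that `itemset` stays
    frequent when each transaction is kept independently with
    probability `keep`. (The paper derives this analytically.)"""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        sub = [t for t in data if rng.random() < keep]
        hits += is_frequent(itemset, sub, minsup)
    return hits / trials

print(robustness({"a", "b"}, transactions, minsup=2))
```

Here {"a", "b"} occurs in three of the five transactions, so the exact robustness is the probability that at least two of those three survive the subsampling, roughly 0.78 at keep = 0.7.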
Citations: 17
Finding Communities in Dynamic Social Networks
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.67
Chayant Tantipathananandh, T. Berger-Wolf
Communities are natural structures observed in social networks and are usually characterized as "relatively dense" subsets of nodes. Social networks change over time, and so do the underlying community structures. Thus, to truly uncover this structure we must take the temporal aspect of networks into consideration. Previously, we presented a framework for finding dynamic communities using the social cost model and formulated the corresponding optimization problem [33], assuming that partitions of individuals into groups are given at each time step. We also presented heuristics and approximation algorithms for the problem under the same assumption [32]. In general, however, dynamic social networks are represented as a sequence of snapshot graphs, and the assumption that partitions of individuals into groups are available does not hold. In this paper, we extend the social cost model and formulate an optimization problem of finding community structure from a sequence of arbitrary graphs. We propose a semidefinite programming formulation and a heuristic rounding scheme. We show that this method is quite accurate on synthetic data sets and present its results on a real social network.
Citations: 84
A Study of Laplacian Spectra of Graph for Subgraph Queries
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.17
Lei Zhu, Qinbao Song
The spectrum of a graph has been widely used in graph mining to extract topological information. Being a graph invariant, it has also been employed as a graph characteristic in subgraph isomorphism testing. However, the spectrum cannot be directly related between a graph and its subgraphs, which is a bottleneck for subgraph isomorphism testing. In this paper, we study the Laplacian spectra of a graph and its subgraphs, and propose a method that directly adopts them for subgraph queries. In our method, we first encode every vertex and graph by extracting their Laplacian spectra, and derive novel two-step filtering conditions. Then, we follow the filtering-and-verification framework to conduct subgraph queries. Extensive experiments show that, compared with an existing counterpart method, Laplacian spectra as a graph feature can efficiently improve subgraph queries, indicating considerable potential.
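The Laplacian spectrum's role as a graph invariant is easy to demonstrate directly. The helper below is an illustrative sketch (the function name and example graphs are mine): equal spectra are necessary for isomorphism but not sufficient, which is why spectra serve as a cheap filter before exact verification rather than a complete test:

```python
import numpy as np

def laplacian_spectrum(adj):
    """Eigenvalues of the graph Laplacian L = D - A, sorted ascending.
    Isomorphic graphs always share this spectrum."""
    adj = np.asarray(adj, dtype=float)
    lap = np.diag(adj.sum(axis=1)) - adj
    return np.sort(np.linalg.eigvalsh(lap))  # symmetric, so eigvalsh

# A triangle and a 3-vertex path have the same size but different spectra.
tri = [[0, 1, 1], [1, 0, 1], [1, 1, 0]]
path = [[0, 1, 0], [1, 0, 1], [0, 1, 0]]
print(laplacian_spectrum(tri), laplacian_spectrum(path))
```

The triangle's spectrum is (0, 3, 3) and the path's is (0, 1, 3), so any candidate match between them can be rejected without running an isomorphism test.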
Citations: 2
Multi-instance Metric Learning
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.106
Ye Xu, Wei Ping, A. Campbell
Multi-instance learning, like other machine learning and data mining tasks, requires distance metrics. Although metric learning methods have been studied for many years, metric learners for multi-instance learning remain almost untouched. In this paper, we propose a framework called Multi-Instance MEtric Learning (MIMEL) to learn an appropriate distance under the multi-instance setting. The distance metric between two bags is defined using the Mahalanobis distance function. The problem is formulated by minimizing the KL divergence between two multivariate Gaussians under the constraints of maximizing the between-class bag distance and minimizing the within-class bag distance. To exploit the mechanism of how instances determine bag labels in multi-instance learning, we design a nonparametric density-estimation-based weighting scheme to assign higher “weights” to the instances that are more likely to be positive in positive bags. The weighting scheme itself has a small workload, which adds little extra computing costs to the proposed framework. Moreover, to further boost the classification accuracy, a kernel version of MIMEL is presented. We evaluate MIMEL, using not only several typical multi-instance tasks, but also two activity recognition datasets. The experimental results demonstrate that MIMEL achieves better classification accuracy than many state-of-the-art distance based algorithms or kernel methods for multi-instance learning.
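The Mahalanobis distance at the core of the bag metric is straightforward to compute. The matrices below are illustrative choices, not weights learned by MIMEL:

```python
import numpy as np

def mahalanobis(x, y, M):
    """Squared Mahalanobis distance (x - y)^T M (x - y) for a
    positive semidefinite matrix M."""
    d = np.asarray(x, dtype=float) - np.asarray(y, dtype=float)
    return float(d @ M @ d)

# With M = I this reduces to squared Euclidean distance.
print(mahalanobis([0, 0], [3, 4], np.eye(2)))

# A learned M can stretch or shrink directions; here the first
# coordinate is weighted 4x (an arbitrary illustrative choice).
print(mahalanobis([0, 0], [3, 4], np.diag([4.0, 1.0])))
```

Metric learning amounts to choosing M so that, in this stretched geometry, same-class bags end up close and different-class bags far apart.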
Citations: 27
Entropy-Based Graph Clustering: Application to Biological and Social Networks
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.64
Edward Casey Kenley, Young-Rae Cho
Complex systems have been widely studied to characterize their structural behaviors from a topological perspective. High modularity is one of the recurrent features of real-world complex systems. Various graph clustering algorithms have been applied to identifying communities in social networks or modules in biological networks. However, their applicability to real-world systems has been limited by the massive scale and complex connectivity of the networks. In this study, we exploit a novel information-theoretic model for graph clustering. The entropy-based clustering approach finds locally optimal clusters by growing a random seed in a manner that minimizes graph entropy. We design and analyze modifications that further improve its performance. Assigning priority in seed selection and seed growth is well suited to scale-free networks characterized by a hub-oriented structure. Computing seed growth in parallel streams also decomposes an extremely large network efficiently. Experimental results on real biological and social networks show that the entropy-based approach outperforms competing methods in terms of both accuracy and efficiency.
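The idea of scoring candidate clusters by graph entropy can be sketched with a binary-entropy formulation over each vertex's neighborhood. This definition and the toy graph are assumptions for illustration; the paper's exact entropy measure and seed-growth procedure are richer:

```python
import math

# Two triangles joined by a single bridge edge: {0,1,2} and {3,4,5}.
adj = {0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3}, 3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4}}

def vertex_entropy(v, cluster):
    """Binary entropy of the fraction of v's neighbors inside the cluster;
    zero when all neighbors are inside or all are outside."""
    p = sum(n in cluster for n in adj[v]) / len(adj[v])
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

def graph_entropy(cluster):
    """Total entropy of the graph given one candidate cluster; seed
    growth searches for clusters that minimize this quantity."""
    return sum(vertex_entropy(v, cluster) for v in adj)

# The natural community scores lower than one cutting across the bridge.
print(graph_entropy({0, 1, 2}), graph_entropy({1, 2, 3}))
```

A cluster aligned with a dense community leaves few vertices with mixed inside/outside neighborhoods, so its total entropy is low; a cluster straddling the bridge leaves many.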
Citations: 31
Finding Novel Diagnostic Gene Patterns Based on Interesting Non-redundant Contrast Sequence Rules
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.68
Yuhai Zhao, Guoren Wang, Yuan Li, Zhanghui Wang
Diagnostic genes are genes closely related to a specific disease phenotype, and their power to distinguish between classes is often high. Most methods for discovering powerful diagnostic genes are based on either singleton discriminability or combination discriminability. However, both ignore the abundant interactions among genes, which widely exist in the real world. In this paper, we tackle the problem from a new point of view and make the following contributions: (1) we propose the EWave model, which profitably exploits the ordered expressions among genes based on the defined equivalent dimension group sequences, taking into account the "noise" universal in real data; (2) we devise a novel sequence rule, the interesting non-redundant contrast sequence rule, which captures the difference between phenotypes with high accuracy using as few genes as possible; (3) we present an efficient algorithm called NRMINER to find such rules. Unlike conventional column enumeration and the more recent row enumeration, it performs a novel template-driven enumeration by exploiting the special characteristics of microarray data modeled by EWave. Extensive experiments on various synthetic and real datasets show that: (1) NRMINER is significantly faster than the competing algorithm, by up to about one order of magnitude; (2) it provides higher accuracy using fewer genes. Many diagnostic genes discovered by NRMINER are proved to be biologically related to certain diseases.
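The notion of a contrast rule over ordered gene expressions can be illustrated with simple pairwise orderings. The expression data, the 0.8 support-difference threshold, and the pairwise rule form are hypothetical simplifications of the paper's equivalent-dimension-group sequences:

```python
from itertools import combinations

# Hypothetical toy expression data: per-sample gene levels plus a label.
genes = ["g1", "g2", "g3"]
samples = [
    ({"g1": 1.0, "g2": 2.0, "g3": 0.5}, "disease"),
    ({"g1": 0.8, "g2": 1.9, "g3": 0.4}, "disease"),
    ({"g1": 2.1, "g2": 1.0, "g3": 0.6}, "healthy"),
    ({"g1": 2.5, "g2": 0.9, "g3": 0.7}, "healthy"),
]

def support(a, b, label):
    """Fraction of samples with `label` where gene a is expressed below b."""
    group = [expr for expr, lab in samples if lab == label]
    return sum(expr[a] < expr[b] for expr in group) / len(group)

# A contrast rule: an ordering frequent in one phenotype, rare in the other.
contrasts = [
    (a, b) for a, b in combinations(genes, 2)
    if support(a, b, "disease") - support(a, b, "healthy") >= 0.8
]
print(contrasts)
```

In this toy data the ordering g1 < g2 holds in every disease sample and no healthy sample, so it emerges as the lone contrast rule.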
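The EWave model and NRMINER algorithm are specific to the paper, but the core idea of contrasting order relations between phenotype classes can be illustrated with a minimal sketch. The function below is a deliberate simplification with hypothetical names: it scores single gene pairs rather than full equivalent dimension group sequences, keeping pairs whose order relation `expr[i] < expr[j]` holds in most samples of one class but few of the other.

```python
import numpy as np

def contrast_gene_pairs(expr_a, expr_b, min_contrast=0.8):
    """Score gene pairs (i, j) by how strongly the order relation
    expr[i] < expr[j] differs between two phenotype classes.

    expr_a, expr_b: (samples, genes) expression matrices, one per class.
    Returns (i, j, contrast) triples sorted by descending contrast.
    """
    n_genes = expr_a.shape[1]
    pairs = []
    for i in range(n_genes):
        for j in range(i + 1, n_genes):
            # fraction of samples in each class with gene i below gene j
            p_a = np.mean(expr_a[:, i] < expr_a[:, j])
            p_b = np.mean(expr_b[:, i] < expr_b[:, j])
            contrast = abs(p_a - p_b)
            if contrast >= min_contrast:
                pairs.append((i, j, contrast))
    # strongest contrasts first
    return sorted(pairs, key=lambda t: -t[2])
```

A pair with contrast near 1.0 behaves like a two-gene classifier: the order of the two expression values alone predicts the phenotype, which is the intuition behind order-based contrast rules.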
Cited by: 14
Mixture of Softmax sLDA
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.103
Xiaoxu Li, Junyu Zeng, Xiaojie Wang, Yixin Zhong
In this paper, we propose a new variant of supervised Latent Dirichlet Allocation (sLDA), mixture of softmax sLDA, for image classification. Ensemble classification methods combine multiple weak classifiers to construct a strong classifier. Inspired by this ensemble idea, we try to improve the sLDA model. The mixture of softmax model is a probabilistic ensemble classification model that can fit the training data and class labels well. We embed the mixture of softmax model into the LDA model under the framework of sLDA, and construct an ensemble supervised topic model for image classification. Meanwhile, we derive an elegant parameter estimation algorithm based on the variational EM method, and give a simple and efficient approximation method for classifying a new image. Finally, we demonstrate the effectiveness of our model by comparing it with some existing approaches on two real-world datasets. The results show that our model improves classification accuracy by 7% on the 1600-image LabelMe dataset and 9% on the 1791-image UIUC-Sport dataset.
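The mixture-of-softmax component can be sketched at prediction time: each expert is a softmax classifier, and a softmax gate mixes the experts per input. The numpy sketch below uses hypothetical names and shows only the predictive distribution; the paper's actual model embeds this mixture into sLDA over topic proportions and trains it with variational EM, which is omitted here.

```python
import numpy as np

def softmax(z):
    # numerically stable softmax over the last axis
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mixture_softmax_predict(x, gate_w, expert_ws):
    """p(y|x) = sum_k pi_k(x) * softmax(W_k x).

    x: (features,) input vector.
    gate_w: (K, features) gating weights producing mixing proportions.
    expert_ws: list of K (classes, features) expert weight matrices.
    Returns a (classes,) probability vector.
    """
    pi = softmax(gate_w @ x)                                       # (K,)
    expert_probs = np.stack([softmax(W @ x) for W in expert_ws])   # (K, classes)
    return pi @ expert_probs                                       # (classes,)
```

Because the gate outputs a distribution over experts and each expert outputs a distribution over classes, the mixture is itself a valid class distribution, which is what lets it serve as the response model inside a supervised topic model.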
Cited by: 1
Journal
2011 IEEE 11th International Conference on Data Mining