2011 IEEE 11th International Conference on Data Mining: Latest Publications

Cross-Temporal Link Prediction

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.45

S. Oyama, K. Hayashi, H. Kashima

The increasing interest in dynamically changing networks has led to growing interest in a more general link prediction problem called temporal link prediction in the data mining and machine learning communities. However, only links in identical time frames are considered in temporal link prediction. We propose a new link prediction problem called cross-temporal link prediction in which the links among nodes in different time frames are inferred. A typical example of cross-temporal link prediction is cross-temporal entity resolution to determine the identity of real entities represented by data objects observed in different time periods. In dynamic environments, the features of data change over time, making it difficult to identify cross-temporal links by directly comparing observed data. Other examples of cross-temporal links are asynchronous communications in social networks such as Facebook and Twitter, where a message is posted in reply to a previous message. We adopt a dimension reduction approach to cross-temporal link prediction, that is, data objects in different time frames are mapped into a common low-dimensional latent feature space, and the links are identified on the basis of the distance between the data objects. The proposed method uses different low-dimensional feature projections in different time frames, enabling it to adapt to changes in the latent features over time. Using multi-task learning, it jointly learns a set of feature projection matrices from the training data, given the assumption of temporal smoothness of the projections. The optimal solutions are obtained by solving a single generalized eigenvalue problem. Experiments using a real-world set of bibliographic data for cross-temporal entity resolution showed that introducing time-dependent feature projections improves the accuracy of link prediction.
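The idea of scoring cross-temporal links by distance after time-specific projection can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the projection matrices `W_2001` and `W_2011` are fixed by hand rather than learned (the generalized eigenvalue step from the abstract is not reproduced), and the toy feature vectors and function names are hypothetical.

```python
import numpy as np

def project(x, W):
    """Map a feature vector x of shape (d,) into the latent space using W of shape (d, k)."""
    return W.T @ x

def cross_temporal_distance(x_s, W_s, x_t, W_t):
    """Euclidean distance between two objects after their time-specific projections."""
    return float(np.linalg.norm(project(x_s, W_s) - project(x_t, W_t)))

# Hypothetical projections for two time frames (hand-picked, not learned).
W_2001 = np.eye(3)[:, :2]          # keeps the first two features
W_2011 = 0.5 * np.eye(3)[:, :2]    # additionally undoes a 2x feature drift

# Toy data: entity "a" observed in both frames; its features doubled over time.
a_2001 = np.array([1.0, 2.0, 0.0])
a_2011 = 2.0 * a_2001
b_2011 = np.array([6.0, -2.0, 0.0])   # a different entity in the later frame

d_raw = float(np.linalg.norm(a_2001 - a_2011))                         # direct comparison
d_same = cross_temporal_distance(a_2001, W_2001, a_2011, W_2011)       # same entity
d_diff = cross_temporal_distance(a_2001, W_2001, b_2011, W_2011)       # different entity
```

In this toy setup the raw feature distance between the two observations of the same entity is nonzero, while the distance in the shared latent space vanishes, which is the motivation for using a different projection per time frame.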

Citations: 42

Boolean Tensor Factorizations

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.28

Pauli Miettinen

Tensors are multi-way generalizations of matrices, and similarly to matrices, they can also be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and tensors. With the widespread adoption of matrix factorization techniques in data mining, tensor factorizations have also started to gain attention. In this paper we study Boolean tensor factorizations. We assume that the data is binary multi-way data, and we want to factorize it into binary factors using Boolean arithmetic (i.e., defining that 1+1=1). Boolean tensor factorizations are, therefore, a natural generalization of Boolean matrix factorizations. We study the theory of Boolean tensor factorizations and show that at least some of the benefits Boolean matrix factorizations have over normal matrix factorizations carry over to tensor data. We also present algorithms for Boolean variations of the CP and Tucker decompositions, the two most common types of tensor factorizations. With experiments on synthetic and real-world data, we show that Boolean tensor factorizations are a viable alternative when the data is naturally binary.
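To make the Boolean arithmetic concrete, the following sketch reconstructs a three-way binary tensor from binary CP factor matrices, where a cell is 1 if at least one rank-1 component covers it. It only illustrates the reconstruction rule (1+1=1); it is not one of the paper's factorization algorithms, and the function name and toy factors are hypothetical.

```python
import numpy as np

def boolean_cp_reconstruct(A, B, C):
    """Boolean CP reconstruction from binary factors A (I, R), B (J, R), C (K, R).

    X[i, j, k] = OR over r of (A[i, r] AND B[j, r] AND C[k, r]).
    Also returns the ordinary (integer) reconstruction for comparison.
    """
    A, B, C = (np.asarray(M, dtype=int) for M in (A, B, C))
    # Number of rank-1 components covering each cell (normal arithmetic).
    counts = np.einsum('ir,jr,kr->ijk', A, B, C)
    # Boolean arithmetic: any positive count collapses to 1.
    return counts > 0, counts

# Toy rank-2 factors for a 2 x 2 x 1 tensor.
A = [[1, 1], [0, 1]]
B = [[1, 0], [1, 1]]
C = [[1, 1]]

X_bool, X_counts = boolean_cp_reconstruct(A, B, C)
```

Cell (0, 1, 0) is covered by both components, so the ordinary reconstruction gives 2 there while the Boolean one stays binary.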

Citations: 62

Combining Feature Context and Spatial Context for Image Pattern Discovery

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.38

Hongxing Wang, Junsong Yuan, Yap-Peng Tan

Once an image is decomposed into a number of visual primitives, e.g., local interest points or salient image regions, it is of great interest to discover meaningful visual patterns from them. Conventional clustering (e.g., k-means) of visual primitives, however, usually ignores the spatial dependency among them and thus cannot discover high-level visual patterns with complex spatial structure. To overcome this problem, we propose to consider both spatial and feature contexts among visual primitives for pattern discovery. By discovering both spatial co-occurrence patterns among visual primitives and feature co-occurrence patterns among different types of features, our method can leverage these co-occurrences to better handle the ambiguities of visual primitives. We formulate the problem as a regularized k-means clustering, and propose an iterative bottom-up/top-down self-learning procedure to gradually refine the result until it converges. Experiments on image texton discovery and image region clustering show that combining spatial and feature contexts can significantly improve the pattern discovery results.

Citations: 13

Towards Optimal Discriminating Order for Multiclass Classification

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.147

Dong Liu, Shuicheng Yan, Yadong Mu, Xiansheng Hua, Shih-Fu Chang, HongJiang Zhang

In this paper, we investigate how to design an optimized discriminating order for boosting multiclass classification. The main idea is to optimize a binary tree architecture, referred to as Sequential Discriminating Tree (SDT), that performs the multiclass classification through a hierarchical sequence of coarse-to-fine binary classifiers. To infer such a tree architecture, we employ a constrained large margin clustering procedure that forces samples belonging to the same class to lie on the same side of the hyperplane while maximizing the margin between the two partitioned class subsets. The proposed SDT algorithm has a theoretical error bound, which is shown experimentally to effectively guarantee the generalization performance. Experimental results indicate that SDT clearly outperforms state-of-the-art multiclass classification algorithms.

Citations: 7

Direct Robust Matrix Factorization for Anomaly Detection

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.52

L. Xiong, X. Chen, J. Schneider

Matrix factorization methods are extremely useful in many data mining tasks, yet their performance is often degraded by outliers. In this paper, we propose a novel robust matrix factorization algorithm that is insensitive to outliers. We directly formulate robust factorization as a matrix approximation problem with constraints on the rank of the matrix and the cardinality of the outlier set. Then, unlike existing methods that resort to convex relaxations, we solve this problem directly and efficiently. In addition, structural knowledge about the outliers can be incorporated to find outliers more effectively. We applied this method to anomaly detection tasks on various data sets. Empirical results show that this new algorithm is effective in robust modeling and anomaly detection, and our direct solution achieves superior performance over the state-of-the-art methods based on the L1-norm and the nuclear norm of matrices.
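A problem with a rank constraint on the low-rank part and a cardinality constraint on the outlier set can be attacked by alternating exact updates of the two parts. The sketch below is one plausible way to do this and is an assumption, not a confirmed reproduction of the paper's solver: the low-rank part is updated by truncated SVD and the outlier part by keeping the largest-magnitude residual entries. The function names and toy data are hypothetical, and the procedure is a heuristic that may stop at a local optimum.

```python
import numpy as np

def truncated_svd(M, rank):
    """Best rank-`rank` approximation of M in Frobenius norm."""
    U, s, Vt = np.linalg.svd(M, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def keep_largest(M, max_outliers):
    """Keep the `max_outliers` largest-magnitude entries of M, zero out the rest."""
    S = np.zeros_like(M)
    if max_outliers > 0:
        idx = np.argpartition(np.abs(M).ravel(), -max_outliers)[-max_outliers:]
        S.flat[idx] = M.flat[idx]
    return S

def direct_robust_mf(X, rank, max_outliers, n_iter=50):
    """Alternate between the low-rank part L and the sparse outlier part S.

    Approximately minimizes ||X - L - S||_F subject to
    rank(L) <= rank and at most `max_outliers` nonzeros in S.
    """
    S = np.zeros_like(X)
    for _ in range(n_iter):
        L = truncated_svd(X - S, rank)          # exact update of L for fixed S
        S = keep_largest(X - L, max_outliers)   # exact update of S for fixed L
    return L, S

# Toy data: a rank-2 matrix with two corrupted entries.
rng = np.random.default_rng(0)
X = rng.normal(size=(8, 2)) @ rng.normal(size=(2, 6))
X[1, 4] += 25.0
X[5, 0] -= 25.0

L, S = direct_robust_mf(X, rank=2, max_outliers=2)
anomalies = np.argwhere(S != 0)   # candidate outlier positions
```

Because each update is an exact minimizer for its block, the approximation error never increases across iterations. Entries flagged in `S` can then be treated as anomaly candidates.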

Citations: 106

Semi-supervised Feature Importance Evaluation with Ensemble Learning

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.129

H. Barkia, H. Elghazel, A. Aussem

We consider the problem of using a large amount of unlabeled data to improve the efficiency of feature selection in high-dimensional datasets when only a small set of labeled examples is available. We propose a new semi-supervised feature importance evaluation method (SSFI for short) that combines ideas from co-training and random forests with a new permutation-based out-of-bag feature importance measure. We provide empirical results on several benchmark datasets indicating that SSFI can lead to significant improvement over state-of-the-art semi-supervised and supervised algorithms.

Citations: 15

Using Bayesian Network Learning Algorithm to Discover Causal Relations in Multivariate Time Series

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.153

Zhenxing Wang, L. Chan

Many applications naturally involve time series data, and vector autoregression (VAR) and structural VAR (SVAR) are the dominant tools for investigating relations between variables in time series. In the first part of this work, we show that the SVAR method is incapable of identifying contemporaneous causal relations when the data follow Gaussian distributions. In addition, least squares estimators become unreliable when the scale of the problem is large and observations are limited. In the remaining part, we propose an approach that applies Bayesian network learning algorithms to identify SVARs from time series data in order to capture both temporal and contemporaneous causal relations and avoid high-order statistical tests. The difficulty of applying Bayesian network learning algorithms to time series is that the networks corresponding to time series tend to be large, and in this case Bayesian network learning algorithms require high-order statistical tests. To overcome this difficulty, we show that the search space of conditioning sets d-separating two vertices should be subsets of Markov blankets. Based on this fact, we propose an algorithm that learns Bayesian networks locally and makes the largest order of statistical tests independent of the scale of the problem. Empirical results show that our algorithm outperforms existing methods in terms of both efficiency and accuracy.

Citations: 8

A Generalized Fast Subset Sums Framework for Bayesian Event Detection

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.11

Kanghong Shao, Yandong Liu, Daniel B. Neill

We present Generalized Fast Subset Sums (GFSS), a new Bayesian framework for scalable and accurate detection of irregularly shaped spatial clusters using multiple data streams. GFSS extends the previously proposed Multivariate Bayesian Scan Statistic (MBSS) and Fast Subset Sums (FSS) approaches for detection of emerging events. The detection power of MBSS is primarily limited by computational considerations, which limit it to searching over circular spatial regions. GFSS enables more accurate and timely detection by defining a hierarchical prior over all subsets of the N locations, first selecting a local neighborhood consisting of a center location and its neighbors, and introducing a sparsity parameter p to describe how likely each location in the neighborhood is to be affected. This approach allows us to consider all possible subsets of locations (including irregularly-shaped regions) but also puts higher weight on more compact regions. We demonstrate that MBSS and FSS are both special cases of this general framework (assuming p = 1 and p = 0.5 respectively), but substantially higher detection power can be achieved by choosing an appropriate value of p. Thus we show that the distribution of the sparsity parameter p can be accurately learned from a small number of labeled events. Our evaluation results (on synthetic disease outbreaks injected into real-world hospital data) show that the GFSS method with learned sparsity parameter has higher detection power and spatial accuracy than MBSS and FSS, particularly when the affected region is irregular or elongated. We also show that the learned models can be used for event characterization, accurately distinguishing between two otherwise identical event types based on the sparsity of the affected spatial region.
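The role of the sparsity parameter p can be illustrated with a small sketch of the subset prior inside one neighborhood. It assumes that each of the k locations in the neighborhood is included independently with probability p, so a subset S has conditional prior p^|S| (1 - p)^(k - |S|). That reading is consistent with the two special cases named in the abstract but is an assumption about the details, including whether the empty subset is allowed. The brute-force enumeration is for illustration only and is not the efficient computation the framework's name refers to; the function name and location labels are hypothetical.

```python
from itertools import combinations

def subset_prior(neighborhood, p):
    """Return {subset: prior probability} over all subsets of a neighborhood.

    Assumes independent inclusion of each location with probability p.
    """
    k = len(neighborhood)
    prior = {}
    for size in range(k + 1):
        for subset in combinations(neighborhood, size):
            prior[frozenset(subset)] = p ** size * (1 - p) ** (k - size)
    return prior

neighborhood = ["center", "n1", "n2"]   # hypothetical location labels

prior_full = subset_prior(neighborhood, 1.0)      # p = 1
prior_uniform = subset_prior(neighborhood, 0.5)   # p = 0.5
prior_sparse = subset_prior(neighborhood, 0.2)    # small p favors small subsets
```

With p = 1 all prior mass sits on the full neighborhood, which matches searching whole (e.g., circular) regions, and with p = 0.5 every subset of the neighborhood is equally likely. Intermediate values shift the mass toward denser or sparser subsets.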

Citations: 10

SLIM: Sparse Linear Methods for Top-N Recommender Systems

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.134

Xia Ning, G. Karypis

This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an ℓ1-norm and ℓ2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.
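The learning step described above can be sketched as follows. The abstract only says that W comes from an ℓ1- and ℓ2-regularized problem, so the concrete objective used here is an assumption: minimize 0.5 * ||A - A W||_F^2 + 0.5 * l2 * ||W||_F^2 + l1 * ||W||_1 with W nonnegative and a zero diagonal. The coordinate-descent solver, the function names, the parameter values, and the toy matrix are likewise illustrative choices, not the authors' implementation.

```python
import numpy as np

def slim_fit(A, l1=0.1, l2=0.5, n_sweeps=50):
    """Learn a nonnegative, zero-diagonal item-item matrix W by coordinate descent.

    A is a (users x items) purchase/rating matrix. Each column of W is fitted
    independently; l2 must be positive so every update is well defined.
    """
    A = np.asarray(A, dtype=float)
    n_items = A.shape[1]
    col_sq = (A ** 2).sum(axis=0)          # squared norm of each item column
    W = np.zeros((n_items, n_items))
    for j in range(n_items):
        w = np.zeros(n_items)
        r = A[:, j].copy()                 # residual a_j - A @ w
        for _ in range(n_sweeps):
            for i in range(n_items):
                if i == j:                 # zero diagonal: an item cannot explain itself
                    continue
                rho = A[:, i] @ r + col_sq[i] * w[i]
                w_new = max(0.0, rho - l1) / (col_sq[i] + l2)   # soft-threshold, clip at 0
                r -= A[:, i] * (w_new - w[i])
                w[i] = w_new
        W[:, j] = w
    return W

def recommend(A, W, user, n=2):
    """Top-n unseen items for a user, scored by aggregating the user's profile with W."""
    A = np.asarray(A, dtype=float)
    scores = A[user] @ W
    scores[A[user] > 0] = -np.inf          # never recommend items already in the profile
    return [int(i) for i in np.argsort(-scores)[:n]]

# Toy purchase matrix: 4 users x 4 items.
A = np.array([
    [1, 1, 0, 0],
    [1, 1, 1, 0],
    [0, 0, 1, 1],
    [0, 1, 1, 1],
], dtype=float)

W = slim_fit(A)
top_items = recommend(A, W, user=0, n=2)
```

Scoring a user is a single sparse vector-matrix product, which is why a sparse W makes recommendation fast. If a user has fewer than n unseen items, this sketch pads the result with items already in the profile, so a caller would need to filter those out.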

Citations: 672

Tag Clustering and Refinement on Semantic Unity Graph

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.141

Yang Liu, Fei Wu, Yin Zhang, Jian Shao, Yueting Zhuang

Recently, there has been extensive research on user-provided tags on photo-sharing websites, which can greatly facilitate image retrieval and management. However, due to the arbitrariness of tagging activities, these tags are often imprecise and incomplete. As a result, quite a few technologies have been proposed to improve the user experience on these photo-sharing systems, such as tag clustering and refinement. In this work, we propose a novel framework to model the relationships among tags and images, which can be applied to many tag-based applications. Different from previous approaches that model images and tags as heterogeneous objects, images and their tags are uniformly viewed as compositions of Semantic Unities in our framework. The Semantic Unity Graph (SUG) is then introduced to represent the complex and high-order relationships among these Semantic Unities. Based on the Semantic Unity Graph representation, the relevance of images and tags can be naturally measured in terms of the similarity of their Semantic Unities. Tag clustering and refinement can then be performed on the SUG, and the polysemy of images and tags is explicitly considered in this framework. Experimental results on the NUS-WIDE and MIR-Flickr datasets demonstrate the effectiveness and efficiency of the proposed approach.

Citations: 17
