Growing attention to dynamically changing networks has led the data mining and machine learning communities to study a more general link prediction problem called temporal link prediction. However, temporal link prediction considers only links within the same time frame. We propose a new link prediction problem, cross-temporal link prediction, in which links among nodes in different time frames are inferred. A typical example is cross-temporal entity resolution, which determines the identity of the real entities represented by data objects observed in different time periods. In dynamic environments, the features of data change over time, making it difficult to identify cross-temporal links by directly comparing observed data. Other examples of cross-temporal links are asynchronous communications in social networks such as Facebook and Twitter, where a message is posted in reply to a previous message. We adopt a dimension reduction approach to cross-temporal link prediction: data objects in different time frames are mapped into a common low-dimensional latent feature space, and links are identified on the basis of the distance between the data objects. The proposed method uses different low-dimensional feature projections for different time frames, enabling it to adapt to changes in the latent features over time. Using multi-task learning, it jointly learns a set of feature projection matrices from the training data under the assumption of temporal smoothness of the projections. The optimal solutions are obtained by solving a single generalized eigenvalue problem. Experiments on a real-world bibliographic data set for cross-temporal entity resolution show that introducing time-dependent feature projections improves the accuracy of link prediction.
{"title":"Cross-Temporal Link Prediction","authors":"S. Oyama, K. Hayashi, H. Kashima","doi":"10.1109/ICDM.2011.45","DOIUrl":"https://doi.org/10.1109/ICDM.2011.45","url":null,"abstract":"The increasing interest in dynamically changing networks has led to growing interest in a more general link prediction problem called temporal link prediction in the data mining and machine learning communities. However, only links in identical time frames are considered in temporal link prediction. We propose a new link prediction problem called cross-temporal link prediction in which the links among nodes in different time frames are inferred. A typical example of cross-temporal link prediction is cross-temporal entity resolution to determine the identity of real entities represented by data objects observed in different time periods. In dynamic environments, the features of data change over time, making it difficult to identify cross-temporal links by directly comparing observed data. Other examples of cross-temporal links are asynchronous communications in social networks such as Face book and Twitter, where a message is posted in reply to a previous message. We adopt a dimension reduction approach to cross-temporal link prediction, that is, data objects in different time frames are mapped into a common low-dimensional latent feature space, and the links are identified on the basis of the distance between the data objects. The proposed method uses different low-dimensional feature projections in different time frames, enabling it to adapt to changes in the latent features over time. Using multi-task learning, it jointly learns a set of feature projection matrices from the training data, given the assumption of temporal smoothness of the projections. The optimal solutions are obtained by solving a single generalized eigenvalue problem. Experiments using a real-world set of bibliographic data for cross-temporal entity resolution showed that introducing time-dependent feature projections improves the accuracy of link prediction.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131719765","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Tensors are multi-way generalizations of matrices, and like matrices they can be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and tensors. With the widespread adoption of matrix factorization techniques in data mining, tensor factorizations have also started to gain attention. In this paper we study Boolean tensor factorizations. We assume the data is binary multi-way data, and we want to factorize it into binary factors using Boolean arithmetic (i.e., defining 1+1=1). Boolean tensor factorizations are therefore a natural generalization of Boolean matrix factorizations. We study the theory of Boolean tensor factorizations and show that at least some of the benefits Boolean matrix factorizations have over normal matrix factorizations carry over to tensor data. We also present algorithms for Boolean variants of the CP and Tucker decompositions, the two most common types of tensor factorization. Through experiments with synthetic and real-world data, we show that Boolean tensor factorizations are a viable alternative when the data is naturally binary.
{"title":"Boolean Tensor Factorizations","authors":"Pauli Miettinen","doi":"10.1109/ICDM.2011.28","DOIUrl":"https://doi.org/10.1109/ICDM.2011.28","url":null,"abstract":"Tensors are multi-way generalizations of matrices, and similarly to matrices, they can also be factorized, that is, represented (approximately) as a product of factors. These factors are typically either all matrices or a mixture of matrices and tensors. With the widespread adoption of matrix factorization techniques in data mining, also tensor factorizations have started to gain attention. In this paper we study the Boolean tensor factorizations. We assume that the data is binary multi-way data, and we want to factorize it to binary factors using Boolean arithmetic (i.e. defining that 1+1=1). Boolean tensor factorizations are, therefore, natural generalization of the Boolean matrix factorizations. We will study the theory of Boolean tensor factorizations and show that at least some of the benefits Boolean matrix factorizations have over normal matrix factorizations carry over to the tensor data. We will also present algorithms for Boolean variations of CP and Tucker decompositions, the two most-common types of tensor factorizations. With experimentation done with synthetic and real-world data, we show that Boolean tensor factorizations are a viable alternative when the data is naturally binary.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"741 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131885458","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Once an image is decomposed into a number of visual primitives, e.g., local interest points or salient image regions, it is of great interest to discover meaningful visual patterns from them. Conventional clustering (e.g., k-means) of visual primitives, however, usually ignores the spatial dependency among them and thus cannot discover high-level visual patterns with complex spatial structure. To overcome this problem, we propose to consider both spatial and feature contexts among visual primitives for pattern discovery. By discovering both spatial co-occurrence patterns among visual primitives and feature co-occurrence patterns among different types of features, our method can better handle the ambiguities of visual primitives by leveraging these co-occurrences. We formulate the problem as a regularized k-means clustering and propose an iterative bottom-up/top-down self-learning procedure that gradually refines the result until it converges. Experiments on image texton discovery and image region clustering show that combining spatial and feature contexts can significantly improve the pattern discovery results.
{"title":"Combining Feature Context and Spatial Context for Image Pattern Discovery","authors":"Hongxing Wang, Junsong Yuan, Yap-Peng Tan","doi":"10.1109/ICDM.2011.38","DOIUrl":"https://doi.org/10.1109/ICDM.2011.38","url":null,"abstract":"Once an image is decomposed into a number of visual primitives, e.g., local interest points or salient image regions, it is of great interests to discover meaningful visual patterns from them. Conventional clustering (e.g., k-means) of visual primitives, however, usually ignores the spatial dependency among them, thus cannot discover the high-level visual patterns of complex spatial structure. To overcome this problem, we propose to consider both spatial and feature contexts among visual primitives for pattern discovery. By discovering both spatial co-occurrence patterns among visual primitives and feature co-occurrence patterns among different types of features, our method can better handle the ambiguities of visual primitives, by leveraging these co-occurrences. We formulate the problem as a regularized k-means clustering, and propose an iterative bottom-up/top-down self-learning procedure to gradually refine the result until it converges. The experiments of image text on discovery and image region clustering convince that combining spatial and feature contexts can significantly improve the pattern discovery results.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"56 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133030588","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we investigate how to design an optimized discriminating order to boost multiclass classification. The main idea is to optimize a binary tree architecture, referred to as a Sequential Discriminating Tree (SDT), that performs multiclass classification through a hierarchical sequence of coarse-to-fine binary classifiers. To infer such a tree architecture, we employ a constrained large-margin clustering procedure that forces samples belonging to the same class to lie on the same side of the hyperplane while maximizing the margin between the two partitioned class subsets. The proposed SDT algorithm has a theoretical error bound, which is shown experimentally to effectively guarantee generalization performance. Experimental results indicate that SDT clearly beats state-of-the-art multiclass classification algorithms.
{"title":"Towards Optimal Discriminating Order for Multiclass Classification","authors":"Dong Liu, Shuicheng Yan, Yadong Mu, Xiansheng Hua, Shih-Fu Chang, HongJiang Zhang","doi":"10.1109/ICDM.2011.147","DOIUrl":"https://doi.org/10.1109/ICDM.2011.147","url":null,"abstract":"In this paper, we investigate how to design an optimized discriminating order for boosting multiclass classification. The main idea is to optimize a binary tree architecture, referred to as Sequential Discriminating Tree (SDT), that performs the multiclass classification through a hierarchical sequence of coarse-to-fine binary classifiers. To infer such a tree architecture, we employ the constrained large margin clustering procedure which enforces samples belonging to the same class to locate at the same side of the hyper plane while maximizing the margin between these two partitioned class subsets. The proposed SDT algorithm has a theoretic error bound which is shown experimentally to effectively guarantee the generalization performance. Experiment results indicate that SDT clearly beats the state-of-the-art multiclass classification algorithms.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"48 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133818802","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Matrix factorization methods are extremely useful in many data mining tasks, yet their performance is often degraded by outliers. In this paper, we propose a novel robust matrix factorization algorithm that is insensitive to outliers. We directly formulate robust factorization as a matrix approximation problem with constraints on the rank of the matrix and the cardinality of the outlier set. Then, unlike existing methods that resort to convex relaxations, we solve this problem directly and efficiently. In addition, structural knowledge about the outliers can be incorporated to find them more effectively. We applied the method to anomaly detection tasks on various data sets. Empirical results show that the new algorithm is effective in robust modeling and anomaly detection, and our direct solution achieves superior performance over state-of-the-art methods based on the L1-norm and the nuclear norm of matrices.
{"title":"Direct Robust Matrix Factorizatoin for Anomaly Detection","authors":"L. Xiong, X. Chen, J. Schneider","doi":"10.1109/ICDM.2011.52","DOIUrl":"https://doi.org/10.1109/ICDM.2011.52","url":null,"abstract":"Matrix factorization methods are extremely useful in many data mining tasks, yet their performances are often degraded by outliers. In this paper, we propose a novel robust matrix factorization algorithm that is insensitive to outliers. We directly formulate robust factorization as a matrix approximation problem with constraints on the rank of the matrix and the cardinality of the outlier set. Then, unlike existing methods that resort to convex relaxations, we solve this problem directly and efficiently. In addition, structural knowledge about the outliers can be incorporated to find outliers more effectively. We applied this method in anomaly detection tasks on various data sets. Empirical results show that this new algorithm is effective in robust modeling and anomaly detection, and our direct solution achieves superior performance over the state-of-the-art methods based on the L1-norm and the nuclear norm of matrices.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127918831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We consider the problem of using a large amount of unlabeled data to improve the efficiency of feature selection in high-dimensional datasets when only a small set of labeled examples is available. We propose a new semi-supervised feature importance evaluation method (SSFI for short) that combines ideas from co-training and random forests with a new permutation-based out-of-bag feature importance measure. We provide empirical results on several benchmark datasets indicating that SSFI can lead to significant improvements over state-of-the-art semi-supervised and supervised algorithms.
{"title":"Semi-supervised Feature Importance Evaluation with Ensemble Learning","authors":"H. Barkia, H. Elghazel, A. Aussem","doi":"10.1109/ICDM.2011.129","DOIUrl":"https://doi.org/10.1109/ICDM.2011.129","url":null,"abstract":"We consider the problem of using a large amount of unlabeled data to improve the efficiency of feature selection in high dimensional datasets, when only a small set of labeled examples is available. We propose a new semi-supervised feature importance evaluation method (SSFI for short), that combines ideas from co-training and random forests with a new permutation-based out-of-bag feature importance measure. We provide empirical results on several benchmark datasets indicating that SSFI can lead to significant improvement over state-of-the-art semi-supervised and supervised algorithms.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117237828","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Many applications naturally involve time series data, and vector autoregression (VAR) and structural VAR (SVAR) are the dominant tools for investigating relations between variables in time series. In the first part of this work, we show that the SVAR method is incapable of identifying contemporaneous causal relations when the data follow Gaussian distributions. In addition, least squares estimators become unreliable when the scale of the problem is large and observations are limited. In the second part, we propose an approach that applies Bayesian network learning algorithms to identify SVARs from time series data in order to capture both temporal and contemporaneous causal relations while avoiding high-order statistical tests. The difficulty in applying Bayesian network learning algorithms to time series is that the networks corresponding to time series tend to be large, so these algorithms require high-order statistical tests. To overcome this difficulty, we show that the search space of conditioning sets d-separating two vertices should be subsets of Markov blankets. Based on this fact, we propose an algorithm that learns Bayesian networks locally and makes the largest order of the statistical tests independent of the scale of the problem. Empirical results show that our algorithm outperforms existing methods in terms of both efficiency and accuracy.
{"title":"Using Bayesian Network Learning Algorithm to Discover Causal Relations in Multivariate Time Series","authors":"Zhenxing Wang, L. Chan","doi":"10.1109/ICDM.2011.153","DOIUrl":"https://doi.org/10.1109/ICDM.2011.153","url":null,"abstract":"Many applications naturally involve time series data, and the vector auto regression (VAR) and the structural VAR (SVAR) are dominant tools to investigate relations between variables in time series. In the first part of this work, we show that the SVAR method is incapable of identifying contemporaneous causal relations when data follow Gaussian distributions. In addition, least squares estimators become unreliable when the scales of the problems are large and observations are limited. In the remaining part, we propose an approach to apply Bayesian network learning algorithms to identify SVARs from time series data in order to capture both temporal and contemporaneous causal relations and avoid high-order statistical tests. The difficulty of applying Bayesian network learning algorithms to time series is that the sizes of the networks corresponding to time series tend to be large and high-order statistical tests are required by Bayesian network learning algorithms in this case. To overcome the difficulty, we show that the search space of conditioning sets d-separating two vertices should be subsets of Markov blankets. Based on this fact, we propose an algorithm learning Bayesian networks locally and making the largest order of statistical tests independent of the scales of the problems. Empirical results show that our algorithm outperforms existing methods in terms of both efficiency and accuracy.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"38 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"125076658","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
We present Generalized Fast Subset Sums (GFSS), a new Bayesian framework for scalable and accurate detection of irregularly shaped spatial clusters using multiple data streams. GFSS extends the previously proposed Multivariate Bayesian Scan Statistic (MBSS) and Fast Subset Sums (FSS) approaches for detecting emerging events. The detection power of MBSS is primarily limited by computational considerations, which restrict it to searching over circular spatial regions. GFSS enables more accurate and timely detection by defining a hierarchical prior over all subsets of the N locations: it first selects a local neighborhood consisting of a center location and its neighbors, and a sparsity parameter p describes how likely each location in the neighborhood is to be affected. This approach allows us to consider all possible subsets of locations (including irregularly shaped regions) while putting higher weight on more compact regions. We demonstrate that MBSS and FSS are both special cases of this general framework (with p = 1 and p = 0.5, respectively), but that substantially higher detection power can be achieved by choosing an appropriate value of p. We then show that the distribution of the sparsity parameter p can be accurately learned from a small number of labeled events. Our evaluation results (on synthetic disease outbreaks injected into real-world hospital data) show that the GFSS method with a learned sparsity parameter has higher detection power and spatial accuracy than MBSS and FSS, particularly when the affected region is irregular or elongated. We also show that the learned models can be used for event characterization, accurately distinguishing between two otherwise identical event types based on the sparsity of the affected spatial region.
{"title":"A Generalized Fast Subset Sums Framework for Bayesian Event Detection","authors":"Kanghong Shao, Yandong Liu, Daniel B. Neill","doi":"10.1109/ICDM.2011.11","DOIUrl":"https://doi.org/10.1109/ICDM.2011.11","url":null,"abstract":"We present Generalized Fast Subset Sums (GFSS), a new Bayesian framework for scalable and accurate detection of irregularly shaped spatial clusters using multiple data streams. GFSS extends the previously proposed Multivariate Bayesian Scan Statistic (MBSS) and Fast Subset Sums (FSS) approaches for detection of emerging events. The detection power of MBSS is primarily limited by computational considerations, which limit it to searching over circular spatial regions. GFSS enables more accurate and timely detection by defining a hierarchical prior over all subsets of the N locations, first selecting a local neighborhood consisting of a center location and its neighbors, and introducing a sparsity parameter p to describe how likely each location in the neighborhood is to be affected. This approach allows us to consider all possible subsets of locations (including irregularly-shaped regions) but also puts higher weight on more compact regions. We demonstrate that MBSS and FSS are both special cases of this general framework (assuming p = 1 and p = 0.5 respectively), but substantially higher detection power can be achieved by choosing an appropriate value of p. Thus we show that the distribution of the sparsity parameter p can be accurately learned from a small number of labeled events. Our evaluation results (on synthetic disease outbreaks injected into real-world hospital data) show that the GFSS method with learned sparsity parameter has higher detection power and spatial accuracy than MBSS and FSS, particularly when the affected region is irregular or elongated. We also show that the learned models can be used for event characterization, accurately distinguishing between two otherwise identical event types based on the sparsity of the affected spatial region.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"37 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124494259","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned by SLIM by solving an ℓ1-norm and ℓ2-norm regularized optimization problem. W is demonstrated to produce high-quality recommendations, and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments compares SLIM with other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements in both runtime performance and recommendation quality over the best existing methods.
{"title":"SLIM: Sparse Linear Methods for Top-N Recommender Systems","authors":"Xia Ning, G. Karypis","doi":"10.1109/ICDM.2011.134","DOIUrl":"https://doi.org/10.1109/ICDM.2011.134","url":null,"abstract":"This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an `1-norm and `2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"54 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121635800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Recently, there has been extensive research on the user-provided tags on photo sharing websites, which can greatly facilitate image retrieval and management. However, due to the arbitrariness of tagging activities, these tags are often imprecise and incomplete. As a result, quite a few techniques have been proposed to improve the user experience on these photo sharing systems, including tag clustering and refinement. In this work, we propose a novel framework to model the relationships among tags and images that can be applied to many tag-based applications. Unlike previous approaches, which model images and tags as heterogeneous objects, in our framework images and their tags are uniformly viewed as compositions of Semantic Unities. A Semantic Unity Graph (SUG) is then introduced to represent the complex, high-order relationships among these Semantic Unities. Based on this representation, the relevance of images and tags can be naturally measured in terms of the similarity of their Semantic Unities. Tag clustering and refinement can then be performed on the SUG, and the polysemy of images and tags is explicitly considered in this framework. Experimental results on the NUS-WIDE and MIR-Flickr datasets demonstrate the effectiveness and efficiency of the proposed approach.
{"title":"Tag Clustering and Refinement on Semantic Unity Graph","authors":"Yang Liu, Fei Wu, Yin Zhang, Jian Shao, Yueting Zhuang","doi":"10.1109/ICDM.2011.141","DOIUrl":"https://doi.org/10.1109/ICDM.2011.141","url":null,"abstract":"Recently, there has been extensive research towards the user-provided tags on photo sharing websites which can greatly facilitate image retrieval and management. However, due to the arbitrariness of the tagging activities, these tags are often imprecise and incomplete. As a result, quite a few technologies has been proposed to improve the user experience on these photo sharing systems, including tag clustering and refinement, etc. In this work, we propose a novel framework to model the relationships among tags and images which can be applied to many tag based applications. Different from previous approaches which model images and tags as heterogeneous objects, images and their tags are uniformly viewed as compositions of Semantic Unities in our framework. Then Semantic Unity Graph (SUG) is introduced to represent the complex and high-order relationships among these Semantic Unities. Based on the representation of Semantic Unity Graph, the relevance of images and tags can be naturally measured in terms of the similarity of their Semantic Unities. Then Tag clustering and refinement can then be performed on SUG and the polysemy of images and tags is explicitly considered in this framework. The experiment results conducted on NUS-WIDE and MIR-Flickr datasets demonstrate the effectiveness and efficiency of the proposed approach.","PeriodicalId":106216,"journal":{"name":"2011 IEEE 11th International Conference on Data Mining","volume":"35 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2011-12-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121910374","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}