
2011 IEEE 11th International Conference on Data Mining: Latest Publications

Generating Breakpoint-based Timeline Overview for News Topic Retrospection
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.71
Po Hu, Minlie Huang, Peng Xu, Weichang Li, A. Usadi, Xiaoyan Zhu
Though news readers can easily access a large number of news articles from the Internet, they can be overwhelmed by the quantity of information available, making it hard to get a concise, global picture of a news topic. In this paper we propose a novel method to address this problem. Given a set of articles on a news topic, the proposed method models theme variation through time and identifies breakpoints, i.e., time points at which decisive changes occur. For each breakpoint, a brief summary is automatically constructed from the articles associated with that time point. The summaries are then ordered chronologically to form a timeline overview of the news topic. In this fashion, readers can track various news topics efficiently. We conducted experiments on 15 popular topics from 2010; the results show the effectiveness of our approach and its advantages over other approaches.
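The breakpoint idea can be illustrated with a minimal sketch (not the authors' model, which is built on theme modeling over time): flag a time point whenever the term distribution of the news stream shifts sharply between adjacent days. The function name, the cosine-similarity criterion, and the threshold are our own illustrative choices.

```python
import numpy as np

def breakpoints(daily_term_counts, threshold=0.5):
    """Flag time points where the day-to-day term distribution shifts sharply.

    daily_term_counts: (T, V) array-like, one term-frequency vector per day.
    Returns the indices t whose cosine similarity to day t-1 falls below
    `threshold` -- a crude stand-in for the paper's theme-variation model.
    """
    X = np.asarray(daily_term_counts, dtype=float)
    X = X / np.clip(np.linalg.norm(X, axis=1, keepdims=True), 1e-12, None)
    sims = (X[:-1] * X[1:]).sum(axis=1)   # cosine of consecutive days
    return [t + 1 for t, s in enumerate(sims) if s < threshold]

# Toy stream: days 0-2 use one vocabulary, day 3 pivots to another.
counts = [[5, 1, 0], [4, 2, 0], [5, 1, 0], [0, 1, 6]]
print(breakpoints(counts))   # → [3]
```

A summary would then be built from the articles of each flagged day and the summaries concatenated in time order.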
Citations: 22
S-preconditioner for Multi-fold Data Reduction with Guaranteed User-Controlled Accuracy
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.138
Ye Jin, Sriram Lakshminarasimhan, Neil Shah, Zhenhuan Gong, Choong-Seock Chang, Jackie H. Chen, S. Ethier, H. Kolla, S. Ku, S. Klasky, R. Latham, R. Ross, K. Schuchardt, N. Samatova
The growing gap between the massive amounts of data generated by petascale scientific simulation codes and the capability of system hardware and software to effectively analyze this data necessitates data reduction. Yet increasing data complexity challenges most, if not all, of the existing data compression methods. In fact, lossless compression techniques offer no more than 10% reduction on the scientific data we have experience with, which is widely regarded as effectively incompressible. To bridge this gap, in this paper we advocate a transformative strategy that enables fast, accurate, and multi-fold reduction of double-precision floating-point scientific data. The intuition behind our method is inspired by the effective use of preconditioners for linear algebra solvers optimized for a particular class of computational "dwarfs" (e.g., dense or sparse matrices). Focusing on a commonly used multi-resolution wavelet compression technique as the underlying "solver" for data reduction, we propose the S-preconditioner, which transforms scientific data into a form with high global regularity to ensure a significant decrease in the number of wavelet coefficients stored for a segment of data. Combined with the subsequent EQ-calibrator, the resulting method, called S-Preconditioned EQ-Calibrated Wavelets (SW), robustly achieves a 4- to 5-fold data reduction while guaranteeing user-defined accuracy of the reconstructed data to within 1% point-by-point relative error, below 0.01 normalized RMSE, and above 0.99 Pearson correlation. In this paper, we show the results obtained by testing our method on six petascale simulation codes, covering fusion, combustion, climate, astrophysics, and subsurface groundwater, in addition to 13 publicly available scientific datasets. We also demonstrate that application-driven data mining tasks performed on the decompressed variables or their derived quantities produce results of quality comparable to those for the original data.
Citations: 3
Learning to Rank for Query-Focused Multi-document Summarization
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.91
Chao Shen, Tao Li
In this paper, we explore how to use a ranking SVM to train the feature weights for query-focused multi-document summarization. To apply a supervised learning method to sentence extraction in multi-document summarization, we need to derive sentence labels for the training corpus from the existing human-written summaries. However, this process is not trivial, because the human summaries are abstractive and do not necessarily match the sentences in the documents well. In this paper, we address the above problem from two aspects. First, we make use of sentence-to-sentence relationships to better estimate the probability that a sentence in the document set is a summary sentence. Second, to make the model less sensitive to errors in the derived training data, we adopt a cost-sensitive loss in the ranking SVM's objective function. The experimental results demonstrate the effectiveness of the proposed method.
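A pairwise ranking SVM with a per-pair cost can be sketched as follows; the solver (plain subgradient descent on the hinge loss) and all names are our own simplifications, not the paper's exact formulation.

```python
import numpy as np

def rank_svm(X, y, costs=None, lr=0.1, reg=0.01, epochs=500):
    """Tiny pairwise ranking SVM trained by subgradient descent.

    For every pair (i, j) with y[i] > y[j] we penalise the hinge loss
    max(0, 1 - w.(x_i - x_j)), optionally weighted by a per-pair cost --
    the cost-sensitive idea the paper applies to noisy derived labels.
    """
    X, y = np.asarray(X, dtype=float), np.asarray(y)
    pairs = [(i, j) for i in range(len(y)) for j in range(len(y)) if y[i] > y[j]]
    if costs is None:
        costs = {p: 1.0 for p in pairs}
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for i, j in pairs:
            d = X[i] - X[j]
            if 1.0 - w @ d > 0.0:              # margin violated
                w += lr * (costs[(i, j)] * d - reg * w)
            else:
                w -= lr * reg * w
    return w

# Sentences as feature vectors; y is a (derived, possibly noisy) relevance label.
X = [[1.0, 0.2], [0.8, 0.9], [0.1, 0.5]]
y = [2, 1, 0]
w = rank_svm(X, y)
scores = np.asarray(X) @ w
```

Down-weighting the cost of pairs whose derived labels are dubious is what makes the loss cost-sensitive.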
Citations: 35
Clusterability Analysis and Incremental Sampling for Nyström Extension Based Spectral Clustering
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.35
Xianchao Zhang, Quanzeng You
To alleviate the memory and computational burdens of spectral clustering on large-scale problems, some form of low-rank matrix approximation is usually employed. The Nyström method is an efficient technique for generating low-rank matrix approximations, and its most important ingredient is sampling. The matrix approximation errors of several sampling schemes have been analyzed theoretically for a number of learning tasks. However, the impact of matrix approximation error on the clustering performance of spectral clustering has not been studied. In this paper, we first analyze the performance of the Nyström method in terms of clusterability, thereby answering how matrix approximation error affects the clustering performance of spectral clustering. Our analysis immediately suggests an incremental sampling scheme for Nyström-based spectral clustering. Experimental results show that the proposed incremental sampling scheme outperforms existing sampling schemes on various clustering tasks and image segmentation applications, with efficiency comparable to existing schemes.
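The Nyström approximation that the paper builds on fits in a few lines: sample columns C = K[:, idx] and the corresponding block W = K[idx][:, idx], and approximate K ≈ C W⁺ Cᵀ. The sketch below takes the sample as given (the paper's contribution is choosing it incrementally); the kernel width is an assumed parameter.

```python
import numpy as np

def nystrom(K, idx):
    """Nystrom low-rank approximation K ~ C W^+ C^T from sampled columns.

    C = K[:, idx], W = K[idx][:, idx].  Which columns to sample is exactly
    what the paper's incremental scheme decides; here `idx` is just given.
    """
    C = K[:, idx]
    W = K[np.ix_(idx, idx)]
    return C @ np.linalg.pinv(W) @ C.T

rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
sq = ((X[:, None, :] - X[None, :, :]) ** 2).sum(axis=-1)
K = np.exp(-0.1 * sq)                    # smooth RBF kernel (assumed width)
approx = nystrom(K, list(range(10)))     # 10 of 50 columns
err = float(np.linalg.norm(K - approx) / np.linalg.norm(K))
```

Spectral clustering then runs its eigendecomposition on the approximation instead of the full kernel matrix.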
Citations: 18
Learning Spectral Embedding for Semi-supervised Clustering
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.89
Fanhua Shang, Yuanyuan Liu, Fei Wang
In recent years, semi-supervised clustering (SSC) has attracted considerable interest from the machine learning and data mining communities. In this paper, we propose a novel semi-supervised clustering approach with enhanced spectral embedding (ESE) which not only considers the structure information contained in data sets but also makes use of prior side information such as pairwise constraints. Specifically, we first construct a symmetry-favored k-NN graph which is highly robust to noisy objects and can reflect the underlying manifold structure of the data. Then we learn an enhanced spectral embedding towards an ideal representation that is as consistent with the pairwise constraints as possible. Finally, by taking advantage of Laplacian regularization, we formulate learning the spectral representation as semidefinite-quadratic-linear programs (SQLPs) under the squared loss function or small semidefinite programs (SDPs) under the hinge loss function, both of which can be solved efficiently. Experimental results on a variety of synthetic and real-world data sets show that our approach outperforms state-of-the-art SSC algorithms on both vector-based and graph-based clustering.
Citations: 24
SPO: Structure Preserving Oversampling for Imbalanced Time Series Classification
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.137
Hong Cao, Xiaoli Li, D. Woon, See-Kiong Ng
This paper presents a novel structure-preserving oversampling (SPO) technique for classifying imbalanced time series data. SPO generates synthetic minority samples from a multivariate Gaussian distribution by estimating the covariance structure of the minority class and regularizing the unreliable eigenspectrum. By preserving the main covariance structure and intelligently creating protective variances in the trivial eigen feature dimensions, the synthetic samples expand effectively into the void areas of the data space without being tied too closely to existing minority-class samples. Extensive experiments on several public time series datasets demonstrate that the proposed SPO, in conjunction with support vector machines, achieves better performance in time series classification than existing oversampling methods and state-of-the-art methods.
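The core generation step can be sketched as fitting a multivariate Gaussian to the minority class and flooring the small eigenvalues of its covariance before sampling; the exact SPO regularizer differs, and `floor_frac` is an assumed knob, not the paper's parameter.

```python
import numpy as np

def gaussian_oversample(minority, n_new, floor_frac=0.05, seed=0):
    """Synthetic minority samples from a fitted multivariate Gaussian.

    The eigenspectrum of the covariance is regularised by raising tiny
    (unreliable) eigenvalues to a floor, echoing the paper's protective
    variances in the trivial eigen dimensions.  A simplified sketch,
    not SPO's exact regulariser.
    """
    X = np.asarray(minority, dtype=float)
    mu = X.mean(axis=0)
    cov = np.cov(X, rowvar=False)
    vals, vecs = np.linalg.eigh(cov)
    vals = np.maximum(vals, floor_frac * vals.max())   # floor weak directions
    cov_reg = (vecs * vals) @ vecs.T                   # V diag(vals) V^T
    rng = np.random.default_rng(seed)
    return rng.multivariate_normal(mu, cov_reg, size=n_new)

# Minority class lies almost on a line; plain resampling would stay on it.
minority = np.array([[0.0, 0.0], [1.0, 0.1], [2.0, -0.1], [3.0, 0.0]])
synth = gaussian_oversample(minority, n_new=100)
```

The floored eigenvalues are what let the synthetic samples spread off the degenerate directions instead of collapsing onto the few observed minority points.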
Citations: 46
Text Clustering via Constrained Nonnegative Matrix Factorization
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.143
Yan Zhu, L. Jing, Jian Yu
Semi-supervised nonnegative matrix factorization (NMF) has received increasing attention in the text mining field. Semi-supervised NMF methods can be divided into two types: one is based on explicit category labels, the other on pairwise constraints, namely must-link and cannot-link. As category labels are hard to obtain in some tasks, the latter type is more widely used in real applications. To date, all constrained NMF methods treat must-link and cannot-link constraints in the same way. However, these two kinds of constraints play different roles in NMF clustering. Thus a novel constrained NMF method is proposed in this paper. In the new method, must-link constraints are used to control the distance between data points in the compressed form, and cannot-link constraints are used to control the encoding factor. Experimental results on real-world text data sets show the good performance of the proposed method.
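For reference, the unconstrained baseline the method extends is standard NMF with Lee-Seung multiplicative updates; the must-link and cannot-link penalty terms that constitute the paper's contribution are omitted in this sketch.

```python
import numpy as np

def nmf(V, k, iters=200, seed=0):
    """Plain NMF, V ~ W H with W, H >= 0, via Lee-Seung multiplicative
    updates.  This is the unconstrained baseline only: the paper's
    must-link / cannot-link penalty terms are not included.
    """
    rng = np.random.default_rng(seed)
    n, m = V.shape
    W = rng.random((n, k)) + 1e-3
    H = rng.random((k, m)) + 1e-3
    for _ in range(iters):
        H *= (W.T @ V) / (W.T @ W @ H + 1e-12)   # update encodings
        W *= (V @ H.T) / (W @ H @ H.T + 1e-12)   # update basis
    return W, H

V = np.random.default_rng(1).random((6, 4))      # toy term-document matrix
W, H = nmf(V, k=2)
err = float(np.linalg.norm(V - W @ H) / np.linalg.norm(V))
```

In the paper's formulation, must-link penalties would be added to the objective acting on distances in the compressed representation, and cannot-link penalties on the encoding factor H.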
Citations: 11
Secure Clustering in Private Networks
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.127
Bin Yang, Issei Sato, Hiroshi Nakagawa
Many clustering methods have been proposed for analyzing the relations inside networks with complex structures. Some of them can detect a mixture of assortative and disassortative structures in networks. All these methods are based on the fact that the entire network is observable. However, in the real world, the entities in networks, for example a social network, may be private, and thus, cannot be observed. We focus on private peer-to-peer networks in which all vertices are independent and private, and each vertex only knows about itself and its neighbors. We propose a privacy-preserving Gibbs sampling for clustering these types of private networks and detecting their mixed structures without revealing any private information about any individual entity. Moreover, the running cost of our method is related only to the number of clusters and the maximum degree, but is nearly independent of the number of vertices in the entire network.
为了分析具有复杂结构的网络内部关系,人们提出了许多聚类方法。其中一些可以检测到网络中分类和非分类结构的混合。所有这些方法都是基于整个网络是可观察的这一事实。然而,在现实世界中,网络中的实体(例如社交网络)可能是私有的,因此无法被观察到。我们专注于私有点对点网络,其中所有顶点都是独立和私有的,每个顶点只知道自己和它的邻居。我们提出了一种保护隐私的吉布斯抽样方法,用于聚类这些类型的私有网络,并在不泄露任何单个实体的任何私有信息的情况下检测它们的混合结构。此外,我们的方法的运行成本只与簇的数量和最大度有关,而与整个网络中的顶点数量几乎无关。
Citations: 3
A Fast and Flexible Clustering Algorithm Using Binary Discretization
Pub Date: 2011-12-11 DOI: 10.1109/ICDM.2011.9
M. Sugiyama, Akihiro Yamamoto
We present in this paper a new clustering algorithm for multivariate data. This algorithm, called BOOL (Binary coding Oriented clustering), can detect arbitrarily shaped clusters and is noise tolerant. BOOL handles data in a two-step procedure: data points are first discretized and represented as binary words; clusters are then iteratively constructed by agglomerating smaller clusters using this representation. The latter step is carried out with linear complexity by sorting the binary representations, which results in dramatic speedups compared with other techniques. Experiments show that BOOL is faster than K-means, and about two to three orders of magnitude faster than two state-of-the-art algorithms that can detect non-convex clusters of arbitrary shapes. We also show that BOOL's results are robust to changes in parameters, whereas most algorithms for arbitrarily shaped clusters are known to be overly sensitive to such changes. The key to BOOL's robustness is the hierarchical structure of clusters that is introduced automatically by increasing the accuracy of the discretization.
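The discretize-then-agglomerate idea can be illustrated at a single, fixed resolution: map each point to a grid cell (the analogue of its binary word) and merge occupied cells that touch. BOOL's actual hierarchical refinement and sorting-based linear-time merge are omitted; names and the union-find merge are our own choices.

```python
import numpy as np

def grid_cluster(points, cell=1.0):
    """Cluster by discretising points into grid cells and merging occupied
    cells that share a face.  Returns one integer label per point.
    (`cell` plays the role of BOOL's discretisation accuracy.)
    """
    P = np.asarray(points, dtype=float)
    codes = [tuple(c) for c in np.floor(P / cell).astype(int)]
    occupied = set(codes)
    parent = {c: c for c in occupied}          # union-find over cells

    def find(c):
        while parent[c] != c:
            parent[c] = parent[parent[c]]      # path halving
            c = parent[c]
        return c

    for c in occupied:                         # union face-adjacent cells
        for dim in range(P.shape[1]):
            nb = list(c)
            nb[dim] += 1
            nb = tuple(nb)
            if nb in occupied:
                parent[find(nb)] = find(c)

    roots, labels = {}, []
    for c in codes:
        labels.append(roots.setdefault(find(c), len(roots)))
    return labels

pts = [[0.1, 0.1], [0.9, 0.2], [1.4, 0.3], [5.0, 5.0], [5.2, 5.1]]
print(grid_cluster(pts))   # → [0, 0, 0, 1, 1]
```

Chains of adjacent occupied cells merge into one cluster regardless of shape, which is how this family of methods captures non-convex clusters.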
Citations: 9
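The BOOL abstract above describes a two-step procedure: discretize each point into a binary word at a chosen precision, then build clusters by agglomerating touching cells. A minimal sketch of that idea, using a hash-based union-find over occupied grid cells instead of the paper's sort-based merge (all function names and the `bits` parameter are illustrative, not from the paper):

```python
import numpy as np
from itertools import product

def discretize(points, bits):
    # Scale each feature to [0, 2**bits) and truncate to an integer
    # grid cell; the tuple of cell indices plays the role of the
    # point's binary word.
    pts = np.asarray(points, dtype=float)
    lo, hi = pts.min(axis=0), pts.max(axis=0)
    span = np.where(hi > lo, hi - lo, 1.0)
    cells = ((pts - lo) / span * (2 ** bits)).astype(int)
    return np.minimum(cells, 2 ** bits - 1)

def grid_clusters(points, bits=3):
    # Merge occupied cells that touch (Chebyshev distance <= 1) with a
    # union-find; each connected block of cells becomes one cluster.
    cells = discretize(points, bits)
    keys = [tuple(c) for c in cells]
    parent = {k: k for k in keys}

    def find(k):
        while parent[k] != k:
            parent[k] = parent[parent[k]]  # path halving
            k = parent[k]
        return k

    occupied = set(parent)
    for k in occupied:
        for delta in product((-1, 0, 1), repeat=len(k)):
            nb = tuple(a + d for a, d in zip(k, delta))
            if nb in occupied:
                parent[find(nb)] = find(k)

    # Relabel roots as consecutive integers in first-seen order.
    seen, labels = {}, []
    for k in keys:
        labels.append(seen.setdefault(find(k), len(seen)))
    return labels
```

Raising `bits` refines the grid, which mirrors the knob the abstract credits for BOOL's automatically introduced cluster hierarchy.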
Web Horror Image Recognition Based on Context-Aware Multi-instance Learning
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.155
Bing Li, Weihua Xiong, Weiming Hu
Along with the ever-growing Web, horror content shared on the Internet has interfered with our daily life and affected our health, especially children's. Horror image recognition is therefore becoming more important for filtering objectionable web content. This paper presents a novel context-aware multi-instance learning (CMIL) model for this task. This work makes three key contributions. Firstly, traditional multi-instance learning is extended to a context-aware multi-instance learning model by integrating into each bag an undirected graph that represents contextual relationships among instances. Secondly, by introducing a novel energy function, a heuristic optimization algorithm based on the Fuzzy Support Vector Machine (FSVM) is given to find the optimal classifier for CMIL. Finally, CMIL is applied to recognize horror images. Experimental results on an image set collected from the Internet show that the proposed method is effective for horror image recognition.
Citations: 14
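The CMIL abstract above couples the instances inside each bag through an undirected context graph. A toy sketch of that coupling under the standard multi-instance assumption (a bag is positive if any one instance is), with simple neighbour averaging standing in for the paper's FSVM-based energy minimization — the function names, `alpha`, and `threshold` are all illustrative:

```python
import numpy as np

def smooth_with_context(scores, adjacency, alpha=0.5):
    # Blend each instance score with the mean score of its graph
    # neighbours; alpha controls how strongly context pulls a score.
    s = np.asarray(scores, dtype=float)
    A = np.asarray(adjacency, dtype=float)
    deg = A.sum(axis=1)
    deg[deg == 0] = 1.0  # avoid division by zero for isolated nodes
    neigh_mean = (A @ s) / deg
    return (1 - alpha) * s + alpha * neigh_mean

def classify_bag(scores, adjacency, threshold=0.0, alpha=0.5):
    # Standard MIL decision rule: the bag is positive when its best
    # context-smoothed instance score clears the threshold.
    smoothed = smooth_with_context(scores, adjacency, alpha)
    return bool(smoothed.max() > threshold)
```

A lone high-scoring instance surrounded by low-scoring neighbours is damped by the smoothing, which is the qualitative effect the contextual graph is meant to have.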
Journal
2011 IEEE 11th International Conference on Data Mining