
Latest publications: 2009 Ninth IEEE International Conference on Data Mining

iTopicModel: Information Network-Integrated Topic Modeling
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.43
Yizhou Sun, Jiawei Han, Jing Gao, Yintao Yu
Document networks, i.e., networks associated with text information, are becoming increasingly popular due to the ubiquity of Web documents, blogs, and various kinds of online data. In this paper, we propose a novel topic modeling framework for document networks, which builds a unified generative topic model able to consider both the text and the structure information of documents. A graphical model is proposed to describe the generative process. On the top layer of this graphical model, we define a novel multivariate Markov Random Field over the topic distribution random variables of the documents, to model the dependency relationships among documents induced by the network structure. On the bottom layer, we follow the traditional topic model to model the generation of text for each document. A joint distribution function for both the text and the structure of the documents is thus provided. A solution to estimate this topic model by maximizing the log-likelihood of the joint probability is given. Some important practical issues in real applications are also discussed, including how to choose the number of topics and how to choose a good network structure. We apply the model to two real datasets, DBLP and Cora, and the experiments show that it is more effective than state-of-the-art topic modeling algorithms.
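The abstract gives the shape of the objective but no formulas. The toy sketch below combines a text term and a network term into one joint score; all functional forms are assumed for illustration (a multinomial text likelihood, and a quadratic pairwise penalty standing in for the MRF energy), and `lam` is an illustrative trade-off weight, not a parameter from the paper:

```python
import numpy as np

def joint_log_likelihood(theta, phi, doc_words, edges, lam=1.0):
    """theta: (D, K) per-document topic mixtures; phi: (K, V) topic-word
    distributions; doc_words: word-id lists per document; edges: links."""
    ll_text = 0.0
    for d, words in enumerate(doc_words):
        word_probs = theta[d] @ phi              # mixture over the vocabulary
        ll_text += float(np.sum(np.log(word_probs[words])))
    # network term: penalize disagreement between linked documents
    ll_net = -lam * sum(float(np.sum((theta[i] - theta[j]) ** 2))
                        for i, j in edges)
    return ll_text + ll_net

phi = np.array([[0.9, 0.1], [0.1, 0.9]])         # two topics, two words
doc_words, edges = [[0], [0]], [(0, 1)]
agree = joint_log_likelihood(np.array([[0.5, 0.5], [0.5, 0.5]]),
                             phi, doc_words, edges)
disagree = joint_log_likelihood(np.array([[0.5, 0.5], [0.1, 0.9]]),
                                phi, doc_words, edges)
```

Linked documents with matching topic mixtures score higher than linked documents that disagree, which is the dependency the top-layer MRF is meant to capture.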
Citations: 135
ν-Anomica: A Fast Support Vector Based Novelty Detection Technique
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.42
Santanu Das, Kanishka Bhaduri, N. Oza, A. Srivastava
In this paper we propose ν-Anomica, a novel anomaly detection technique that can be trained on huge data sets with much reduced running time compared to the benchmark one-class Support Vector Machines algorithm. In ν-Anomica, the idea is to train the machine such that it provides a close approximation to the exact decision plane using fewer training points, without losing much of the generalization performance of the classical approach. We have tested the proposed algorithm on a variety of continuous data sets under different conditions. We show that under all test conditions the developed procedure closely preserves the accuracy of standard one-class Support Vector Machines while reducing both the training time and the test time by a factor of 5 to 20.
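ν-Anomica itself is not publicly packaged, but the benchmark it is compared against can be exercised in a few lines using scikit-learn's `OneClassSVM`. This only illustrates the baseline and the role of the ν parameter (an upper bound on the fraction of training points treated as outliers); it is not the authors' method:

```python
import numpy as np
from sklearn.svm import OneClassSVM

rng = np.random.default_rng(0)
train = rng.normal(0.0, 1.0, size=(500, 2))    # normal operating data only

# nu upper-bounds the fraction of training points treated as outliers
clf = OneClassSVM(nu=0.05, kernel="rbf", gamma="scale").fit(train)

inlier_rate = float((clf.predict(train) == 1).mean())
novel = int(clf.predict([[8.0, 8.0]])[0])      # a point far from the data
```

A point far from the training distribution is flagged as −1 (novel), while most training points are kept as inliers; it is this training step whose cost ν-Anomica reduces.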
Citations: 7
A Walk from 2-Norm SVM to 1-Norm SVM
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.100
J. Kujala, T. Aho, Tapio Elomaa
This paper studies how useful the standard 2-norm regularized SVM is in approximating the 1-norm SVM problem. To this end, we examine a general method that is based on iteratively re-weighting the features and solving a 2-norm optimization problem. The convergence rate of this method is unknown, and previous work indicates that it might require an excessive number of iterations. We study how well we can do with just a small number of iterations. In theory the convergence rate is fast, except for coordinates of the current solution that are close to zero. Our empirical experiments confirm this. In many problems with irrelevant features, a single iteration is often enough to produce accuracy as good as or better than that of the 1-norm SVM. Hence, it seems that in these problems we do not need to converge to the 1-norm SVM solution near zero values. The benefit of this approach is that we can build something similar to a 1-norm regularized solver on top of any 2-norm regularized solver. This is quick to implement, and the solution inherits the good qualities of the underlying solver, such as scalability and stability.
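The re-weighting idea can be sketched numerically. The fragment below substitutes a least-squares loss for the SVM loss (purely so the inner 2-norm problem has a closed form; the paper re-weights features inside an actual 2-norm SVM), with `lam`, `iters`, and `eps` as illustrative values. It shows irrelevant coefficients being driven toward zero within a few iterations:

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = 2.0 * X[:, 0]                       # only feature 0 is relevant

def reweighted_l2(X, y, lam=5.0, iters=5, eps=1e-6):
    """Approximate a 1-norm penalty by repeatedly solving a 2-norm
    problem whose per-feature penalty is lam / (|w_i| + eps)."""
    w = np.ones(X.shape[1])
    for _ in range(iters):
        D = np.diag(lam / (np.abs(w) + eps))
        w = np.linalg.solve(X.T @ X + D, X.T @ y)
    return w

w = reweighted_l2(X, y)                 # w[0] stays large, the rest collapse
```

Each pass shrinks the penalty on large coordinates and inflates it on small ones, so near-zero coordinates, exactly the ones the abstract flags as slow to converge, collapse quickly in practice.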
Citations: 19
A Local Scalable Distributed Expectation Maximization Algorithm for Large Peer-to-Peer Networks
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.45
Kanishka Bhaduri, A. Srivastava
This paper describes a local and distributed expectation maximization algorithm for learning the parameters of Gaussian mixture models (GMM) in large peer-to-peer (P2P) environments. The algorithm can be used for a variety of well-known data mining tasks in distributed environments, such as clustering, anomaly detection, target tracking, and density estimation, all of which are necessary for many emerging P2P applications in bioinformatics, web mining, and sensor networks. Centralizing all or some of the data to build global models is impractical in such P2P environments because of the large number of data sources, the asynchronous nature of the P2P networks, and the dynamic nature of the data and the network. The proposed algorithm takes a two-step approach. In the monitoring phase, the algorithm checks whether the model 'quality' is acceptable by using an efficient local algorithm. This check is then used as a feedback loop to sample data from the network and rebuild the GMM when it becomes outdated. We present thorough experimental results to verify our theoretical claims.
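The paper's contribution is the P2P monitoring and rebuilding protocol, which a short sketch cannot reproduce. The fragment below only shows the underlying (centralized) EM procedure for a 1-D Gaussian mixture, i.e., the model the peers are collectively estimating; the percentile initialization and iteration count are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(1)
data = np.concatenate([rng.normal(0.0, 1.0, 200), rng.normal(10.0, 1.0, 200)])

def em_gmm(x, k=2, iters=50):
    mu = np.percentile(x, np.linspace(10, 90, k))   # spread-out init
    var = np.full(k, x.var())
    pi = np.full(k, 1.0 / k)
    for _ in range(iters):
        # E-step: responsibility of each component for each point
        dens = (pi * np.exp(-0.5 * (x[:, None] - mu) ** 2 / var)
                / np.sqrt(2.0 * np.pi * var))
        r = dens / dens.sum(axis=1, keepdims=True)
        # M-step: re-estimate the mixture parameters
        nk = r.sum(axis=0)
        mu = (r * x[:, None]).sum(axis=0) / nk
        var = (r * (x[:, None] - mu) ** 2).sum(axis=0) / nk
        pi = nk / len(x)
    return mu, var, pi

mu, var, pi = em_gmm(data)
```

On this well-separated two-component mixture, EM recovers component means close to 0 and 10; the distributed algorithm's job is to reach such estimates without shipping the data to one site.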
Citations: 15
On K-Means Cluster Preservation Using Quantization Schemes
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.12
D. Turaga, M. Vlachos, O. Verscheure
This work examines under what conditions compression methodologies can retain the outcome of clustering operations. We focus on the popular k-Means clustering algorithm and demonstrate how a properly constructed compression scheme based on post-clustering quantization is capable of maintaining the global cluster structure. Our analytical derivations indicate that a 1-bit moment-preserving quantizer per cluster is sufficient to retain the original data clusters. Merits of the proposed compression technique include: a) reduced storage requirements with clustering guarantees, b) data privacy for the original values, and c) shape preservation for data visualization purposes. We evaluate the quantization scheme on various high-dimensional datasets, including 1-dimensional and 2-dimensional time-series (shape datasets), and demonstrate the cluster preservation property. We also compare with previously proposed simplification techniques in the time-series area and show significant improvements in both the clustering and the shape preservation of the compressed datasets.
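A 1-bit moment-preserving quantizer has a classic closed form, sketched per dimension below (the paper's exact construction may differ): split the cluster at its mean and place the two levels so that the first two moments survive exactly:

```python
import numpy as np

def one_bit_mpq(x):
    """Two-level quantization of a 1-D cluster that exactly preserves
    its mean and (population) variance. Assumes x is not constant."""
    mu, sigma = x.mean(), x.std()
    above = x > mu
    p = above.mean()                          # fraction above the mean
    hi = mu + sigma * np.sqrt((1 - p) / p)
    lo = mu - sigma * np.sqrt(p / (1 - p))
    return np.where(above, hi, lo)

cluster = np.random.default_rng(2).normal(5.0, 2.0, 1000)
q = one_bit_mpq(cluster)
```

A quick check confirms the moment preservation: `p*hi + (1-p)*lo = mu` and `p*(hi-mu)^2 + (1-p)*(lo-mu)^2 = sigma^2`, so each point costs one bit plus two shared levels per cluster.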
Citations: 12
Efficient Algorithm for Computing Link-Based Similarity in Real World Networks
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.136
Yuanzhe Cai, G. Cong, Xu Jia, Hongyan Liu, Jun He, Jiaheng Lu, Xiaoyong Du
Similarity calculation has many applications, such as information retrieval and collaborative filtering, among many others. It has been shown that link-based similarity measures, such as SimRank, are very effective in characterizing object similarities in networks, such as the Web, by exploiting the object-to-object relationships. Unfortunately, it is prohibitively expensive to compute link-based similarity in a relatively large graph. In this paper, based on the observation that the link-based similarity scores of real-world graphs follow a power-law distribution, we propose a new approximate algorithm, namely Power-SimRank, with a guaranteed error bound, to efficiently compute the link-based similarity measure. We also prove the convergence of the proposed algorithm. Extensive experiments conducted on real-world and synthetic datasets show that the proposed algorithm outperforms SimRank by four to five times in terms of efficiency, while the error introduced by the approximation is small.
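Power-SimRank itself is not spelled out in the abstract; as a reference point, plain SimRank's fixed-point iteration, which the paper approximates, fits in a few lines. The naive version below costs O(n² · d²) per iteration, which is exactly the expense the paper attacks (`c` is the usual decay factor; graph and values here are a made-up toy):

```python
import numpy as np

def simrank(in_nbrs, c=0.8, iters=10):
    """Plain SimRank on a small directed graph; in_nbrs[v] lists the
    in-neighbors of node v."""
    n = len(in_nbrs)
    s = np.identity(n)
    for _ in range(iters):
        new = np.identity(n)
        for a in range(n):
            for b in range(n):
                if a == b or not in_nbrs[a] or not in_nbrs[b]:
                    continue
                total = sum(s[i, j] for i in in_nbrs[a] for j in in_nbrs[b])
                new[a, b] = c * total / (len(in_nbrs[a]) * len(in_nbrs[b]))
        s = new
    return s

# Nodes 0 and 1 are both pointed to by node 2, so they become similar.
s = simrank([[2], [2], []])
```

With a single shared in-neighbor, s(0, 1) converges to c = 0.8, illustrating "two objects are similar if they are referenced by similar objects."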
Citations: 22
GSML: A Unified Framework for Sparse Metric Learning
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.22
Kaizhu Huang, Yiming Ying, C. Campbell
There has been significant recent interest in sparse metric learning (SML), in which we simultaneously learn both a good distance metric and a low-dimensional representation. Unfortunately, the performance of existing sparse metric learning approaches is usually limited because the authors assumed certain problem relaxations or targeted the SML objective only indirectly. In this paper, we propose a Generalized Sparse Metric Learning method (GSML). This novel framework offers a unified view for understanding many of the popular sparse metric learning algorithms, including the previously proposed Sparse Metric Learning framework, the Large Margin Nearest Neighbor (LMNN) method, and the D-ranking Vector Machine (D-ranking VM). Moreover, GSML also establishes a close relationship with the Pairwise Support Vector Machine. Furthermore, the proposed framework is capable of extending many current non-sparse metric learning models, such as Relevant Component Analysis (RCA) and other state-of-the-art methods, into their sparse versions. We present the detailed framework, provide theoretical justifications, build various connections with other models, and propose a practical iterative optimization method, making the framework both theoretically important and practically scalable for medium or large datasets. A series of experiments show that the proposed approach can outperform previous methods in terms of both test accuracy and dimension reduction, on six real-world benchmark datasets.
Citations: 43
On the (In)Security and (Im)Practicality of Outsourcing Precise Association Rule Mining
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.122
Ian Molloy, Ninghui Li, Tiancheng Li
The recent interest in outsourcing IT services onto the cloud raises two main concerns: security and cost. One task that could be outsourced is data mining. In VLDB 2007, Wong et al. proposed an approach for outsourcing association rule mining. Their approach maps a set of real items into a set of pseudo items, then maps each transaction non-deterministically. This paper analyzes both the security of and the costs associated with outsourcing association rule mining. We show how to break the encoding scheme of Wong et al. without using context-specific information, reducing the security to that of a one-to-one mapping. We present a stricter notion of security than that used by Wong et al., and then consider the practicality of outsourcing association rule mining. Our results indicate that outsourcing association rule mining may not be practical if the data owner is concerned with data confidentiality.
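The style of attack the authors build on is easy to demonstrate. The sketch below is not the paper's actual attack on Wong et al.'s one-to-many scheme; it only shows why, once security is reduced to a one-to-one item mapping, an adversary with approximate knowledge of item frequencies recovers the mapping by rank matching (transactions and item names are made up):

```python
from collections import Counter

# Plaintext transactions and a secret 1-1 substitution of item names
original = [["beer", "chips"], ["beer"], ["beer"], ["chips"], ["salsa"]]
secret = {"beer": "x1", "chips": "x2", "salsa": "x3"}
encoded = [[secret[i] for i in t] for t in original]

def frequency_attack(encoded_db, known_freqs):
    """Match encoded items to real items by support rank."""
    enc_rank = [i for i, _ in
                Counter(i for t in encoded_db for i in t).most_common()]
    known_rank = [i for i, _ in known_freqs.most_common()]
    return dict(zip(enc_rank, known_rank))

background = Counter(i for t in original for i in t)  # attacker's side knowledge
recovered = frequency_attack(encoded, background)
```

When item supports are distinct, the substitution is recovered exactly; this is the baseline any outsourcing encoding must beat, and the tension the paper's stricter security notion formalizes.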
Citations: 58
CoFKM: A Centralized Method for Multiple-View Clustering
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.138
G. Cleuziou, M. Exbrayat, Lionel Martin, Jacques-Henri Sublemontier
This paper deals with clustering for multi-view data, i.e., objects described by several sets of variables or proximity matrices. Many important domains and applications, such as information retrieval, biology, chemistry, and marketing, are concerned with this problem. The aim of this data mining research field is to search for clustering patterns that form a consensus between the patterns from different views. This requires merging information from each view through a fusion process that identifies the agreement between the views and resolves the conflicts. Various fusion strategies can be applied, occurring either before, after, or during the clustering process. We draw our inspiration from the existing algorithms based on a centralized strategy. We propose a fuzzy clustering approach that generalizes the three fusion strategies and outperforms the main existing multi-view clustering algorithms on both synthetic and real datasets.
Citations: 92
Constraint-Based Pattern Mining in Dynamic Graphs
Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.99
C. Robardet
Dynamic graphs are used to represent relationships between entities that evolve over time. Meaningful patterns in such structured data must capture strong interactions and their evolution over time. In social networks, such patterns can be seen as dynamic community structures, i.e., sets of individuals who strongly and repeatedly interact. In this paper, we propose a constraint-based mining approach to uncover evolving patterns. We propose to mine dense and isolated subgraphs defined by two user-parameterized constraints. The temporal evolution of such patterns is captured by associating a temporal event type to each identified subgraph. We consider five basic temporal events: The formation, dissolution, growth, diminution and stability of subgraphs from one time stamp to the next. We propose an algorithm that finds such subgraphs in a time series of graphs processed incrementally. The extraction is feasible due to efficient patterns and data pruning strategies. We demonstrate the applicability of our method on several real-world dynamic graphs and extract meaningful evolving communities.
{"title":"Constraint-Based Pattern Mining in Dynamic Graphs","authors":"C. Robardet","doi":"10.1109/ICDM.2009.99","DOIUrl":"https://doi.org/10.1109/ICDM.2009.99","url":null,"abstract":"Dynamic graphs are used to represent relationships between entities that evolve over time. Meaningful patterns in such structured data must capture strong interactions and their evolution over time. In social networks, such patterns can be seen as dynamic community structures, i.e., sets of individuals who strongly and repeatedly interact. In this paper, we propose a constraint-based mining approach to uncover evolving patterns. We propose to mine dense and isolated subgraphs defined by two user-parameterized constraints. The temporal evolution of such patterns is captured by associating a temporal event type to each identified subgraph. We consider five basic temporal events: The formation, dissolution, growth, diminution and stability of subgraphs from one time stamp to the next. We propose an algorithm that finds such subgraphs in a time series of graphs processed incrementally. The extraction is feasible due to efficient patterns and data pruning strategies. We demonstrate the applicability of our method on several real-world dynamic graphs and extract meaningful evolving communities.","PeriodicalId":247645,"journal":{"name":"2009 Ninth IEEE International Conference on Data Mining","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2009-12-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"116517813","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 58
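The abstract above associates one of five temporal event types with each dense subgraph as it moves from one time stamp to the next: formation, dissolution, growth, diminution, or stability. A hedged sketch of that classification step, given two vertex sets for a matched pattern (the `temporal_event` name and the Jaccard-overlap matching threshold are my own illustrative choices, not Robardet's exact criterion):

```python
def temporal_event(prev, curr, overlap=0.5):
    """Classify how a pattern evolves between two consecutive snapshots,
    following the five event types named in the abstract. prev/curr are
    vertex sets; None means the pattern is absent at that time stamp."""
    if prev is None:
        return "formation"
    if curr is None:
        return "dissolution"
    jaccard = len(prev & curr) / len(prev | curr)
    if jaccard < overlap:
        return "dissolution"   # too little overlap to count as the same pattern
    if curr > prev:            # strict superset: the pattern gained vertices
        return "growth"
    if curr < prev:            # strict subset: the pattern lost vertices
        return "diminution"
    return "stability"         # same (or comparably overlapping) vertex set

events = [
    temporal_event(None, {1, 2, 3}),        # formation
    temporal_event({1, 2, 3}, {1, 2, 3, 4}),  # growth
    temporal_event({1, 2, 3, 4}, {1, 2, 3}),  # diminution
    temporal_event({1, 2, 3}, {1, 2, 3}),     # stability
    temporal_event({1, 2, 3}, None),          # dissolution
]
```

Running this per matched pattern over the incrementally processed series of snapshots yields the event sequence that the paper uses to describe evolving communities.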