2011 IEEE 11th International Conference on Data Mining: Latest Articles


Constraint Selection-Based Semi-supervised Feature Selection

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.42

Mohammed Hindawi, Kais Allab, K. Benabdeslem

In this paper, we present a novel feature selection approach based on an efficient selection of pairwise constraints. The approach selects the most coherent constraints extracted from the labeled part of the data. The relevance of each feature is then evaluated according to its ability to preserve both locality and the chosen constraints. Finally, experimental results validate our proposal against other known feature selection methods.


Citations: 22

ASAP: A Self-Adaptive Prediction System for Instant Cloud Resource Demand Provisioning

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.25

Yexi Jiang, Chang-Shing Perng, Tao Li, Rong N. Chang

The promise of cloud computing is to provide computing resources instantly whenever they are needed. State-of-the-art virtual machine (VM) provisioning technology can provision a VM in tens of minutes. This latency is unacceptable for jobs that need to scale out during computation. To truly enable on-the-fly scaling, a new VM needs to be ready within seconds of a request. In this paper, we present an online temporal data mining system called ASAP to model and predict cloud VM demands. ASAP aims to extract high-level characteristics from the VM provisioning request stream and to notify the provisioning system to prepare VMs in advance. To address the quantification issue, we propose Cloud Prediction Cost, which encodes the cost and constraints of the cloud and guides the training of prediction algorithms. Moreover, we use a two-level ensemble method to capture the characteristics of the highly transient demand time series. Experimental results using historical data from an IBM cloud in operation demonstrate that ASAP significantly improves cloud service quality and makes on-the-fly provisioning possible.


Citations: 83

Privacy Risk in Graph Stream Publishing for Social Network Data

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.120

Nigel Medforth, Ke Wang

To understand how social networks evolve over time, graphs representing the networks need to be published periodically or on demand. The identities of the participants (nodes) must be anonymized to protect the privacy of the individuals and of their relationships (edges) to other members of the social network. We identify a new form of privacy attack, which we name the degree-trail attack. This attack re-identifies the nodes belonging to a target participant from a sequence of published graphs by comparing the degrees of the nodes in the published graphs with the degree evolution of the target. The power of this attack is that the adversary can actively influence the degree of the target individual by interacting with the social network. We show that the adversary can succeed with high probability even if the published graphs are anonymized by the strongest known privacy-preserving techniques in the literature. Moreover, this success neither depends on the distinctiveness of the target nodes nor requires the adversary to behave differently from a normal participant. One of our contributions is a formal method to assess the privacy risk of this type of attack and an empirical study of its severity on real social network data.
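The matching step behind a degree-trail comparison can be illustrated with a minimal sketch. It is not the authors' procedure: it assumes, purely for illustration, that each release is given as a mapping from a pseudonymous node id to its degree and that pseudonyms persist across releases. The function name `degree_trail_matches` and the data layout are hypothetical.

```python
from typing import Dict, List, Set


def degree_trail_matches(
    releases: List[Dict[str, int]],
    target_trail: List[int],
) -> Set[str]:
    """Return the pseudonyms whose degree matches the target's in every release.

    releases[t] maps a pseudonymous node id to its degree in release t
    (assumption: ids are stable across releases). target_trail[t] is the
    degree the adversary knows the target had at release t.
    """
    candidates: Set[str] = set()
    for t, (degrees, target_degree) in enumerate(zip(releases, target_trail)):
        matching = {node for node, d in degrees.items() if d == target_degree}
        # Keep only nodes that have matched in all releases seen so far.
        candidates = matching if t == 0 else candidates & matching
    return candidates


releases = [
    {"a": 2, "b": 2, "c": 3},
    {"a": 3, "b": 2, "c": 3},
    {"a": 4, "b": 2, "c": 4},
]
print(degree_trail_matches(releases, [2, 3, 4]))  # {'a'}
```

In this toy example every single release leaves at least two candidates, yet the intersection over the whole trail leaves exactly one, which is the intuition for why per-release anonymization may not be enough.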


Citations: 27

Document Clustering via Matrix Representation

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.59

Xufei Wang, Jiliang Tang, Huan Liu

The Vector Space Model (VSM) is widely used to represent documents and web pages. It is simple and easy to handle computationally, but it oversimplifies a document into a vector, is susceptible to noise, and cannot explicitly represent the underlying topics of a document. A matrix representation of a document is proposed in this paper: rows represent distinct terms and columns represent cohesive segments. The matrix model views a document as a set of segments, and each segment is a probability distribution over a limited number of latent topics, which can be mapped to clustering structures. The latent topic extraction based on the matrix representation of documents is formulated as a constrained optimization problem in which each matrix (i.e., a document) $A_i$ is factorized into a common base determined by non-negative matrices $L$ and $R^\top$, and a non-negative weight matrix $M_i$, such that the sum of the reconstruction errors over all documents is minimized. Empirical evaluation demonstrates that it is feasible to use the matrix model for document clustering: (1) compared with vector representation, matrix representation improves clustering quality consistently, and the proposed approach achieves a relative accuracy improvement of up to 66% on the studied datasets, and (2) the proposed method outperforms baseline methods such as k-means and NMF, and complements state-of-the-art methods such as LDA and PLSI. Furthermore, the proposed matrix model allows more refined information retrieval at the segment level instead of the document level, which enables the return of more relevant documents in information retrieval tasks.
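Written out, the factorization described above plausibly takes the following form. The abstract does not name the error measure, so the squared Frobenius norm is an assumption here, and any normalization constraints the paper may impose are omitted.

```latex
\min_{L \ge 0,\; R \ge 0,\; \{M_i \ge 0\}} \;\; \sum_{i} \bigl\| A_i - L \, M_i \, R^{\top} \bigr\|_F^2
```

Here $L$ and $R$ are shared by all documents, while each document $A_i$ has its own weight matrix $M_i$.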


Citations: 21

Fast and Robust Graph-based Transductive Learning via Minimum Tree Cut

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.66

Yanming Zhang, Kaizhu Huang, Cheng-Lin Liu

In this paper, we propose an efficient and robust algorithm for graph-based transductive classification. After approximating a graph with a spanning tree, we develop a linear-time algorithm to label the tree such that the cut size of the tree is minimized. This significantly improves on typical graph-based methods, which have either cubic time complexity (for a dense graph) or $O(kn^2)$ time complexity (for a sparse graph, with $k$ denoting the node degree). Furthermore, our method shows great robustness to the graph construction both theoretically and empirically, which overcomes another big problem of traditional graph-based methods. In addition to its good scalability and robustness, the proposed algorithm demonstrates high accuracy. In particular, on a graph with 400,000 nodes (of which 10,000 are labeled) and 10,455,545 edges, our algorithm achieves the highest accuracy of 99.6% but takes less than 10 seconds to label all the unlabeled data.
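Labeling a tree so that the total weight of cut edges is minimal can be done with a standard bottom-up dynamic program, sketched below. This is a generic textbook-style routine, not necessarily the authors' algorithm, and the spanning-tree construction is not shown. The function name, the edge-list input format, and the assumption that nodes are numbered `0..n-1` and form a connected tree are all illustrative choices.

```python
import math
from typing import Dict, List, Tuple


def min_tree_cut_labels(
    n: int,
    edges: List[Tuple[int, int, float]],
    known: Dict[int, int],
    num_classes: int = 2,
) -> List[int]:
    """Label a tree so the total weight of edges joining different labels is minimal."""
    adj: List[List[Tuple[int, float]]] = [[] for _ in range(n)]
    for u, v, w in edges:
        adj[u].append((v, w))
        adj[v].append((u, w))

    # Iterative DFS from node 0: every parent precedes its children in `order`.
    parent, parent_w = [-1] * n, [0.0] * n
    seen, order, stack = [False] * n, [], [0]
    seen[0] = True
    while stack:
        v = stack.pop()
        order.append(v)
        for u, w in adj[v]:
            if not seen[u]:
                seen[u], parent[u], parent_w[u] = True, v, w
                stack.append(u)

    # cost[v][k]: minimal cut weight inside v's subtree if v takes label k.
    cost = [[0.0] * num_classes for _ in range(n)]
    for v, label in known.items():
        for k in range(num_classes):
            if k != label:
                cost[v][k] = math.inf

    # Bottom-up pass: a child either shares the parent's label or pays the edge weight.
    for v in reversed(order):
        p = parent[v]
        if p >= 0:
            best = min(cost[v])
            for k in range(num_classes):
                cost[p][k] += min(cost[v][k], best + parent_w[v])

    # Top-down pass: recover one optimal labeling.
    labels = [0] * n
    for v in order:
        k_best = min(range(num_classes), key=lambda k: cost[v][k])
        p = parent[v]
        if p >= 0 and cost[v][labels[p]] <= cost[v][k_best] + parent_w[v]:
            labels[v] = labels[p]
        else:
            labels[v] = k_best
    return labels


# Path 0-1-2-3-4 with one weak edge (1, 2); the end nodes carry known labels.
path = [(0, 1, 5.0), (1, 2, 1.0), (2, 3, 5.0), (3, 4, 5.0)]
print(min_tree_cut_labels(5, path, {0: 0, 4: 1}))  # [0, 0, 1, 1, 1]
```

Both passes visit each node once and do work proportional to the number of classes, so the running time is linear in the tree size for a fixed number of classes.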


Citations: 22

Multi-instance Metric Learning

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.106

Ye Xu, Wei Ping, A. Campbell

Multi-instance learning, like other machine learning and data mining tasks, requires distance metrics. Although metric learning methods have been studied for many years, metric learners for multi-instance learning remain almost untouched. In this paper, we propose a framework called Multi-Instance MEtric Learning (MIMEL) to learn an appropriate distance under the multi-instance setting. The distance metric between two bags is defined using the Mahalanobis distance function. The problem is formulated by minimizing the KL divergence between two multivariate Gaussians under the constraints of maximizing the between-class bag distance and minimizing the within-class bag distance. To exploit the mechanism of how instances determine bag labels in multi-instance learning, we design a nonparametric density-estimation-based weighting scheme to assign higher "weights" to the instances that are more likely to be positive in positive bags. The weighting scheme itself has a small workload, which adds little extra computing cost to the proposed framework. Moreover, to further boost the classification accuracy, a kernel version of MIMEL is presented. We evaluate MIMEL using not only several typical multi-instance tasks, but also two activity recognition datasets. The experimental results demonstrate that MIMEL achieves better classification accuracy than many state-of-the-art distance-based algorithms or kernel methods for multi-instance learning.
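As a rough illustration of a Mahalanobis-style distance between bags, the sketch below compares the (optionally weighted) mean instances of two bags under a given positive semidefinite matrix `M`. Representing a bag by its weighted mean is an assumption made for this example; the abstract does not spell out how MIMEL aggregates instances, and learning `M` is not shown. The function name is hypothetical.

```python
import numpy as np


def squared_bag_distance(bag_a, bag_b, M, weights_a=None, weights_b=None):
    """Squared Mahalanobis distance between the weighted means of two bags.

    bag_a, bag_b: arrays of shape (num_instances, dim).
    M: (dim, dim) positive semidefinite matrix defining the metric.
    weights_*: optional per-instance weights (uniform when None).
    """
    mean_a = np.average(np.asarray(bag_a, dtype=float), axis=0, weights=weights_a)
    mean_b = np.average(np.asarray(bag_b, dtype=float), axis=0, weights=weights_b)
    diff = mean_a - mean_b
    return float(diff @ np.asarray(M, dtype=float) @ diff)


bag_a = [[0.0, 0.0], [2.0, 0.0]]   # mean (1, 0)
bag_b = [[1.0, 3.0]]               # mean (1, 3)
print(squared_bag_distance(bag_a, bag_b, np.eye(2)))  # 9.0
```

With `M` set to the identity the result reduces to the squared Euclidean distance between bag means; per-instance weights shift each mean toward the instances judged more informative.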


Citations: 27

A Study of Laplacian Spectra of Graph for Subgraph Queries

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.17

Lei Zhu, Qinbao Song

The spectrum of a graph has been widely used in graph mining to extract topological information. Because it is a graph invariant, it has also been employed as a graph characteristic for subgraph isomorphism testing. However, the spectrum cannot be directly applied to a graph and its subgraph, which is a bottleneck for subgraph isomorphism testing. In this paper, we study the Laplacian spectra of a graph and its subgraph, and propose a method that adopts them directly for subgraph queries. In our proposed method, we first encode every vertex and graph by extracting their Laplacian spectra, and generate novel two-step filtering conditions. Then, we follow the filtering-and-verification framework to conduct subgraph queries. Extensive experiments show that, compared with the existing counterpart method, Laplacian spectra used as a graph feature can improve the efficiency of subgraph queries, which indicates that they have considerable potential.
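One well-known way to use Laplacian spectra as a filter is eigenvalue monotonicity: if a query graph is isomorphic to a subgraph of a data graph, then the k-th largest Laplacian eigenvalue of the query is at most the k-th largest of the data graph. The sketch below applies that generic necessary condition. It is not claimed to be the paper's two-step filter, and the function names are hypothetical. Graphs are assumed simple and undirected, given as a vertex count plus an edge list with no duplicates or self-loops.

```python
import numpy as np


def laplacian_spectrum(n, edges):
    """Eigenvalues of the graph Laplacian L = D - A, sorted in descending order."""
    lap = np.zeros((n, n))
    for u, v in edges:
        lap[u, u] += 1.0
        lap[v, v] += 1.0
        lap[u, v] -= 1.0
        lap[v, u] -= 1.0
    return np.sort(np.linalg.eigvalsh(lap))[::-1]


def may_contain(graph_spectrum, query_spectrum, tol=1e-9):
    """Necessary (not sufficient) condition for the query to be a subgraph.

    Returning False safely prunes the data graph; returning True means the
    pair still has to go through exact verification.
    """
    k = len(query_spectrum)
    if k > len(graph_spectrum):
        return False
    return bool(np.all(query_spectrum <= graph_spectrum[:k] + tol))


triangle = laplacian_spectrum(3, [(0, 1), (1, 2), (0, 2)])   # about [3, 3, 0]
path3 = laplacian_spectrum(3, [(0, 1), (1, 2)])              # about [3, 1, 0]
print(may_contain(triangle, path3))  # True: a path on 3 vertices fits in a triangle
print(may_contain(path3, triangle))  # False: pruned without an isomorphism test
```

The condition holds because padding the query with isolated vertices turns it into the data graph with some edges removed, and removing edges can only lower each ordered Laplacian eigenvalue.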


Citations: 2

Entropy-Based Graph Clustering: Application to Biological and Social Networks

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.64

Edward Casey Kenley, Young-Rae Cho

Complex systems have been widely studied to characterize their structural behaviors from a topological perspective. High modularity is one of the recurrent features of real-world complex systems. Various graph clustering algorithms have been applied to identify communities in social networks or modules in biological networks. However, their applicability to real-world systems has been limited by the massive scale and complex connectivity of the networks. In this study, we exploit a novel information-theoretic model for graph clustering. The entropy-based clustering approach finds locally optimal clusters by growing a random seed in a manner that minimizes graph entropy. We design and analyze modifications that further improve its performance. Assigning priority in seed selection and seed growth is well suited to scale-free networks, which are characterized by a hub-oriented structure. Computing seed growth in parallel streams also decomposes an extremely large network efficiently. The experimental results with real biological and social networks show that the entropy-based approach performs better than competing methods in terms of accuracy and efficiency.
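The abstract does not define graph entropy, so the sketch below uses one plausible formulation as an assumption: each vertex contributes the binary entropy of the fraction of its neighbors that lie inside the candidate cluster, and the graph entropy is the sum over all vertices. The function name and the adjacency-dict input are illustrative only.

```python
import math
from typing import Dict, Hashable, Set


def graph_entropy(adj: Dict[Hashable, Set[Hashable]], cluster: Set[Hashable]) -> float:
    """Sum over vertices of the binary entropy of their inside-cluster neighbor fraction."""
    total = 0.0
    for v, neighbors in adj.items():
        if not neighbors:
            continue  # an isolated vertex contributes no uncertainty
        p_in = sum(1 for u in neighbors if u in cluster) / len(neighbors)
        for p in (p_in, 1.0 - p_in):
            if p > 0.0:
                total -= p * math.log2(p)
    return total


# Two triangles {0, 1, 2} and {3, 4, 5} joined by the single edge (2, 3).
adj = {
    0: {1, 2}, 1: {0, 2}, 2: {0, 1, 3},
    3: {2, 4, 5}, 4: {3, 5}, 5: {3, 4},
}
print(graph_entropy(adj, {0, 1}) > graph_entropy(adj, {0, 1, 2}))  # True
```

Under this definition, growing the seed `{0, 1}` by adding vertex 2 lowers the entropy, so a greedy procedure that accepts entropy-reducing additions would complete the first triangle.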


Citations: 31

Finding Communities in Dynamic Social Networks

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.67

Chayant Tantipathananandh, T. Berger-Wolf

Communities are natural structures observed in social networks and are usually characterized as "relatively dense" subsets of nodes. Social networks change over time and so do the underlying community structures. Thus, to truly uncover this structure we must take the temporal aspect of networks into consideration. Previously, we presented a framework for finding dynamic communities using the social cost model and formulated the corresponding optimization problem [33], assuming that partitions of individuals into groups are given in each time step. We also presented heuristics and approximation algorithms for the problem under the same assumption [32]. In general, however, dynamic social networks are represented as a sequence of graph snapshots of the social network, and the assumption that partitions of individuals into groups are given does not hold. In this paper, we extend the social cost model and formulate an optimization problem of finding community structure from a sequence of arbitrary graphs. We propose a semidefinite programming formulation and a heuristic rounding scheme. We show that this method is quite accurate on synthetic data sets and present its results on a real social network.


Citations: 84

Low Rank Metric Learning with Manifold Regularization

2011 IEEE 11th International Conference on Data Mining

Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.95

G. Zhong, Kaizhu Huang, Cheng-Lin Liu

In this paper, we present a semi-supervised method to learn a low-rank Mahalanobis distance function. Based on an approximation to the projection distance from a manifold, we propose a novel parametric manifold regularizer. In contrast to previous approaches, which usually exploit side information only, our proposed method can further take advantage of the intrinsic manifold information in the data. In addition, we focus on learning a low-rank metric directly, which differs from traditional approaches that often enforce the $\ell_1$ norm on the metric. The resulting formulation is convex with respect to the manifold structure and the distance function, respectively. We solve it with an alternating optimization algorithm, which proves effective at finding a satisfactory solution. For efficient implementation, we also present a fast algorithm in which the manifold structure and the distance function are learned independently without alternating minimization. Experimental results on 12 standard UCI data sets demonstrate the advantages of our method.
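For reference, a Mahalanobis distance and a common way to make it low rank can be written as follows. The factorization through a matrix $W$ is a standard parameterization used here for illustration; whether the paper enforces low rank exactly this way is an assumption.

```latex
d_M(x, y) = \sqrt{(x - y)^{\top} M \,(x - y)}, \qquad M \succeq 0 .

\text{If } M = W W^{\top} \text{ with } W \in \mathbb{R}^{d \times r},\; r \ll d, \text{ then }
\operatorname{rank}(M) \le r \quad \text{and} \quad d_M(x, y) = \bigl\| W^{\top} (x - y) \bigr\|_2 .
```

In other words, a rank-$r$ metric is the Euclidean distance after a linear projection onto $r$ dimensions.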


Citations: 29
