
Latest Publications: Seventh IEEE International Conference on Data Mining (ICDM 2007)

An Efficient Spectral Algorithm for Network Community Discovery and Its Applications to Biological and Social Networks
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.72
Jianhua Ruan, Weixiong Zhang
Automatic discovery of community structures in complex networks is a fundamental task in many disciplines, including social science, engineering, and biology. A quantitative measure called modularity (Q) has been proposed to effectively assess the quality of community structures. Several community discovery algorithms have since been developed based on the optimization of Q. However, this optimization problem is NP-hard, and the existing algorithms either have low accuracy or are computationally expensive. In this paper, we present an efficient spectral algorithm for modularity optimization. When tested on a large number of synthetic and real-world networks and compared to the existing algorithms, our method is both efficient and highly accurate. In addition, we have successfully applied our algorithm to detect interesting and meaningful community structures from real-world networks in different domains, including biology, medicine, and social science. Due to space limitations, results of these applications are presented in a complete version of the paper available on our website (http://cse.wustl.edu/~jruan/).
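The modularity Q referenced above is the standard Newman-Girvan quantity. As a minimal sketch (of the measure itself, not the authors' spectral optimizer), Q for a given partition can be computed as the sum over communities of the intra-community edge fraction minus the expected fraction under random wiring:

```python
from collections import defaultdict

def modularity(edges, community):
    """Newman-Girvan modularity Q of an undirected graph partition.

    edges: list of (u, v) pairs; community: dict mapping node -> community id.
    Q = sum over communities c of (e_c / m - (d_c / 2m)^2), where e_c is the
    number of intra-community edges and d_c the summed degree in c.
    """
    m = len(edges)
    intra = defaultdict(int)   # edges with both endpoints inside a community
    degree = defaultdict(int)  # summed node degree per community
    for u, v in edges:
        degree[community[u]] += 1
        degree[community[v]] += 1
        if community[u] == community[v]:
            intra[community[u]] += 1
    return sum(intra[c] / m - (degree[c] / (2 * m)) ** 2 for c in degree)

# Two triangles joined by a single bridge edge: the natural 2-way split
# scores Q = 5/14 (about 0.357); the trivial one-community partition scores 0.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
part = {0: 0, 1: 0, 2: 0, 3: 1, 4: 1, 5: 1}
q = modularity(edges, part)
```

Maximizing this quantity over all partitions is the NP-hard problem the paper's spectral method approximates.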
Citations: 132
Optimal Subsequence Bijection
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.47
Longin Jan Latecki, Qiang Wang, Suzan Köknar-Tezel, V. Megalooikonomou
We consider the problem of elastic matching of sequences of real numbers. Since both a query and a target sequence may be noisy, i.e., contain some outlier elements, it is desirable to exclude the outlier elements from matching in order to obtain robust matching performance. Moreover, in many applications like shape alignment or stereo correspondence, it is also desirable to have a one-to-one and onto correspondence (a bijection) between the remaining elements. We propose an algorithm that determines the optimal subsequence bijection (OSB) of a query and a target sequence. The OSB is computed efficiently since we map the problem's solution to a cheapest path in a DAG (directed acyclic graph). We obtained excellent results on standard benchmark time series datasets. We compared OSB to Dynamic Time Warping (DTW) with and without a warping window. We do not claim that OSB is always superior to DTW. However, our results demonstrate that skipping outlier elements as done by OSB can significantly improve matching results for many real datasets. Moreover, OSB is particularly suitable for partial matching. We applied it to the object recognition problem when only parts of contours are given. We obtained sequences representing shapes by representing object contours as sequences of curvatures.
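To illustrate the cheapest-path idea, here is a deliberately simplified DP sketch of a subsequence bijection with a uniform per-skip penalty (`jump_cost` is an assumed parameter, and the brute-force O(n²m²) loops stand in for the paper's more refined and more efficient DAG formulation):

```python
def osb(a, b, jump_cost=1.0):
    """Simplified optimal subsequence bijection as a cheapest path.

    Matches elements of a to elements of b one-to-one and in order;
    every skipped element in either sequence costs jump_cost, and a
    matched pair costs their absolute difference. Returns the minimum
    total cost over all monotone partial bijections.
    """
    INF = float("inf")
    n, m = len(a), len(b)
    # dp[i][j]: best cost of a partial bijection whose last pair is (a[i], b[j])
    dp = [[INF] * m for _ in range(n)]
    for i in range(n):
        for j in range(m):
            cost = abs(a[i] - b[j])            # node weight: local dissimilarity
            best = jump_cost * (i + j) + cost  # start here, paying for leading skips
            for k in range(i):
                for l in range(j):
                    if dp[k][l] < INF:
                        skipped = (i - k - 1) + (j - l - 1)
                        best = min(best, dp[k][l] + jump_cost * skipped + cost)
            dp[i][j] = best
    # close the path, paying for trailing skips in both sequences
    return min(dp[i][j] + jump_cost * ((n - 1 - i) + (m - 1 - j))
               for i in range(n) for j in range(m))
```

For example, `osb([1, 2, 3], [1, 9, 2, 3])` matches 1, 2, 3 exactly and pays one skip penalty for the outlier 9, giving a total cost of 1.0, whereas a warping-based match would be forced to absorb the outlier.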
Citations: 53
Can the Content of Public News Be Used to Forecast Abnormal Stock Market Behaviour?
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.74
Calum S. Robertson, S. Geva, R. Wolff
A popular theory of markets is that they are efficient: all available information is deemed to provide an accurate valuation of an asset at any time. In this paper, we consider how the content of market-related news articles contributes to such information. Specifically, we mine news articles for terms of interest, and quantify this degree of interest. We then incorporate this measure into traditional models for market index volatility, with a view to forecasting whether the incidence of interesting news is correlated with a shock in the index, and thus whether the information can be captured to value the underlying asset. We illustrate the methodology on stock market indices for the USA, the UK, and Australia.
Citations: 14
Local Probabilistic Models for Link Prediction
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.108
Chao Wang, Venu Satuluri, S. Parthasarathy
One of the core tasks in social network analysis is to predict the formation of links (i.e. various types of relationships) over time. Previous research has generally represented the social network in the form of a graph and has leveraged topological and semantic measures of similarity between two nodes to evaluate the probability of link formation. Here we introduce a novel local probabilistic graphical model method that can scale to large graphs to estimate the joint co-occurrence probability of two nodes. Such a probability measure captures information that is not captured by either topological measures or measures of semantic similarity, which are the dominant measures used for link prediction. We demonstrate the effectiveness of the co-occurrence probability feature by using it both in isolation and in combination with other topological and semantic features for predicting co-authorship collaborations on real datasets.
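For context, the topological similarity measures the abstract contrasts with its co-occurrence probability are typically scores such as common neighbors, Jaccard, and Adamic-Adar. A small sketch of those baselines (not the paper's local probabilistic model) over an adjacency-set representation:

```python
import math

def topo_features(adj, u, v):
    """Classical topological link-prediction scores between nodes u and v.

    adj: dict mapping node -> set of neighbor nodes. These are the
    baseline measures commonly combined with richer features.
    """
    common = adj[u] & adj[v]
    union = adj[u] | adj[v]
    return {
        "common_neighbors": len(common),
        "jaccard": len(common) / len(union) if union else 0.0,
        # Adamic-Adar: rare shared neighbors count for more
        "adamic_adar": sum(1.0 / math.log(len(adj[w]))
                           for w in common if len(adj[w]) > 1),
    }

# Example: nodes 1 and 4 share both of their neighbors (2 and 3)
adj = {1: {2, 3}, 2: {1, 3, 4}, 3: {1, 2, 4}, 4: {2, 3}}
scores = topo_features(adj, 1, 4)
```

In the paper's setting, a score like the joint co-occurrence probability would be appended to such a feature vector before training a link predictor.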
Citations: 330
Efficient Data Sampling in Heterogeneous Peer-to-Peer Networks
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.71
Benjamin Arai, Song Lin, D. Gunopulos
Performing data-mining tasks such as clustering, classification, and prediction on large datasets is an arduous task and, many times, an infeasible one given current hardware limitations. The distributed nature of peer-to-peer databases further complicates this issue by introducing an access overhead cost in addition to the cost of sending individual tuples over the network. We propose a two-level sampling approach for peer-to-peer databases that maximizes sample quality given a user-defined communication budget. Given that individual peers may have varying cardinality, we propose an algorithm for determining the optimal sample rate (the percentage of tuples to sample from a peer) for each peer. We do this by analyzing the variance of individual peers, ultimately minimizing the total variance of the entire sample. By performing local optimization of individual peer sample rates, we maximize the approximation accuracy of the samples. We also offer several techniques for sampling in peer-to-peer databases given various amounts of known and unknown information about the network and its peers.
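A classical instance of a variance-minimizing allocation under a fixed budget is Neyman allocation, where a peer's sample size is proportional to its cardinality times its standard deviation. The sketch below is a plausible simplification of the per-peer rate computation the abstract describes, not the paper's full two-level scheme:

```python
def allocate_samples(sizes, stds, budget):
    """Neyman-style allocation: n_i proportional to N_i * sigma_i.

    sizes: tuple count N_i held by each peer; stds: per-peer standard
    deviation sigma_i; budget: total tuples we may fetch. Minimizes the
    variance of a stratified estimate for a fixed total sample size.
    """
    weights = [n * s for n, s in zip(sizes, stds)]
    total = sum(weights)
    # Cap at peer size: we cannot sample more tuples than a peer holds.
    return [min(n, round(budget * w / total))
            for n, w in zip(sizes, weights)]

# Two equal-sized peers, but the second is 3x noisier: it gets 3x the samples.
plan = allocate_samples([100, 100], [1.0, 3.0], budget=40)
```

The per-peer sample rate is then simply the allocated count divided by the peer's cardinality (10% and 30% in the example above).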
Citations: 10
Social Network Extraction of Academic Researchers
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.30
Jie Tang, Duo Zhang, Limin Yao
This paper addresses the issue of extracting an academic researcher social network. By researcher social network extraction, we aim to find, extract, and fuse the 'semantic'-based profiling information of a researcher from the Web. Previously, social network extraction was often undertaken separately in an ad-hoc fashion. This paper first gives a formalization of the entire problem. Specifically, it identifies the 'relevant documents' from the Web with a classifier. It then proposes a unified approach to researcher profiling using conditional random fields (CRF). It integrates publications from existing bibliography datasets. In the integration, it proposes a constraints-based probabilistic model for name disambiguation. Experimental results on an online system show that the unified approach to researcher profiling significantly outperforms baseline methods based on rule learning or classification. Experimental results also indicate that our method for name disambiguation performs better than the baseline method using unsupervised learning. The methods have been applied to expert finding. Experiments show that the accuracy of expert finding can be significantly improved by using the proposed methods.
Citations: 141
General Averaged Divergence Analysis
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.105
D. Tao, Xuelong Li, Xindong Wu, S. Maybank
Subspace selection is a powerful tool in data mining. An important subspace method is the Fisher-Rao linear discriminant analysis (LDA), which has been successfully applied in many fields such as biometrics, bioinformatics, and multimedia retrieval. However, LDA has a critical drawback: the projection to a subspace tends to merge those classes that are close together in the original feature space. If the separated classes are sampled from Gaussian distributions, all with identical covariance matrices, then LDA maximizes the mean value of the Kullback-Leibler (KL) divergences between the different classes. We generalize this point of view to obtain a framework for choosing a subspace by 1) generalizing the KL divergence to the Bregman divergence and 2) generalizing the arithmetic mean to a general mean. The framework is named the general averaged divergence analysis (GADA). Under this GADA framework, a geometric mean divergence analysis (GMDA) method based on the geometric mean is studied. A large number of experiments based on synthetic data show that our method significantly outperforms LDA and several representative LDA extensions.
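For one-dimensional Gaussians with a shared variance, the KL divergence reduces to (mu1 - mu2)^2 / (2 var). The sketch below contrasts the arithmetic mean of pairwise divergences (what LDA effectively maximizes) with a geometric mean, which is pulled down by the closest class pair and so does not let one well-separated pair mask a merged pair. This is a simplified illustration of the averaging idea, not the paper's GADA/GMDA subspace solver:

```python
import math
from itertools import combinations

def kl_shared_cov(mu1, mu2, var):
    """KL divergence between two 1-D Gaussians with identical variance:
    KL = (mu1 - mu2)^2 / (2 * var)."""
    return (mu1 - mu2) ** 2 / (2 * var)

def mean_divergences(means, var):
    """Arithmetic vs. geometric mean of all pairwise class divergences.

    A single large divergence dominates the arithmetic mean; the
    geometric mean stays small whenever any two classes are close.
    """
    ds = [kl_shared_cov(a, b, var) for a, b in combinations(means, 2)]
    arith = sum(ds) / len(ds)
    geo = math.exp(sum(math.log(d) for d in ds) / len(ds))
    return arith, geo

# Classes at 0 and 1 nearly overlap, the class at 10 is far away:
# the arithmetic mean looks healthy while the geometric mean flags the overlap.
arith, geo = mean_divergences([0.0, 1.0, 10.0], var=1.0)
```

Maximizing a geometric-mean-style objective over projections is, roughly, what makes a GMDA-like criterion reluctant to merge nearby classes.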
Citations: 67
Using Burstiness to Improve Clustering of Topics in News Streams
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.17
Qi He, Kuiyu Chang, Ee-Peng Lim
Specialists who analyze online news have a hard time separating the wheat from the chaff. Moreover, automatic data-mining techniques like clustering of news streams into topical groups can fully recover the underlying true class labels of data if and only if all classes are well separated. In reality, especially for news streams, this is clearly not the case. The question is thus: if we cannot recover the full C classes by clustering, what is the largest K < C clusters we can find that best resemble the K underlying classes? Using the intuition that bursty topics are more likely to correspond to important events that are of interest to analysts, we propose several new bursty vector space models (B-VSM) for representing a news document. B-VSM takes into account the burstiness (across the full corpus and whole duration) of each constituent word in a document at the time of publication. We benchmarked our B-VSM against the classical TFIDF-VSM on the task of clustering a collection of news stream articles with known topic labels. Experimental results show that B-VSM was able to find the burstiest clusters/topics. Further, it also significantly improved the recall and precision for the top K clusters/topics.
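As a toy illustration of burst-aware weighting (the window-ratio score and `bursty_tf` below are assumptions for the sketch, not the paper's B-VSM definitions), one can scale a term's frequency in a document by how much the term's recent corpus rate exceeds its long-run rate at publication time:

```python
def burst_score(counts, t, window=3):
    """Burstiness of a term at time t: its mean count over a trailing
    window of time steps, relative to its long-run mean count.
    Scores > 1 indicate the term is bursting at time t."""
    recent = counts[max(0, t - window + 1): t + 1]
    recent_rate = sum(recent) / len(recent)
    base_rate = sum(counts) / len(counts)
    return recent_rate / base_rate if base_rate else 0.0

def bursty_tf(tf, counts, t, window=3):
    """Scale a document's raw term frequency by the term's burstiness
    at the document's publication time (a B-VSM-style weight, sketched)."""
    return tf * burst_score(counts, t, window)

# A term that is quiet for four days and then spikes: at t=5 its
# recent rate (10/day over a 2-day window) is 2.5x its long-run rate.
daily_counts = [1, 1, 1, 1, 10, 10]
weight = bursty_tf(3, daily_counts, t=5, window=2)
```

Under such a weighting, documents published during a term's burst are drawn together in vector space, which is the intuition behind clustering on bursty features.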
Citations: 58
Computing Correlation Anomaly Scores Using Stochastic Nearest Neighbors
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.12
T. Idé, S. Papadimitriou, M. Vlachos
This paper addresses the task of change analysis of correlated multi-sensor systems. The goal of change analysis is to compute the anomaly score of each sensor when we know that the system has some potential difference from a reference state. Examples include validating the proper performance of various car sensors in the automobile industry. We solve this problem based on a neighborhood preservation principle: if the system is working normally, the neighborhood graph of each sensor is almost invariant against fluctuations of the experimental conditions. Here a neighborhood graph is defined based on the correlation between sensor signals. With the notion of stochastic neighborhood, our method is capable of robustly computing the anomaly score of each sensor under conditions that are hard to detect with other naive methods.
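A deterministic simplification of the neighborhood-preservation idea: build each sensor's k-nearest-neighbor set from absolute correlations, and score a sensor by how much that set changes between reference and test data. The paper replaces the hard top-k set with a stochastic neighborhood, so this sketch only captures the coarse structure:

```python
def knn_sets(corr, k):
    """Top-k neighbor set of each sensor from a correlation matrix,
    using absolute correlation as the coupling strength."""
    n = len(corr)
    return [
        set(sorted((j for j in range(n) if j != i),
                   key=lambda j: -abs(corr[i][j]))[:k])
        for i in range(n)
    ]

def anomaly_scores(corr_ref, corr_test, k=2):
    """Per-sensor change score: 1 minus the overlap fraction of the
    sensor's k nearest neighbors between the reference and test
    correlation matrices. 0 means the neighborhood is preserved."""
    ref = knn_sets(corr_ref, k)
    test = knn_sets(corr_test, k)
    return [1.0 - len(r & t) / k for r, t in zip(ref, test)]

# Reference: sensor 0 couples with 1 and 2. Test: sensor 0 has
# decoupled from them and now couples with sensor 3 instead.
ref = [[1.0, 0.9, 0.8, 0.1],
       [0.9, 1.0, 0.7, 0.2],
       [0.8, 0.7, 1.0, 0.1],
       [0.1, 0.2, 0.1, 1.0]]
tst = [[1.0, 0.1, 0.1, 0.9],
       [0.1, 1.0, 0.7, 0.2],
       [0.1, 0.7, 1.0, 0.1],
       [0.9, 0.2, 0.1, 1.0]]
scores = anomaly_scores(ref, tst, k=2)
```

Sensors whose neighborhoods shifted (0 and 1 above) receive positive scores, while sensors whose correlation structure is preserved score 0.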
Citations: 81
High-Speed Function Approximation
Pub Date: 2007-10-28 DOI: 10.1109/ICDM.2007.107
Biswanath Panda, Mirek Riedewald, J. Gehrke, S. Pope
We address a new learning problem where the goal is to build a predictive model that minimizes prediction time (the time taken to make a prediction) subject to a constraint on model accuracy. Our solution is a generic framework that leverages existing data mining algorithms without requiring any modifications to these algorithms. We show a first application of our framework to a combustion simulation problem. Our experimental evaluation shows significant improvements over existing methods; prediction time typically is improved by a factor between 2 and 6.
Citations: 5