
2011 IEEE 11th International Conference on Data Mining: Latest Papers

A New Markov Model for Clustering Categorical Sequences
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.13
Tengke Xiong, Shengrui Wang, Q. Jiang, J. Huang
Clustering categorical sequences remains an open and challenging task due to the lack of an inherently meaningful measure of pairwise similarity between sequences. Model initialization is an unsolved problem in model-based clustering algorithms for categorical sequences. In this paper, we propose a simple and effective Markov model to approximate the conditional probability distribution (CPD) model, and use it to design a novel two-tier Markov model to represent a sequence cluster. Furthermore, we design a novel divisive hierarchical algorithm for clustering categorical sequences based on the two-tier Markov model. Experimental results on data sets from three different domains demonstrate the promising performance of our models and clustering algorithm.
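As a rough illustration of model-based sequence clustering, the sketch below fits a plain first-order Markov chain per cluster and assigns a query sequence to the cluster under which it is most likely. This is not the paper's two-tier model or its CPD approximation; the function names, the add-alpha smoothing, and the toy data are all assumptions made for the example.

```python
import math
from collections import defaultdict

def fit_markov(sequences, alphabet, alpha=1.0):
    """Estimate first-order transition probabilities with add-alpha smoothing."""
    counts = {a: defaultdict(float) for a in alphabet}
    for seq in sequences:
        for prev, cur in zip(seq, seq[1:]):
            counts[prev][cur] += 1.0
    probs = {}
    for a in alphabet:
        total = sum(counts[a].values()) + alpha * len(alphabet)
        probs[a] = {b: (counts[a][b] + alpha) / total for b in alphabet}
    return probs

def log_likelihood(seq, probs):
    """Sum of log transition probabilities along the sequence."""
    return sum(math.log(probs[p][c]) for p, c in zip(seq, seq[1:]))

# Hypothetical toy clusters: one alternating, one constant.
alphabet = "ab"
cluster_x = fit_markov(["ababab", "abab"], alphabet)
cluster_y = fit_markov(["aaaa", "bbbb"], alphabet)

query = "abab"
models = {"x": cluster_x, "y": cluster_y}
best = max(models, key=lambda name: log_likelihood(query, models[name]))
print(best)
```

Scoring by likelihood under a per-cluster model sidesteps the need for a pairwise similarity measure between sequences, which is the difficulty the abstract points to.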
Citations: 18
Analysis of Textual Variation by Latent Tree Structures
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.24
Teemu Roos, Yuan Zou
We introduce Semstem, a new method for the reconstruction of so-called stemmatic trees, i.e., trees encoding the copying relationships among a set of textual variants. Our method is based on a structural expectation-maximization (structural EM) algorithm. It is the first computer-based method able to estimate general latent tree structures, unlike earlier methods that are usually restricted to bifurcating trees where all the extant texts are placed in the leaf nodes. We present experiments on two well-known benchmark data sets, showing that the new method outperforms the current state of the art in terms of both a numerical score and interpretability.
Citations: 13
Isograph: Neighbourhood Graph Construction Based on Geodesic Distance for Semi-supervised Learning
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.83
Marjan Ghazvininejad, Mostafa Mahdieh, H. Rabiee, P. Roshan, M. Rohban
Semi-supervised learning based on manifolds has been the focus of extensive research in recent years. Convenient neighbourhood graph construction is a key component of a successful semi-supervised classification method. Previous graph construction methods fail when there are pairs of data points that have small Euclidean distance, but are far apart over the manifold. To overcome this problem, we start with an arbitrary neighbourhood graph and iteratively update the edge weights by using the estimates of the geodesic distances between points. Moreover, we provide theoretical bounds on the values of estimated geodesic distances. Experimental results on real-world data show significant improvement compared to the previous graph construction methods.
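To make the gap between Euclidean and geodesic distance concrete, the sketch below estimates geodesic distances as shortest paths over a k-nearest-neighbour graph (the familiar Isomap-style estimate). It is not the paper's iterative edge-weight update; the helper names, the choice k = 2, and the arc-shaped toy data are assumptions for illustration only.

```python
import math

def knn_graph(points, k):
    """Symmetric k-nearest-neighbour graph as a dense distance matrix."""
    n = len(points)
    inf = float("inf")
    dist = [[inf] * n for _ in range(n)]
    for i in range(n):
        dist[i][i] = 0.0
        others = sorted((j for j in range(n) if j != i),
                        key=lambda j: math.dist(points[i], points[j]))
        for j in others[:k]:
            d = math.dist(points[i], points[j])
            dist[i][j] = dist[j][i] = d
    return dist

def geodesic(dist):
    """All-pairs shortest paths (Floyd-Warshall) over the graph."""
    n = len(dist)
    d = [row[:] for row in dist]
    for m in range(n):
        for i in range(n):
            for j in range(n):
                if d[i][m] + d[m][j] < d[i][j]:
                    d[i][j] = d[i][m] + d[m][j]
    return d

# Points on three quarters of a unit circle: the two ends are close in the
# plane but far apart when travelling along the curve.
points = [(math.cos(math.radians(a)), math.sin(math.radians(a)))
          for a in range(0, 271, 30)]
geo = geodesic(knn_graph(points, k=2))
print(math.dist(points[0], points[-1]), geo[0][-1])
```

If the graph contained a direct edge between the two end points, the shortest-path estimate would collapse to the Euclidean distance, which is the failure mode the abstract describes for earlier graph construction methods.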
Citations: 9
Classifying Categorical Data by Rule-Based Neighbors
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.34
Jiabing Wang, Pei Zhang, Guihua Wen, Jia Wei
A new learning algorithm for categorical data, named CRN (Classification by Rule-based Neighbors), is proposed in this paper. CRN is a nonmetric and parameter-free classifier, and can be regarded as a hybrid of rule induction and instance-based learning. Based on a new measure of attribute quality and the separate-and-conquer strategy, CRN learns a collection of feature sets such that for each pair of instances belonging to different classes, there is a feature set on which the two instances disagree. For an unlabeled instance I and a labeled instance J, J is a neighbor of I if and only if they agree on all attributes of a feature set. Then, CRN classifies an unlabeled instance I based on I's neighbors on those learned feature sets. To validate the performance of CRN, CRN is compared with six state-of-the-art classifiers on twenty-four datasets. Experimental results demonstrate that although the underlying idea of CRN is simple, the predictive accuracy of CRN is comparable to or better than that of the state-of-the-art classifiers on most datasets.
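The neighbor definition in the abstract translates almost directly into code. In the sketch below, the feature sets are given by hand rather than learned, and the way neighbor labels are combined (a simple vote count across feature sets) is an assumption, since the abstract does not say how CRN aggregates them. All names and data are hypothetical.

```python
def is_neighbor(inst_i, inst_j, feature_set):
    """J is a neighbor of I iff they agree on every attribute in the set."""
    return all(inst_i[a] == inst_j[a] for a in feature_set)

def classify(unlabeled, labeled, feature_sets):
    """Vote over the labels of all neighbors found on any feature set."""
    votes = {}
    for fs in feature_sets:
        for attrs, label in labeled:
            if is_neighbor(unlabeled, attrs, fs):
                votes[label] = votes.get(label, 0) + 1
    return max(votes, key=votes.get) if votes else None

# Hypothetical training data and hand-picked feature sets.
labeled = [
    ({"color": "red", "shape": "round"}, "apple"),
    ({"color": "yellow", "shape": "long"}, "banana"),
    ({"color": "red", "shape": "long"}, "pepper"),
]
feature_sets = [("color",), ("shape",)]

prediction = classify({"color": "red", "shape": "round"}, labeled, feature_sets)
print(prediction)
```

Because neighborhood is defined by exact agreement on attribute subsets, no distance metric over categorical values is needed, which matches the abstract's description of CRN as nonmetric.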
Citations: 4
Direct Robust Matrix Factorization for Anomaly Detection
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.52
L. Xiong, X. Chen, J. Schneider
Matrix factorization methods are extremely useful in many data mining tasks, yet their performances are often degraded by outliers. In this paper, we propose a novel robust matrix factorization algorithm that is insensitive to outliers. We directly formulate robust factorization as a matrix approximation problem with constraints on the rank of the matrix and the cardinality of the outlier set. Then, unlike existing methods that resort to convex relaxations, we solve this problem directly and efficiently. In addition, structural knowledge about the outliers can be incorporated to find outliers more effectively. We applied this method in anomaly detection tasks on various data sets. Empirical results show that this new algorithm is effective in robust modeling and anomaly detection, and our direct solution achieves superior performance over the state-of-the-art methods based on the L1-norm and the nuclear norm of matrices.
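One plausible way to attack a factorization with a hard rank constraint and a hard cap on the number of outliers is to alternate between the two: fit a truncated SVD to the data with the current outliers removed, then re-select the largest residual entries as outliers. The sketch below does exactly that; it is an illustration under that assumption, not a reproduction of the authors' algorithm, and the function name, iteration count, and toy matrix are hypothetical.

```python
import numpy as np

def robust_factorize(X, rank, n_outliers, n_iter=20):
    """Alternate a rank-constrained fit with a cardinality-constrained outlier pick."""
    S = np.zeros_like(X)  # sparse outlier matrix
    for _ in range(n_iter):
        # Best rank-`rank` approximation of the outlier-free part.
        U, s, Vt = np.linalg.svd(X - S, full_matrices=False)
        L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
        # Keep only the n_outliers largest-magnitude residuals as outliers.
        R = X - L
        idx = np.argsort(np.abs(R), axis=None)[-n_outliers:]
        rows, cols = np.unravel_index(idx, X.shape)
        S = np.zeros_like(X)
        S[rows, cols] = R[rows, cols]
    return L, S

# Toy data: a rank-1 matrix with a single corrupted entry.
X = np.ones((20, 15))
X[3, 4] += 10.0
L, S = robust_factorize(X, rank=1, n_outliers=1)
print(np.argwhere(S != 0))
```

Both sub-steps have closed-form solutions, so no convex relaxation (such as replacing rank with the nuclear norm or cardinality with the L1-norm) is involved in this sketch.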
Citations: 106
Semi-supervised Feature Importance Evaluation with Ensemble Learning
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.129
H. Barkia, H. Elghazel, A. Aussem
We consider the problem of using a large amount of unlabeled data to improve the efficiency of feature selection in high-dimensional datasets when only a small set of labeled examples is available. We propose a new semi-supervised feature importance evaluation method (SSFI for short) that combines ideas from co-training and random forests with a new permutation-based out-of-bag feature importance measure. We provide empirical results on several benchmark datasets indicating that SSFI can lead to significant improvement over state-of-the-art semi-supervised and supervised algorithms.
Citations: 15
Using Bayesian Network Learning Algorithm to Discover Causal Relations in Multivariate Time Series
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.153
Zhenxing Wang, L. Chan
Many applications naturally involve time series data, and vector autoregression (VAR) and structural VAR (SVAR) are the dominant tools for investigating relations between variables in time series. In the first part of this work, we show that the SVAR method is incapable of identifying contemporaneous causal relations when data follow Gaussian distributions. In addition, least squares estimators become unreliable when the scale of the problem is large and observations are limited. In the remaining part, we propose an approach that applies Bayesian network learning algorithms to identify SVARs from time series data in order to capture both temporal and contemporaneous causal relations and avoid high-order statistical tests. The difficulty of applying Bayesian network learning algorithms to time series is that the networks corresponding to time series tend to be large, so these algorithms require high-order statistical tests. To overcome this difficulty, we show that the search space of conditioning sets d-separating two vertices should be subsets of Markov blankets. Based on this fact, we propose an algorithm that learns Bayesian networks locally and makes the largest order of statistical tests independent of the scale of the problem. Empirical results show that our algorithm outperforms existing methods in terms of both efficiency and accuracy.
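For readers unfamiliar with the two models named in the abstract, their textbook forms can be written as follows. The notation (lag order p, coefficient matrices A and B, noise terms u and ε) is assumed here for illustration and is not taken from the paper.

```latex
% Reduced-form VAR of order p: only lagged values appear on the right.
x_t = \sum_{\tau=1}^{p} A_\tau \, x_{t-\tau} + u_t

% Structural VAR: B_0 (zero diagonal) holds the contemporaneous effects.
x_t = B_0 \, x_t + \sum_{\tau=1}^{p} B_\tau \, x_{t-\tau} + \varepsilon_t

% Solving the structural form for x_t relates the two:
A_\tau = (I - B_0)^{-1} B_\tau, \qquad u_t = (I - B_0)^{-1} \varepsilon_t
```

The last line hints at why the Gaussian case is hard: a Gaussian u_t is fully described by its covariance, and many different choices of B_0 produce the same covariance, so the contemporaneous structure cannot be read off from the reduced form alone.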
Citations: 8
A Generalized Fast Subset Sums Framework for Bayesian Event Detection
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.11
Kanghong Shao, Yandong Liu, Daniel B. Neill
We present Generalized Fast Subset Sums (GFSS), a new Bayesian framework for scalable and accurate detection of irregularly shaped spatial clusters using multiple data streams. GFSS extends the previously proposed Multivariate Bayesian Scan Statistic (MBSS) and Fast Subset Sums (FSS) approaches for detection of emerging events. The detection power of MBSS is primarily limited by computational considerations, which limit it to searching over circular spatial regions. GFSS enables more accurate and timely detection by defining a hierarchical prior over all subsets of the N locations, first selecting a local neighborhood consisting of a center location and its neighbors, and introducing a sparsity parameter p to describe how likely each location in the neighborhood is to be affected. This approach allows us to consider all possible subsets of locations (including irregularly-shaped regions) but also puts higher weight on more compact regions. We demonstrate that MBSS and FSS are both special cases of this general framework (assuming p = 1 and p = 0.5 respectively), but substantially higher detection power can be achieved by choosing an appropriate value of p. Thus we show that the distribution of the sparsity parameter p can be accurately learned from a small number of labeled events. Our evaluation results (on synthetic disease outbreaks injected into real-world hospital data) show that the GFSS method with learned sparsity parameter has higher detection power and spatial accuracy than MBSS and FSS, particularly when the affected region is irregular or elongated. We also show that the learned models can be used for event characterization, accurately distinguishing between two otherwise identical event types based on the sparsity of the affected spatial region.
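The role of the sparsity parameter p can be sketched with a few lines of code, assuming each location in a chosen neighborhood is affected independently with probability p (an assumption consistent with the abstract, but not confirmed by it). Under that assumption the prior weight of a subset depends only on its size; the function name and the neighborhood size are hypothetical.

```python
import math

def subset_prior(subset_size, neighborhood_size, p):
    """Prior weight of one specific subset when each location is included
    independently with probability p (an assumed reading of the abstract)."""
    return p ** subset_size * (1 - p) ** (neighborhood_size - subset_size)

k = 4  # hypothetical neighborhood: a center location plus three neighbors

# p = 1: all weight sits on the full neighborhood (the circular-region case).
full_only = [subset_prior(s, k, 1.0) for s in range(k + 1)]

# p = 0.5: every subset of the neighborhood gets the same weight.
uniform = [subset_prior(s, k, 0.5) for s in range(k + 1)]

# For any p, the weights over all 2**k subsets sum to one.
total = sum(math.comb(k, s) * subset_prior(s, k, 0.3) for s in range(k + 1))
print(full_only, uniform, total)
```

Note that this simple form also assigns weight to the empty subset, which a real detector would presumably exclude and renormalize away. Intermediate values of p shift weight between sparse and dense subsets, which is the knob the abstract proposes to learn from labeled events.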
Citations: 10
SLIM: Sparse Linear Methods for Top-N Recommender Systems
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.134
Xia Ning, G. Karypis
This paper focuses on developing effective and efficient algorithms for top-N recommender systems. A novel Sparse Linear Method (SLIM) is proposed, which generates top-N recommendations by aggregating from user purchase/rating profiles. A sparse aggregation coefficient matrix W is learned from SLIM by solving an ℓ1-norm and ℓ2-norm regularized optimization problem. W is demonstrated to produce high quality recommendations and its sparsity allows SLIM to generate recommendations very fast. A comprehensive set of experiments is conducted by comparing the SLIM method and other state-of-the-art top-N recommendation methods. The experiments show that SLIM achieves significant improvements both in run time performance and recommendation quality over the best existing methods.
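Once a coefficient matrix W is available, producing recommendations is a single sparse aggregation over the user's profile. The sketch below shows only that scoring step with a hand-written W; learning W through the regularized optimization is not shown. The function name, the masking of already-seen items, and the toy matrices are assumptions for illustration.

```python
import numpy as np

def top_n(A, W, user, n):
    """Score unseen items for one user as a weighted sum of the items
    already in the user's profile, then return the n best item indices."""
    scores = A[user] @ W
    # Items the user already has are not recommended again.
    scores = np.where(A[user] > 0, -np.inf, scores)
    ranked = np.argsort(-scores)[:n]
    return [int(i) for i in ranked if np.isfinite(scores[i])]

# Hypothetical user-item matrix (2 users, 4 items).
A = np.array([[1.0, 0.0, 1.0, 0.0],
              [0.0, 1.0, 0.0, 1.0]])

# Hypothetical item-item coefficients with a zero diagonal.
W = np.zeros((4, 4))
W[0, 1] = 0.2
W[0, 3] = 0.5
W[2, 1] = 0.4

recommended = top_n(A, W, user=0, n=2)
print(recommended)
```

With a sparse W, each score touches only the few nonzero coefficients in the relevant rows, which is why the abstract credits sparsity for fast recommendation.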
Citations: 672
Tag Clustering and Refinement on Semantic Unity Graph
Pub Date : 2011-12-11 DOI: 10.1109/ICDM.2011.141
Yang Liu, Fei Wu, Yin Zhang, Jian Shao, Yueting Zhuang
Recently, there has been extensive research on user-provided tags on photo-sharing websites, which can greatly facilitate image retrieval and management. However, due to the arbitrariness of tagging activities, these tags are often imprecise and incomplete. As a result, quite a few technologies have been proposed to improve the user experience on these photo-sharing systems, including tag clustering and refinement. In this work, we propose a novel framework to model the relationships among tags and images, which can be applied to many tag-based applications. Unlike previous approaches, which model images and tags as heterogeneous objects, our framework views images and their tags uniformly as compositions of Semantic Unities. The Semantic Unity Graph (SUG) is then introduced to represent the complex and high-order relationships among these Semantic Unities. Based on the Semantic Unity Graph representation, the relevance of images and tags can be naturally measured in terms of the similarity of their Semantic Unities. Tag clustering and refinement can then be performed on the SUG, and the polysemy of images and tags is explicitly considered in this framework. Experimental results on the NUS-WIDE and MIR-Flickr datasets demonstrate the effectiveness and efficiency of the proposed approach.
Citations: 17