Proceedings. IEEE International Conference on Data Mining: Latest Articles (Page 2)

Robust Multi-Network Clustering via Joint Cross-Domain Cluster Alignment.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2015-11-01 DOI: 10.1109/ICDM.2015.13

Rui Liu, Wei Cheng, Hanghang Tong, Wei Wang, Xiang Zhang

Network clustering is an important problem that has recently drawn a lot of attention. Most existing work focuses on clustering nodes within a single network. In many applications, however, there are multiple related networks: each network may be constructed from a different domain, and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over existing single-network clustering methods. First, it can detect associations between clusters from different domains, which no existing method addresses. Second, it achieves more consistent clustering results across multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA.

Volume: 2015 Pages: 291-300 Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4880426/pdf/nihms785953.pdf

Citations: 0

SimNest: Social Media Nested Epidemic Simulation via Online Semi-supervised Deep Learning.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2015-11-01 DOI: 10.1109/ICDM.2015.39

Liang Zhao, Jiangzhuo Chen, Feng Chen, Wei Wang, Chang-Tien Lu, Naren Ramakrishnan

Infectious disease epidemics such as influenza and Ebola pose a serious threat to global public health. It is crucial to characterize the disease and the evolution of the ongoing epidemic efficiently and accurately. Computational epidemiology can model the disease progress and the underlying contact network, but suffers from a lack of real-time, fine-grained surveillance data. Social media, on the other hand, provides timely and detailed disease surveillance, but is blind to the underlying contact network and disease model. This paper proposes a novel semi-supervised deep learning framework that integrates the strengths of computational epidemiology and social media mining techniques. Specifically, this framework learns social media users' health states and intervention actions in real time, regularized by the underlying disease model and contact network. Conversely, the knowledge learned from social media can be fed into the computational epidemic model to improve the efficiency and accuracy of disease diffusion modeling. We propose an online optimization algorithm that carries out this interactive learning process iteratively until the two components reach a consistent state. Extensive experimental results demonstrate that our approach can effectively characterize spatio-temporal disease diffusion, outperforming competing methods by a substantial margin on multiple metrics.

Volume: 2015 Pages: 639-648

Citations: 64

Tensor-based Multi-view Feature Selection with Applications to Brain Diseases.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2014-12-01 DOI: 10.1109/ICDM.2014.26

Bokai Cao, Lifang He, Xiangnan Kong, Philip S Yu, Zhifeng Hao, Ann B Ragin

In the era of big data, we can easily access information from multiple views, which may be obtained from different sources or feature subsets. Generally, different views provide complementary information for learning tasks. Thus, multi-view learning can facilitate the learning process and is prevalent in a wide range of application domains. For example, in medical science, measurements from a series of medical examinations are documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures obtained from multiple sources. Specifically, for brain diagnosis, different quantitative analyses can be seen as different feature subsets of a subject. It is desirable to combine all these features in an effective way for disease diagnosis. However, some measurements from less relevant medical examinations can introduce irrelevant information, which can even be amplified after the views are combined. Feature selection should therefore be incorporated into the process of multi-view learning. In this paper, we use the tensor product to bring different views together in a joint space, and present a dual method of tensor-based multi-view feature selection (dual-Tmfs) based on the idea of support vector machine recursive feature elimination. Experiments conducted on datasets derived from neurological disorder demonstrate that the features selected by our proposed method yield better classification performance and are relevant to disease diagnosis.

Volume: 2014 Pages: 40-49 Open-access PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4415282/pdf/nihms683152.pdf

Citations: 0
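
As a rough illustration of the tensor-product idea in the dual-Tmfs abstract above, the sketch below combines one subject's feature vectors from several views into a joint space via the outer product. The function name `tensor_product` and the toy view values are hypothetical, and the feature-selection step (SVM recursive feature elimination) is not shown.

```python
from itertools import product
from math import prod


def tensor_product(*views):
    """Outer product of several 1-D feature vectors (one per view).

    Returns a dict mapping an index tuple (i1, ..., im), one index per view,
    to the product views[0][i1] * ... * views[m-1][im].
    """
    index_ranges = [range(len(view)) for view in views]
    return {
        idx: prod(view[i] for view, i in zip(views, idx))
        for idx in product(*index_ranges)
    }


# Toy example (made-up values): a 2-feature view and a 3-feature view.
clinical = [1.0, 2.0]
imaging = [3.0, 4.0, 5.0]
joint = tensor_product(clinical, imaging)
print(len(joint))     # 6 joint features (2 * 3)
print(joint[(1, 2)])  # 10.0 (2.0 * 5.0)
```

The joint space grows multiplicatively with the number of features per view, which is one reason selecting features inside the multi-view model matters.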

RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2014-01-01 DOI: 10.1109/ICDM.2014.45

Ke Wu, Kun Zhang, Wei Fan, Andrea Edwards, Philip S Yu

Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator, named RS-Forest, implemented by multiple fully randomized space trees (RS-Trees). The piecewise constant density estimate of each RS-Tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation with a high-probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS-Forest to systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons with state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features a high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request.

Volume: 2014 Pages: 600-609

Citations: 89
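
The RS-Forest abstract above describes scoring each instance by a piecewise constant density estimate averaged over randomized space trees. The following is a minimal batch sketch of that scoring idea only, under several assumptions: fixed attribute ranges supplied by the caller, fixed tree depth, and no streaming update (the attribute range estimation and dual node profiles from the abstract are not modeled). All names (`RandomSpaceForest`, `build_tree`, and so on) are hypothetical.

```python
import random


def build_tree(bounds, depth, rng):
    """Recursively split the box `bounds` with random axis-aligned cuts."""
    if depth == 0:
        return {"bounds": bounds, "count": 0}
    dim = rng.randrange(len(bounds))
    lo, hi = bounds[dim]
    cut = rng.uniform(lo, hi)
    left, right = list(bounds), list(bounds)
    left[dim], right[dim] = (lo, cut), (cut, hi)
    return {
        "dim": dim,
        "cut": cut,
        "left": build_tree(left, depth - 1, rng),
        "right": build_tree(right, depth - 1, rng),
    }


def find_leaf(node, x):
    """Follow the cuts down to the leaf whose box contains x."""
    while "dim" in node:
        node = node["left"] if x[node["dim"]] < node["cut"] else node["right"]
    return node


def box_volume(bounds):
    volume = 1.0
    for lo, hi in bounds:
        volume *= hi - lo
    return max(volume, 1e-12)  # guard against a degenerate cut


class RandomSpaceForest:
    def __init__(self, bounds, n_trees=50, depth=6, seed=0):
        rng = random.Random(seed)
        self.trees = [build_tree(list(bounds), depth, rng) for _ in range(n_trees)]
        self.n_seen = 0

    def fit(self, points):
        """Count how many training instances fall into each leaf."""
        for x in points:
            for tree in self.trees:
                find_leaf(tree, x)["count"] += 1
        self.n_seen += len(points)

    def score(self, x):
        """Average leaf density over all trees; low values suggest anomalies."""
        if self.n_seen == 0:
            return 0.0
        total = 0.0
        for tree in self.trees:
            leaf = find_leaf(tree, x)
            total += leaf["count"] / (self.n_seen * box_volume(leaf["bounds"]))
        return total / len(self.trees)


# Toy data: "normal" points clustered around (0.5, 0.5) in the unit square.
data_rng = random.Random(1)
normal = [
    (min(max(data_rng.gauss(0.5, 0.05), 0.0), 1.0),
     min(max(data_rng.gauss(0.5, 0.05), 0.0), 1.0))
    for _ in range(200)
]
forest = RandomSpaceForest(bounds=[(0.0, 1.0), (0.0, 1.0)])
forest.fit(normal)
print(forest.score((0.5, 0.5)), forest.score((0.02, 0.98)))
```

Because the trees are built without looking at the data, only the leaf counts need updating as instances arrive, which is what makes this family of estimators attractive for streams.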

Learning Protein Folding Energy Functions.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2011-12-01 DOI: 10.1109/ICDM.2011.88

Wei Guan, Arkadas Ozakin, Alexander Gray, Jose Borreguero, Shashi Pandit, Anna Jagielska, Liliana Wroblewska, Jeffrey Skolnick

A critical open problem in ab initio protein folding is protein energy function design, which pertains to defining the energy of protein conformations in a way that makes folding most efficient and reliable. In this paper, we cast this issue as a weight optimization problem and solve it with a machine learning approach, learning-to-rank. We investigate the ranking-via-classification approach, especially the RankingSVM method, and compare it with the state-of-the-art approach to the problem, which uses the MINUIT optimization package. To maintain the physicality of the results, we impose non-negativity constraints on the weights. For this we develop two efficient non-negative support vector machine (NNSVM) methods, derived from L2-norm and L1-norm SVMs, respectively. We demonstrate an energy function that more often maintains the correct ordering with respect to structure dissimilarity to the native state, is more efficient and reliable for learning on large protein sets, and is qualitatively superior to the current state-of-the-art energy function.

Pages: 1062-1067

Citations: 8
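
The energy-function abstract above frames weight learning as pairwise ranking with non-negative weights. The sketch below illustrates that framing with a plain projected subgradient method on an L2-regularized pairwise hinge loss. This is a swapped-in, generic optimizer, not the NNSVM solvers developed in the paper, and the feature values, `train_nonneg_ranker`, and `energy` are hypothetical.

```python
def train_nonneg_ranker(pairs, lr=0.1, lam=0.01, epochs=200):
    """Learn non-negative weights w so that energy(better) < energy(worse).

    pairs: list of (better, worse) feature vectors, where `better` is the
    conformation closer to the native state (it should get lower energy).
    """
    dim = len(pairs[0][0])
    w = [0.0] * dim
    # Each pair contributes the constraint w . (worse - better) >= 1.
    diffs = [[wf - bf for bf, wf in zip(better, worse)] for better, worse in pairs]
    for _ in range(epochs):
        grad = [lam * wi for wi in w]            # gradient of (lam / 2) * |w|^2
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:                     # hinge loss is active
                for j in range(dim):
                    grad[j] -= d[j]
        # Gradient step followed by projection onto the set w >= 0.
        w = [max(0.0, wi - lr * gi) for wi, gi in zip(w, grad)]
    return w


def energy(w, features):
    """Energy as a weighted sum of energy terms."""
    return sum(wi * fi for wi, fi in zip(w, features))


# Made-up (better, worse) pairs with two energy terms each.
pairs = [
    ((1.0, 0.5), (3.0, 0.2)),
    ((0.5, 0.9), (2.0, 1.0)),
    ((2.0, 0.1), (4.0, 0.3)),
]
weights = train_nonneg_ranker(pairs)
print(weights)
```

The projection step `max(0.0, ...)` is what enforces the non-negativity constraint that the abstract motivates on physical grounds.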

Learning classification with auxiliary probabilistic information.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2011-01-01 DOI: 10.1109/ICDM.2011.84

Quang Nguyen, Hamed Valizadegan, Milos Hauskrecht

Finding ways of incorporating auxiliary information or auxiliary data into the learning process has been an active topic of data mining and machine learning research in recent years. In this work we study and develop a new framework for the classification learning problem in which, in addition to class labels, the learner is provided with auxiliary (probabilistic) information that reflects how strongly the expert feels about the class label. This approach can be extremely useful for many practical classification tasks that rely on subjective label assessment and where the cost of acquiring additional auxiliary information is negligible compared to the cost of example analysis and labelling. We develop classification algorithms capable of using the auxiliary information to make the learning process more efficient in terms of sample complexity. We demonstrate the benefit of the approach on a number of synthetic and real-world data sets by comparing it to learning with class labels only.

Citations: 28
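
One simple way to picture learning from the auxiliary probabilistic information described in the abstract above is to train a classifier against the expert's probabilities rather than hard 0/1 labels. The sketch below does this with a one-dimensional logistic regression and a cross-entropy loss on soft targets. It is an assumption-laden illustration, not the algorithms developed in the paper, and the data and the name `fit_soft_logistic` are made up.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_soft_logistic(xs, probs, lr=0.5, steps=500):
    """1-D logistic regression fitted to expert probabilities by gradient descent.

    xs:    feature values
    probs: expert-supplied probability that each example is positive
    """
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(steps):
        # Gradient of the mean cross-entropy with soft targets.
        grad_w = sum((sigmoid(w * x + b) - p) * x for x, p in zip(xs, probs)) / n
        grad_b = sum(sigmoid(w * x + b) - p for x, p in zip(xs, probs)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b


# Made-up data: the expert is less sure about examples near x = 0.
xs = [-2.0, -1.0, -0.5, 0.5, 1.0, 2.0]
probs = [0.05, 0.2, 0.4, 0.6, 0.8, 0.95]
w, b = fit_soft_logistic(xs, probs)
print(sigmoid(w * 2.0 + b), sigmoid(w * -2.0 + b))
```

With hard labels, the examples at x = -0.5 and x = 0.5 would look just as certain as those at x = -2 and x = 2; the soft targets let the learner use the expert's graded confidence.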

Conditional Anomaly Detection with Soft Harmonic Functions.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2011-01-01 DOI: 10.1109/ICDM.2011.40

Michal Valko, Branislav Kveton, Hamed Valizadegan, Gregory F Cooper, Milos Hauskrecht

In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method on several synthetic and UCI ML datasets in detecting unusual labels when compared to several baseline approaches. We also evaluate the performance of our method on a real-world electronic health record dataset where we seek to identify unusual patient-management decisions.

Citations: 22
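
To make the soft harmonic idea in the abstract above concrete, the sketch below computes soft labels on a similarity graph and flags examples whose given label disagrees with the soft label. It assumes one common formulation, minimizing sum_i c (l_i - y_i)^2 + l^T (L + gamma I) l with graph Laplacian L, solved here by Jacobi iteration; the exact objective, the parameters `confidence` and `gamma`, and all function names are assumptions rather than details taken from the abstract.

```python
def soft_harmonic(weights, labels, confidence=1.0, gamma=0.1, iters=500):
    """Soft labels l solving (C + L + gamma*I) l = C y by Jacobi iteration.

    weights: symmetric similarity matrix (list of lists), diagonal ignored
    labels:  observed labels y_i in {-1, +1}
    """
    n = len(labels)
    degree = [sum(weights[i][j] for j in range(n) if j != i) for i in range(n)]
    soft = [0.0] * n
    for _ in range(iters):
        soft = [
            (confidence * labels[i]
             + sum(weights[i][j] * soft[j] for j in range(n) if j != i))
            / (confidence + degree[i] + gamma)
            for i in range(n)
        ]
    return soft


def mislabel_scores(soft, labels):
    """Positive when the soft label contradicts the observed label."""
    return [-y * s for y, s in zip(labels, soft)]


# Five mutually similar examples; the last one carries a conflicting label.
n = 5
weights = [[0.0 if i == j else 1.0 for j in range(n)] for i in range(n)]
labels = [1, 1, 1, 1, -1]
soft = soft_harmonic(weights, labels)
scores = mislabel_scores(soft, labels)
print([round(s, 3) for s in scores])
```

In this formulation `gamma` shrinks soft labels toward zero for weakly connected examples, so isolated points receive low-confidence scores instead of being flagged, in the spirit of the regularization the abstract mentions.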

Anomaly Detection Using an Ensemble of Feature Models.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2010-12-13 DOI: 10.1109/ICDM.2010.140

Keith Noto, Carla Brodley, Donna Slonim

We present a new approach to semi-supervised anomaly detection. Given a set of training examples believed to come from the same distribution or class, the task is to learn a model that will be able to distinguish examples in the future that do not belong to the same class. Traditional approaches typically compare the position of a new data point to the set of "normal" training data points in a chosen representation of the feature space. For some data sets, the normal data may not have discernible positions in feature space, but do have consistent relationships among some features that fail to appear in the anomalous examples. Our approach learns to predict the values of training set features from the values of other features. After we have formed an ensemble of predictors, we apply this ensemble to new data points. To combine the contribution of each predictor in our ensemble, we have developed a novel, information-theoretic anomaly measure that our experimental results show selects against noisy and irrelevant features. Our results on 47 data sets show that for most data sets, this approach significantly improves performance over current state-of-the-art feature space distance and density-based approaches.

Pages: 953-958

Citations: 41
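
The feature-model abstract above describes learning to predict each feature from the others and combining the predictors into an anomaly measure. The sketch below is a simplified stand-in: each "predictor" is a Laplace-smoothed lookup table over discrete features, and the score is the mean surprisal (negative log-probability) of the observed values. The paper's own information-theoretic measure is not reproduced here, and the names `fit_feature_models` and `anomaly_score` are hypothetical.

```python
import math
from collections import Counter, defaultdict


def fit_feature_models(rows):
    """For each feature, tabulate its values conditioned on all other features.

    rows: list of equal-length tuples of discrete feature values.
    """
    n_features = len(rows[0])
    models = []
    for target in range(n_features):
        table = defaultdict(Counter)
        values = set()
        for row in rows:
            context = row[:target] + row[target + 1:]
            table[context][row[target]] += 1
            values.add(row[target])
        models.append((table, values))
    return models


def anomaly_score(models, row):
    """Mean surprisal (in bits) of each feature given the other features."""
    total = 0.0
    for target, (table, values) in enumerate(models):
        context = row[:target] + row[target + 1:]
        counts = table.get(context, Counter())
        n_outcomes = len(values) + 1  # one extra slot for an unseen value
        p = (counts[row[target]] + 1) / (sum(counts.values()) + n_outcomes)
        total += -math.log2(p)
    return total / len(models)


# Normal data: the two features always agree, but neither value is rare.
train = [(0, 0), (1, 1)] * 20
models = fit_feature_models(train)
print(anomaly_score(models, (0, 0)))  # consistent with training data
print(anomaly_score(models, (0, 1)))  # breaks the learned relationship
```

The toy data mirrors the abstract's point: `(0, 1)` is not unusual in either feature on its own, only in the relationship between them.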

Abstraction Augmented Markov Models.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2010-12-13 DOI: 10.1109/ICDM.2010.158

Cornelia Caragea, Adrian Silvescu, Doina Caragea, Vasant Honavar

High-accuracy sequence classification often requires the use of higher-order Markov models (MMs). However, the number of MM parameters increases exponentially with the range of direct dependencies between sequence elements, thereby increasing the risk of overfitting when the data set is limited in size. We present abstraction augmented Markov models (AAMMs) that effectively reduce the number of numeric parameters of kth-order MMs by successively grouping strings of length k (i.e., k-grams) into abstraction hierarchies. We evaluate AAMMs on three protein subcellular localization prediction tasks. The results of our experiments show that abstraction makes it possible to construct predictive models that use a significantly smaller number of features (by one to three orders of magnitude) than MMs. AAMMs are competitive with and, in some cases, significantly outperform MMs. Moreover, the results show that AAMMs often perform significantly better than variable-order Markov models, such as decomposed context tree weighting, prediction by partial match, and probabilistic suffix trees.

Pages: 68-77

Citations: 6
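
The parameter blow-up mentioned in the AAMM abstract above is easy to quantify. A kth-order Markov model over an alphabet of size A has A^k possible contexts, each with A - 1 free next-symbol probabilities. The sketch below compares that count with a model in which the k-grams are grouped into a fixed number of abstractions, assuming each abstraction carries a single next-symbol distribution; the figure of 100 abstractions is purely illustrative.

```python
def markov_param_count(alphabet_size, k):
    """Free parameters of a k-th order Markov model (one sequence class)."""
    return alphabet_size ** k * (alphabet_size - 1)


def abstracted_param_count(num_abstractions, alphabet_size):
    """Free parameters when contexts are grouped into `num_abstractions` sets."""
    return num_abstractions * (alphabet_size - 1)


# 20-letter amino-acid alphabet, as in protein sequence tasks.
for k in (1, 2, 3):
    print(k, markov_param_count(20, k))   # 380, 7600, 152000
print(abstracted_param_count(100, 20))     # 1900
```

Going from the 8,000 distinct 3-grams to 100 abstractions cuts the parameter count by a factor of 80, which is within the one to three orders of magnitude the abstract reports for feature counts.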

A Sparsification Approach for Temporal Graphical Model Decomposition.

Proceedings. IEEE International Conference on Data Mining

Pub Date : 2009-12-06 DOI: 10.1109/ICDM.2009.67

Ning Ruan, Ruoming Jin, Victor E Lee, Kun Huang

Temporal causal modeling can be used to recover the causal structure among a group of relevant time series variables. Several methods have been developed to explicitly construct temporal causal graphical models. However, how to best understand and conceptualize these complicated causal relationships is still an open problem. In this paper, we propose a decomposition approach to simplify the temporal graphical model. Our method clusters time series variables into groups such that strong interactions appear among the variables within each group and weak (or no) interactions exist for cross-group variable pairs. Specifically, we formulate the clustering problem for temporal graphical models as a regression-coefficient sparsification problem and define an interesting objective function which balances the model prediction power and its cluster structure. We introduce an iterative optimization approach utilizing the Quasi-Newton method and generalized ridge regression to minimize the objective function and to produce a clustered temporal graphical model. We also present a novel optimization procedure utilizing a graph theoretical tool based on the maximum weight independent set problem to speed up the Quasi-Newton method for a large number of variables. Finally, our detailed experimental study on both synthetic and real datasets demonstrates the effectiveness of our methods.

Volume: 2009 Pages: 447-456

Citations: 3
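
The decomposition goal in the abstract above (strong interactions within groups, weak or none across groups) can be pictured with a much cruder procedure than the paper's optimization: threshold a matrix of regression coefficients and take connected components. The sketch below does exactly that with a union-find structure. It is a swapped-in illustration of the target structure, not the Quasi-Newton and generalized ridge regression method the authors describe, and the coefficient values and threshold are made up.

```python
def cluster_by_coefficients(coef, threshold):
    """Group variables linked by coefficients with |value| >= threshold.

    coef[i][j] is the (hypothetical) regression coefficient of variable j
    when predicting variable i from lagged values.
    """
    n = len(coef)
    parent = list(range(n))

    def find(i):
        # Find the set representative, compressing the path as we go.
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(n):
            if i != j and abs(coef[i][j]) >= threshold:
                parent[find(i)] = find(j)

    groups = {}
    for i in range(n):
        groups.setdefault(find(i), []).append(i)
    return sorted(groups.values())


# Made-up coefficients: variables 0 and 1 interact, as do 2 and 3.
coef = [
    [0.90, 0.80, 0.01, 0.00],
    [0.70, 0.90, 0.00, 0.02],
    [0.00, 0.01, 0.90, 0.60],
    [0.02, 0.00, 0.50, 0.90],
]
print(cluster_by_coefficients(coef, threshold=0.1))  # [[0, 1], [2, 3]]
```

A hard threshold ignores prediction accuracy entirely; the objective function in the abstract is described as balancing prediction power against cluster structure instead.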