Proceedings. IEEE International Conference on Data Mining: Latest Articles

New Probabilistic Multi-Graph Decomposition Model to Identify Consistent Human Brain Network Modules.
Pub Date : 2016-12-01 Epub Date: 2017-02-02 DOI: 10.1109/ICDM.2016.0041
Dijun Luo, Zhouyuan Huo, Yang Wang, Andrew J Saykin, Li Shen, Heng Huang

Many recent scientific efforts have been devoted to constructing the human connectome from Diffusion Tensor Imaging (DTI) data in order to understand the large-scale brain networks that underlie higher-level cognition in humans. However, suitable computational tools for network analysis are still lacking in human brain connectivity research. To address this problem, we propose a novel probabilistic multi-graph decomposition model to identify consistent network modules from the brain connectivity networks of the studied subjects. First, we propose a new probabilistic graph decomposition model to address the high computational complexity of existing stochastic block models. We then extend this model to multiple networks/graphs to identify the modules shared across multiple brain networks by simultaneously incorporating multiple networks and predicting the hidden block state variables. We also derive an efficient optimization algorithm to solve the proposed objective and estimate the model parameters. We validate our method by analyzing both the weighted fiber connectivity networks constructed from DTI images and standard human face image clustering benchmark data sets. The promising empirical results demonstrate the superior performance of our proposed method.

Proceedings. IEEE International Conference on Data Mining, pp. 301-310. Publication type: Journal Article.
Citations: 1
Robust Multi-Network Clustering via Joint Cross-Domain Cluster Alignment.
Pub Date : 2015-11-01 DOI: 10.1109/ICDM.2015.13
Rui Liu, Wei Cheng, Hanghang Tong, Wei Wang, Xiang Zhang

Network clustering is an important problem that has recently drawn a lot of attention. Most existing work focuses on clustering nodes within a single network. In many applications, however, there exist multiple related networks, in which each network may be constructed from a different domain and instances in one domain may be related to instances in other domains. In this paper, we propose a robust algorithm, MCA, for multi-network clustering that takes into account cross-domain relationships between instances. MCA has several advantages over the existing single-network clustering methods. First, it is able to detect associations between clusters from different domains, which is not addressed by any existing method. Second, it achieves more consistent clustering results on multiple networks by leveraging the duality between clustering individual networks and inferring cross-network cluster alignment. Finally, it provides a multi-network clustering solution that is more robust to noise and errors. We perform extensive experiments on a variety of real and synthetic networks to demonstrate the effectiveness and efficiency of MCA.

Proceedings. IEEE International Conference on Data Mining, vol. 2015, pp. 291-300. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4880426/pdf/nihms785953.pdf
Citations: 0
SimNest: Social Media Nested Epidemic Simulation via Online Semi-supervised Deep Learning.
Pub Date : 2015-11-01 DOI: 10.1109/ICDM.2015.39
Liang Zhao, Jiangzhuo Chen, Feng Chen, Wei Wang, Chang-Tien Lu, Naren Ramakrishnan

Infectious disease epidemics such as influenza and Ebola pose a serious threat to global public health. It is crucial to characterize the disease and the evolution of the ongoing epidemic efficiently and accurately. Computational epidemiology can model the disease progress and the underlying contact network, but suffers from a lack of real-time and fine-grained surveillance data. Social media, on the other hand, provides timely and detailed disease surveillance, but is unaware of the underlying contact network and disease model. This paper proposes a novel semi-supervised deep learning framework that integrates the strengths of computational epidemiology and social media mining techniques. Specifically, this framework learns the social media users' health states and intervention actions in real time, regularized by the underlying disease model and contact network. Conversely, the knowledge learned from social media can be fed into the computational epidemic model to improve the efficiency and accuracy of disease diffusion modeling. We propose an online optimization algorithm that realizes this interactive learning process iteratively to achieve a consistent stage of the integration. Extensive experimental results demonstrate that our approach can effectively characterize the spatio-temporal disease diffusion, outperforming competing methods by a substantial margin on multiple metrics.

Proceedings. IEEE International Conference on Data Mining, vol. 2015, pp. 639-648. Publication type: Journal Article.
Citations: 64
Tensor-based Multi-view Feature Selection with Applications to Brain Diseases.
Pub Date : 2014-12-01 DOI: 10.1109/ICDM.2014.26
Bokai Cao, Lifang He, Xiangnan Kong, Philip S Yu, Zhifeng Hao, Ann B Ragin

In the era of big data, we can easily access information from multiple views, which may be obtained from different sources or feature subsets. Generally, different views provide complementary information for learning tasks. Thus, multi-view learning can facilitate the learning process and is prevalent in a wide range of application domains. For example, in medical science, measurements from a series of medical examinations are documented for each subject, including clinical, imaging, immunologic, serologic and cognitive measures obtained from multiple sources. Specifically, for brain diagnosis, we can have different quantitative analyses, which can be seen as different feature subsets of a subject. It is desirable to combine all these features in an effective way for disease diagnosis. However, some measurements from less relevant medical examinations can introduce irrelevant information, which can even be exaggerated after the views are combined. Feature selection should therefore be incorporated in the process of multi-view learning. In this paper, we explore the tensor product to bring different views together in a joint space, and present a dual method of tensor-based multi-view feature selection (dual-Tmfs) based on the idea of support vector machine recursive feature elimination. Experiments conducted on datasets derived from a neurological disorder demonstrate that the features selected by our proposed method yield better classification performance and are relevant to disease diagnosis.
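
As a rough illustration of what "bringing different views together in a joint space" via the tensor product can mean, the sketch below forms the outer product of one subject's per-view feature vectors, so every joint entry is a product of one feature from each view. The view names and values are hypothetical, and the sketch does not reproduce dual-Tmfs or the recursive feature elimination step.

```python
from itertools import product

def tensor_product(views):
    """Return {index tuple: product of one feature taken from each view}."""
    joint = {}
    for idx in product(*(range(len(v)) for v in views)):
        value = 1.0
        for view, i in zip(views, idx):
            value *= view[i]
        joint[idx] = value
    return joint

# Hypothetical measurements for a single subject (not taken from the paper).
clinical = [1.0, 2.0]        # a view with 2 features
imaging = [3.0, 4.0, 5.0]    # a view with 3 features
joint = tensor_product([clinical, imaging])
```

The joint space here has 2 x 3 = 6 entries, which shows why selecting features per view, rather than over the full product, matters as the number of views grows.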

Proceedings. IEEE International Conference on Data Mining, vol. 2014, pp. 40-49. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4415282/pdf/nihms683152.pdf
Citations: 0
RS-Forest: A Rapid Density Estimator for Streaming Anomaly Detection.
Pub Date : 2014-01-01 DOI: 10.1109/ICDM.2014.45
Ke Wu, Kun Zhang, Wei Fan, Andrea Edwards, Philip S Yu

Anomaly detection in streaming data is of high interest in numerous application domains. In this paper, we propose a novel one-class semi-supervised algorithm to detect anomalies in streaming data. Underlying the algorithm is a fast and accurate density estimator implemented by multiple fully randomized space trees (RS-Trees), named RS-Forest. The piecewise constant density estimate of each RS-Tree is defined on the tree node into which an instance falls. Each incoming instance in a data stream is scored by the density estimates averaged over all trees in the forest. Two strategies, statistical attribute range estimation with a high-probability guarantee and dual node profiles for rapid model update, are seamlessly integrated into RS-Forest to systematically address the ever-evolving nature of data streams. We derive the theoretical upper bound for the proposed algorithm and analyze its asymptotic properties via bias-variance decomposition. Empirical comparisons to state-of-the-art methods on multiple benchmark datasets demonstrate that the proposed method features a high detection rate, fast response, and insensitivity to most of the parameter settings. Algorithm implementations and datasets are available upon request.
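
The core idea of scoring an instance by a piecewise constant density averaged over randomly built space trees can be sketched as follows. This is a minimal batch sketch under stated assumptions, not the authors' implementation: the attribute ranges are fixed by hand, and the streaming parts (attribute range estimation, dual node profiles) are omitted. All function names are hypothetical.

```python
import random

def build_tree(ranges, depth, rng):
    """Split the attribute space at random; the training data is never consulted."""
    if depth == 0:
        volume = 1.0
        for lo, hi in ranges:
            volume *= hi - lo
        return {"volume": volume, "count": 0}
    attr = rng.randrange(len(ranges))
    lo, hi = ranges[attr]
    cut = rng.uniform(lo, hi)
    left, right = list(ranges), list(ranges)
    left[attr], right[attr] = (lo, cut), (cut, hi)
    return {"attr": attr, "cut": cut,
            "left": build_tree(left, depth - 1, rng),
            "right": build_tree(right, depth - 1, rng)}

def find_leaf(node, x):
    """Follow the random cuts down to the leaf whose region contains x."""
    while "cut" in node:
        node = node["left"] if x[node["attr"]] < node["cut"] else node["right"]
    return node

def fit_forest(data, ranges, n_trees, depth, rng):
    """Build the trees, then count how many training instances fall in each leaf."""
    forest = [build_tree(ranges, depth, rng) for _ in range(n_trees)]
    for tree in forest:
        for x in data:
            find_leaf(tree, x)["count"] += 1
    return forest

def density_score(forest, x, n):
    """Average over trees of count / (n * leaf volume); low values suggest anomalies."""
    total = 0.0
    for tree in forest:
        leaf = find_leaf(tree, x)
        if leaf["count"] and leaf["volume"] > 0:
            total += leaf["count"] / (n * leaf["volume"])
    return total / len(forest)

rng = random.Random(0)
# Hypothetical "normal" data: a tight grid of points around (0.2, 0.2).
train = [(0.15 + 0.01 * i, 0.15 + 0.01 * j) for i in range(11) for j in range(11)]
forest = fit_forest(train, [(0.0, 1.0), (0.0, 1.0)], n_trees=25, depth=5, rng=rng)
normal_score = density_score(forest, (0.2, 0.2), len(train))
anomaly_score = density_score(forest, (0.9, 0.9), len(train))
```

Because the tree structure does not depend on the data, updating the model only requires changing leaf counts, which is what makes this family of estimators attractive for streams.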

Proceedings. IEEE International Conference on Data Mining, vol. 2014, pp. 600-609. Publication type: Journal Article.
Citations: 89
Learning Protein Folding Energy Functions.
Pub Date : 2011-12-01 DOI: 10.1109/ICDM.2011.88
Wei Guan, Arkadas Ozakin, Alexander Gray, Jose Borreguero, Shashi Pandit, Anna Jagielska, Liliana Wroblewska, Jeffrey Skolnick

A critical open problem in ab initio protein folding is protein energy function design, which pertains to defining the energy of protein conformations in a way that makes folding most efficient and reliable. In this paper, we address this issue as a weight optimization problem and use a machine learning approach, learning-to-rank, to solve it. We investigate the ranking-via-classification approach, especially the RankingSVM method, and compare it with the state-of-the-art approach to the problem, which uses the MINUIT optimization package. To maintain the physicality of the results, we impose non-negativity constraints on the weights. For this we develop two efficient non-negative support vector machine (NNSVM) methods, derived from L2-norm and L1-norm SVMs, respectively. We demonstrate an energy function that more often maintains the correct ordering with respect to structure dissimilarity to the native state, is more efficient and reliable for learning on large protein sets, and is qualitatively superior to the current state-of-the-art energy function.
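
To make the combination of pairwise ranking and non-negative weights concrete, here is a small sketch that minimizes an L2-regularized pairwise hinge loss by projected subgradient descent. The projected-gradient solver is a swapped-in simplification, not the NNSVM algorithms developed in the paper, and the feature vectors (energy terms per structure) are invented for illustration.

```python
def train_nonneg_ranker(pairs, lr=0.1, lam=0.01, epochs=200):
    """pairs: (terms of the structure closer to native, terms of the farther one)."""
    dim = len(pairs[0][0])
    # Difference vectors d: we want w . d >= 1 so the farther structure has higher energy.
    diffs = [[f - c for c, f in zip(closer, farther)] for closer, farther in pairs]
    w = [0.0] * dim
    for _ in range(epochs):
        grad = [lam * wi for wi in w]
        for d in diffs:
            margin = sum(wi * di for wi, di in zip(w, d))
            if margin < 1.0:  # hinge loss is active for this pair
                grad = [g - di for g, di in zip(grad, d)]
        # Gradient step followed by projection onto the non-negative orthant.
        w = [max(0.0, wi - lr * g) for wi, g in zip(w, grad)]
    return w

def energy(w, terms):
    """Energy as a weighted sum of energy terms."""
    return sum(wi * ti for wi, ti in zip(w, terms))

# Hypothetical energy-term vectors; the second term would get a negative
# weight without the constraint, so the projection pins it at zero.
pairs = [([1.0, 1.0], [2.0, 0.5]),
         ([0.0, 2.0], [2.0, 1.0]),
         ([1.0, 0.3], [1.5, 0.2])]
weights = train_nonneg_ranker(pairs)
```

In this toy data the structure closer to the native state ends up with the lower energy in every pair, while no weight is allowed to go negative.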

Proceedings. IEEE International Conference on Data Mining, pp. 1062-1067. Publication type: Journal Article.
Citations: 8
Learning classification with auxiliary probabilistic information.
Pub Date : 2011-01-01 DOI: 10.1109/ICDM.2011.84
Quang Nguyen, Hamed Valizadegan, Milos Hauskrecht

Finding ways of incorporating auxiliary information or auxiliary data into the learning process has been an active topic of data mining and machine learning research in recent years. In this work we study and develop a new framework for classification learning problems in which, in addition to class labels, the learner is provided with auxiliary (probabilistic) information that reflects how strongly the expert feels about the class label. This approach can be extremely useful for many practical classification tasks that rely on subjective label assessment and where the cost of acquiring additional auxiliary information is negligible compared to the cost of example analysis and labelling. We develop classification algorithms capable of using the auxiliary information to make the learning process more efficient in terms of sample complexity. We demonstrate the benefit of the approach on a number of synthetic and real-world data sets by comparing it to learning with class labels only.
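
One simple way to use an expert's probabilistic assessment during training is to fit a logistic model against the probabilities themselves rather than against hard 0/1 labels. The sketch below does this with plain gradient descent on the cross-entropy. It is an assumed baseline for illustration only and is not claimed to be one of the algorithms developed in the paper; the data values are invented.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def fit_soft_logistic(xs, probs, lr=0.1, epochs=2000):
    """Minimize cross-entropy between sigmoid(w*x + b) and the expert probabilities."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        grad_w = sum((sigmoid(w * x + b) - p) * x for x, p in zip(xs, probs)) / n
        grad_b = sum(sigmoid(w * x + b) - p for x, p in zip(xs, probs)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Hypothetical one-dimensional examples with expert-supplied P(class = 1).
xs = [-2.0, -1.0, 0.0, 1.0, 2.0]
expert_probs = [0.05, 0.2, 0.5, 0.8, 0.95]
w, b = fit_soft_logistic(xs, expert_probs)
```

Each soft label carries more information than a single bit, which is the intuition behind the sample-complexity benefit the abstract describes.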

Proceedings. IEEE International Conference on Data Mining, vol. 2011, pp. 477-486. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4190020/pdf/nihms348374.pdf
Citations: 0
Conditional Anomaly Detection with Soft Harmonic Functions.
Pub Date : 2011-01-01 DOI: 10.1109/ICDM.2011.40
Michal Valko, Branislav Kveton, Hamed Valizadegan, Gregory F Cooper, Milos Hauskrecht

In this paper, we consider the problem of conditional anomaly detection that aims to identify data instances with an unusual response or a class label. We develop a new non-parametric approach for conditional anomaly detection based on the soft harmonic solution, with which we estimate the confidence of the label to detect anomalous mislabeling. We further regularize the solution to avoid the detection of isolated examples and examples on the boundary of the distribution support. We demonstrate the efficacy of the proposed method on several synthetic and UCI ML datasets in detecting unusual labels when compared to several baseline approaches. We also evaluate the performance of our method on a real-world electronic health record dataset where we seek to identify unusual patient-management decisions.
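
The idea of estimating label confidence from a regularized harmonic solution on a similarity graph can be sketched as below. The objective used here, a weighted squared fit to the given labels plus a smoothness term with the graph Laplacian L and a small ridge term gamma, is an assumed form chosen for illustration; the paper's exact regularization and weighting may differ. The solver is a plain Gauss-Seidel iteration and the toy graph is invented.

```python
def soft_harmonic(weights, labels, fit=1.0, gamma=0.1, sweeps=200):
    """Gauss-Seidel solve of (fit*I + L + gamma*I) f = fit * y on a similarity graph."""
    n = len(labels)
    f = [0.0] * n
    for _ in range(sweeps):
        for i in range(n):
            degree = sum(weights[i][j] for j in range(n) if j != i)
            neighbor_sum = sum(weights[i][j] * f[j] for j in range(n) if j != i)
            f[i] = (fit * labels[i] + neighbor_sum) / (fit + degree + gamma)
    return f

# Four mutually similar instances; the last one carries a label that disagrees.
weights = [[0.0 if i == j else 1.0 for j in range(4)] for i in range(4)]
labels = [1.0, 1.0, 1.0, -1.0]
soft = soft_harmonic(weights, labels)
# A label that the soft solution contradicts gets a positive anomaly score.
scores = [-y * f for y, f in zip(labels, soft)]
```

In this toy graph the soft label of the fourth instance is pulled to the positive side by its neighbors, so it is the only instance whose given label conflicts with the propagated one.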

Proceedings. IEEE International Conference on Data Mining, vol. 2011, pp. 735-743. Publication type: Journal Article. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4189186/pdf/nihms348373.pdf
Citations: 0
Anomaly Detection Using an Ensemble of Feature Models.
Pub Date : 2010-12-13 DOI: 10.1109/ICDM.2010.140
Keith Noto, Carla Brodley, Donna Slonim

We present a new approach to semi-supervised anomaly detection. Given a set of training examples believed to come from the same distribution or class, the task is to learn a model that will be able to distinguish examples in the future that do not belong to the same class. Traditional approaches typically compare the position of a new data point to the set of "normal" training data points in a chosen representation of the feature space. For some data sets, the normal data may not have discernible positions in feature space, but do have consistent relationships among some features that fail to appear in the anomalous examples. Our approach learns to predict the values of training set features from the values of other features. After we have formed an ensemble of predictors, we apply this ensemble to new data points. To combine the contribution of each predictor in our ensemble, we have developed a novel, information-theoretic anomaly measure that our experimental results show selects against noisy and irrelevant features. Our results on 47 data sets show that for most data sets, this approach significantly improves performance over current state-of-the-art feature space distance and density-based approaches.
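
The approach of predicting each feature from the others and flagging instances whose observed values are surprising can be sketched as follows. This is a simplified stand-in: each predictor is a one-variable least-squares line with a Gaussian error model, and the combined score is the summed negative log-likelihood. The paper's actual predictors and its information-theoretic measure are not reproduced, and the data is synthetic.

```python
import math

def fit_feature_models(data):
    """For every ordered pair (target, source), fit target ~ slope*source + intercept."""
    dim = len(data[0])
    models = []
    for target in range(dim):
        for source in range(dim):
            if source == target:
                continue
            xs = [row[source] for row in data]
            ys = [row[target] for row in data]
            mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
            sxx = sum((x - mx) ** 2 for x in xs)
            slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
            intercept = my - slope * mx
            resid = [y - (slope * x + intercept) for x, y in zip(xs, ys)]
            # Residual spread of the predictor, floored to avoid division by zero.
            sigma = max(1e-6, math.sqrt(sum(r * r for r in resid) / len(resid)))
            models.append((target, source, slope, intercept, sigma))
    return models

def surprisal(models, row):
    """Sum of -log p(observed target | prediction) over all feature models."""
    total = 0.0
    for target, source, slope, intercept, sigma in models:
        err = row[target] - (slope * row[source] + intercept)
        total += 0.5 * (err / sigma) ** 2 + math.log(sigma * math.sqrt(2 * math.pi))
    return total

# Synthetic "normal" data: the second feature is roughly twice the first.
train = [(i / 10, 2 * i / 10 + 0.05 * (-1) ** i) for i in range(-10, 11)]
models = fit_feature_models(train)
normal_score = surprisal(models, (0.5, 1.0))
anomaly_score = surprisal(models, (0.5, -1.0))
```

The anomalous point lies within the range of each feature taken separately, so a purely position-based detector could miss it; it is only the broken relationship between the features that makes it surprising.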

Citations: 0
Abstraction Augmented Markov Models.
Pub Date : 2010-12-13 DOI: 10.1109/ICDM.2010.158
Cornelia Caragea, Adrian Silvescu, Doina Caragea, Vasant Honavar

High accuracy sequence classification often requires the use of higher order Markov models (MMs). However, the number of MM parameters increases exponentially with the range of direct dependencies between sequence elements, thereby increasing the risk of overfitting when the data set is limited in size. We present abstraction augmented Markov models (AAMMs) that effectively reduce the number of numeric parameters of kth order MMs by successively grouping strings of length k (i.e., k-grams) into abstraction hierarchies. We evaluate AAMMs on three protein subcellular localization prediction tasks. The results of our experiments show that abstraction makes it possible to construct predictive models that use a significantly smaller number of features (by one to three orders of magnitude) as compared to MMs. AAMMs are competitive with and, in some cases, significantly outperform MMs. Moreover, the results show that AAMMs often perform significantly better than variable order Markov models, such as decomposed context tree weighting, prediction by partial match, and probabilistic suffix trees.
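The exponential growth mentioned in this abstract can be made concrete with a short calculation. This is a sketch under stated assumptions: it counts only the free transition probabilities of a k-th order Markov model (each of the |Σ|^k contexts carries a distribution over |Σ| symbols, hence |Σ| - 1 free values), and it assumes that grouping k-grams into m abstractions leaves one such distribution per abstraction. The function names and the choice of m = 100 are hypothetical, not taken from the paper.

```python
def mm_free_parameters(alphabet_size: int, k: int) -> int:
    """Free transition parameters of a k-th order Markov model.

    There are alphabet_size ** k contexts (k-grams), and each context has a
    next-symbol distribution with alphabet_size - 1 free probabilities.
    """
    return alphabet_size ** k * (alphabet_size - 1)

def abstracted_free_parameters(alphabet_size: int, num_abstractions: int) -> int:
    """Same count when k-grams are merged into num_abstractions groups
    (assumed: one next-symbol distribution per group)."""
    return num_abstractions * (alphabet_size - 1)

AMINO_ACIDS = 20  # size of the standard protein alphabet

for k in (1, 2, 3):
    print(k, mm_free_parameters(AMINO_ACIDS, k))

# Hypothetical example: 8000 3-grams merged into 100 abstractions.
print(abstracted_free_parameters(AMINO_ACIDS, 100))
```

With a 20-letter alphabet, the count grows by a factor of 20 for every extra position of context, which is why merging k-grams into a much smaller set of groups can shrink the model so sharply.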

Citations: 0