Statistical Analysis and Data Mining: Latest Articles

Mining Compressing Sequential Patterns

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2014-02-01 DOI: 10.1002/sam.11192

Hoang Thanh Lam, F. Mörchen, Dmitriy Fradkin, T. Calders

Pattern mining based on data compression has been successfully applied in many data mining tasks. For itemset data, the Krimp algorithm, based on the minimum description length (MDL) principle, was shown to be very effective in solving the redundancy issue in descriptive pattern mining. However, for sequence data, the redundancy issue of the set of frequent sequential patterns is not fully addressed in the literature. In this article, we study MDL-based algorithms for mining non-redundant sets of sequential patterns from a sequence database. First, we propose an encoding scheme for compressing sequence data with sequential patterns. Second, we formulate the problem of mining the most compressing sequential patterns from a sequence database. We show that this problem is intractable and belongs to the class of inapproximable problems. Therefore, we propose two heuristic algorithms. The first of these uses a two-phase approach similar to Krimp for itemset data. To overcome performance issues in candidate generation, we also propose GoKrimp, an algorithm that directly mines compressing patterns by greedily extending a pattern until adding the extension to the dictionary yields no additional compression benefit. Since checks for the additional compression benefit of an extension are computationally expensive, we propose a dependency test that chooses only related events for extending a given pattern. This technique improves the efficiency of the GoKrimp algorithm significantly while still preserving the quality of the set of patterns. We conduct an empirical study on eight datasets to show the effectiveness of our approach in comparison to state-of-the-art algorithms in terms of interpretability of the extracted patterns, run time, compression ratio, and classification accuracy using the discovered patterns as features for different classifiers. © 2013 Wiley Periodicals, Inc. Statistical Analysis and Data Mining, 2013
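
The greedy "extend while compression improves" idea can be sketched with a deliberately simplified toy. It is not the encoding or the GoKrimp algorithm from the article: it assumes contiguous patterns (no gaps), a unit cost per symbol, a single dictionary entry, and no dependency test. The names `description_length` and `greedy_extend` are hypothetical.

```python
def description_length(seq, pattern):
    """Toy cost: symbols left after replacing each occurrence of the
    pattern with one placeholder, plus the symbols needed to store it."""
    encoded = seq.replace(pattern, "#")  # "#" is assumed absent from seq
    return len(encoded) + len(pattern)


def greedy_extend(seq, seed):
    """Extend the seed one symbol at a time while the toy cost decreases."""
    alphabet = sorted(set(seq))
    best, best_cost = seed, description_length(seq, seed)
    while True:
        # Try every one-symbol extension and keep the cheapest one.
        candidates = [best + ch for ch in alphabet]
        cand = min(candidates, key=lambda p: description_length(seq, p))
        cand_cost = description_length(seq, cand)
        if cand_cost >= best_cost:
            # No extension compresses better: stop.
            return best, best_cost
        best, best_cost = cand, cand_cost


seq = "abcxabcyabczab"
pattern, cost = greedy_extend(seq, "a")
print(pattern, cost, len(seq))
```

On this toy input the search grows `a` into `abc` and then stops, because every four-symbol extension costs more to store than it saves.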

Citations: 29

Multi-transfer: Transfer learning with multiple views and multiple sources

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2014-01-01 DOI: 10.1137/1.9781611972832.27

Ben Tan, Erheng Zhong, E. Xiang, Qiang Yang

Transfer learning, which aims to help the learning task in a target domain by leveraging knowledge from auxiliary domains, has been demonstrated to be effective in applications such as text mining and sentiment analysis. In many real-world applications, auxiliary data are described from multiple perspectives and usually carried by multiple sources. For example, to help classify videos on YouTube, which include three views/perspectives (image, voice, and subtitles), one may borrow data from Flickr, Last.FM, and Google News. Although any single instance in these domains can cover only part of the views available on YouTube, the pieces of information they carry may complement each other. In this paper, we define this transfer learning problem as Transfer Learning with Multiple Views and Multiple Sources. As different sources may have different probability distributions and different views may be complementary or inconsistent with each other, merging all data in a simplistic manner will not give optimal results. Thus, we propose a novel algorithm to leverage knowledge from different views and sources collaboratively, by letting different views from different sources complement each other through a co-training style framework while correcting for the distribution differences across domains. We conduct empirical studies on several real-world datasets to show that the proposed approach can improve classification accuracy by up to 8% against different state-of-the-art baselines.

Citations: 46

A Weighted Random Forests Approach to Improve Predictive Performance.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2013-12-01 DOI: 10.1002/sam.11196

Stacey J Winham, Robert R Freimuth, Joanna M Biernacka

Identifying genetic variants associated with complex disease in high-dimensional data is a challenging problem, and complicated etiologies such as gene-gene interactions are often ignored in analyses. The data-mining method Random Forests (RF) can handle high-dimensional data; however, in such data, RF is not an effective filter for identifying risk factors associated with the disease trait via complex genetic models such as gene-gene interactions without strong marginal components. Here we propose an extension called Weighted Random Forests (wRF), which incorporates tree-level weights to emphasize more accurate trees in prediction and calculation of variable importance. We demonstrate through simulation and application to data from a genetic study of addiction that wRF can outperform RF in high-dimensional data, although the improvements are modest and limited to situations with effect sizes that are larger than what is realistic in genetics of complex disease. Thus, the current implementation of wRF is unlikely to improve detection of relevant predictors in high-dimensional genetic data, but may be applicable in other situations where larger effect sizes are anticipated.
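
The tree-level weighting idea can be illustrated with a small sketch. It assumes each tree's weight is simply its accuracy on held-out samples and that predictions are combined by a weighted majority vote; the abstract does not specify the weighting scheme the authors use, and the names `tree_weights` and `weighted_vote` are hypothetical.

```python
from collections import defaultdict


def tree_weights(tree_preds, y_true):
    """Weight each tree by its accuracy on held-out labels (assumed scheme)."""
    weights = []
    for preds in tree_preds:
        correct = sum(p == y for p, y in zip(preds, y_true))
        weights.append(correct / len(y_true))
    return weights


def weighted_vote(votes, weights):
    """Combine one sample's per-tree class votes using tree-level weights."""
    totals = defaultdict(float)
    for vote, w in zip(votes, weights):
        totals[vote] += w
    return max(totals, key=totals.get)


# Three toy "trees" predicting four held-out samples.
held_out_preds = [
    [1, 0, 1, 1],  # tree A: 4/4 correct
    [0, 0, 0, 1],  # tree B: 2/4 correct
    [0, 1, 0, 1],  # tree C: 1/4 correct
]
y_held_out = [1, 0, 1, 1]
w = tree_weights(held_out_preds, y_held_out)

# For a new sample, trees B and C outvote tree A under equal weights,
# but the more accurate tree A dominates the weighted vote.
new_votes = [1, 0, 0]
print(weighted_vote(new_votes, [1.0, 1.0, 1.0]))
print(weighted_vote(new_votes, w))
```

With equal weights the two weaker trees decide the class; with accuracy weights the single most accurate tree does, which is the behaviour tree-level weighting is meant to encourage.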

Citations: 81

Penalized Regression and Risk Prediction in Genome-Wide Association Studies.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2013-08-01 DOI: 10.1002/sam.11183

Erin Austin, Wei Pan, Xiaotong Shen

An important task in personalized medicine is to predict disease risk based on a person's genome, e.g., on a large number of single-nucleotide polymorphisms (SNPs). Genome-wide association studies (GWAS) make SNP and phenotype data available to researchers. A critical question for researchers is how best to predict disease risk. Penalized regression equipped with variable selection, such as LASSO and SCAD, is deemed to be promising in this setting. However, the sparsity assumption made by LASSO, SCAD, and many other penalized regression techniques may not be applicable here: it is now hypothesized that many common diseases are associated with many SNPs with small to moderate effects. In this article, we use the GWAS data from the Wellcome Trust Case Control Consortium (WTCCC) to investigate the performance of various unpenalized and penalized regression approaches under true sparse or non-sparse models. We find that, in general, penalized regression outperformed unpenalized regression; SCAD, TLP, and LASSO performed best for sparse models, while elastic net regression was the winner, followed by ridge, TLP, and LASSO, for non-sparse models.
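
To make the sparsity distinction concrete, the sketch below contrasts how LASSO and ridge penalties shrink coefficients in the simplest setting, an orthonormal design, where each penalized estimate is a closed-form function of the least-squares estimate `z`. The function names, the toy estimates, and the parameterization (squared-error loss scaled by one half, penalty weight `lam`) are assumptions for illustration; this is not the fitting procedure used in the article.

```python
def lasso_shrink(z, lam):
    """Soft-thresholding: minimizer of 0.5*(z - b)**2 + lam*abs(b)."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def ridge_shrink(z, lam):
    """Proportional shrinkage: minimizer of 0.5*(z - b)**2 + 0.5*lam*b**2."""
    return z / (1.0 + lam)


# Toy least-squares estimates: two larger effects and two small ones.
ols = [3.0, 0.4, -0.2, -1.5]
lam = 0.5

lasso = [lasso_shrink(z, lam) for z in ols]
ridge = [ridge_shrink(z, lam) for z in ols]
print(lasso)
print(ridge)
```

The LASSO rule sets the two small effects exactly to zero, while the ridge rule keeps every coefficient and only scales it down. That difference is one intuition for why ridge-type penalties may cope better when many SNPs each contribute a small effect.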

Citations: 0

Regularized Partial Least Squares with an Application to NMR Spectroscopy.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2013-08-01 DOI: 10.1002/sam.11169

Genevera I Allen, Christine Peterson, Marina Vannucci, Mirjana Maletić-Savatić

High-dimensional data common in genomics, proteomics, and chemometrics often contain complicated correlation structures. Recently, partial least squares (PLS) and Sparse PLS methods have gained attention in these areas as dimension reduction techniques in the context of supervised data analysis. We introduce a framework for Regularized PLS by solving a relaxation of the SIMPLS optimization problem with penalties on the PLS loading vectors. Our approach enjoys many advantages including flexibility, general penalties, easy interpretation of results, and fast computation in high-dimensional settings. We also outline extensions of our methods leading to novel methods for non-negative PLS and generalized PLS, an adaptation of PLS for structured data. We demonstrate the utility of our methods through simulations and a case study on proton Nuclear Magnetic Resonance (NMR) spectroscopy data.

Citations: 45

AAPL: Assessing Association between P-value Lists.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2013-04-01 DOI: 10.1002/sam.11180

Tianwei Yu, Yize Zhao, Shihao Shen

Joint analyses of high-throughput datasets generate the need to assess the association between two long lists of p-values. In such p-value lists, the vast majority of the features are insignificant. Ideally, contributions of features that are null in both tests should be minimized. However, by random chance their p-values are uniformly distributed between zero and one, and weak correlations of the p-values may exist due to inherent biases in the high-throughput technology used to generate the multiple datasets. Rank-based agreement tests may capture such unwanted effects. Testing contingency tables generated using hard cutoffs may be sensitive to arbitrary threshold choice. We develop a novel method based on feature-level concordance using local false discovery rate. The association score enjoys straightforward interpretation. The method shows higher statistical power to detect association between p-value lists in simulations. We demonstrate its utility using real data analysis. The R implementation of the method is available at http://userwww.service.emory.edu/~tyu8/AAPL/.

Citations: 1

Predicting Simulation Parameters of Biological Systems Using a Gaussian Process Model.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2012-12-01 DOI: 10.1002/sam.11163

Xiangxin Zhu, Max Welling, Fang Jin, John Lowengrub

Finding optimal parameters for simulating biological systems is usually a very difficult and expensive task in systems biology. Brute force searching is infeasible in practice because of the huge (often infinite) search space. In this article, we propose predicting the parameters efficiently by learning the relationship between system outputs and parameters using regression. However, conventional parametric regression models suffer from two issues and thus are not applicable to this problem. First, restricting the regression function to a certain fixed type (e.g., linear or polynomial) introduces overly strong assumptions that reduce model flexibility. Second, conventional regression models fail to take into account the fact that a fixed parameter value may correspond to multiple different outputs due to the stochastic nature of most biological simulations and the existence of a potentially large number of other factors that affect the simulation outputs. We propose a novel approach based on a Gaussian process model that addresses the two issues jointly. We apply our approach to a tumor vessel growth model and the feedback Wright-Fisher model. The experimental results show that our method can predict the parameter values of both models with high accuracy.

Citations: 9

Maximum Likelihood Estimation Over Directed Acyclic Gaussian Graphs.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2012-12-01 DOI: 10.1002/sam.11168

Yiping Yuan, Xiaotong Shen, Wei Pan

Estimation of multiple directed graphs becomes challenging in the presence of inhomogeneous data, where directed acyclic graphs (DAGs) are used to represent causal relations among random variables. To infer causal relations among variables, we estimate multiple DAGs given a known ordering in Gaussian graphical models. In particular, we propose a constrained maximum likelihood method with nonconvex constraints over elements and element-wise differences of adjacency matrices, for identifying the sparseness structure as well as detecting structural changes over adjacency matrices of the graphs. Computationally, we develop an efficient algorithm based on augmented Lagrange multipliers, the difference convex method, and a novel fast algorithm for solving convex relaxation subproblems. Numerical results suggest that the proposed method performs well against its alternatives for simulated and real data.

Citations: 0

Multiple Response Regression for Gaussian Mixture Models with Known Labels.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2012-12-01 DOI: 10.1002/sam.11158

Wonyul Lee, Ying Du, Wei Sun, D Neil Hayes, Yufeng Liu

Multiple response regression is a useful regression technique to model multiple response variables using the same set of predictor variables. Most existing methods for multiple response regression are designed for modeling homogeneous data. In many applications, however, one may have heterogeneous data where the samples are divided into multiple groups. Our motivating example is a cancer dataset where the samples belong to multiple cancer subtypes. In this paper, we consider modeling the data coming from a mixture of several Gaussian distributions with known group labels. A naive approach is to split the data into several groups according to the labels and model each group separately. Although it is simple, this approach ignores potential common structures across different groups. We propose new penalized methods to model all groups jointly in which the common and unique structures can be identified. The proposed methods estimate the regression coefficient matrix, as well as the conditional inverse covariance matrix of response variables. Asymptotic properties of the proposed methods are explored. Through numerical examples, we demonstrate that both estimation and prediction can be improved by modeling all groups jointly using the proposed methods. An application to a glioblastoma cancer dataset reveals some interesting common and unique gene relationships across different cancer subtypes.

Citations: 0

Nonlinear Vertex Discriminant Analysis with Reproducing Kernels.

IF 1.3 · Zone 4 (Mathematics) · Q2 Mathematics

Statistical Analysis and Data Mining

Pub Date : 2012-04-01 DOI: 10.1002/sam.11137

Tong Tong Wu, Yichao Wu

The novel supervised learning method of vertex discriminant analysis (VDA) has demonstrated good performance in multicategory classification. The current paper explores an elaboration of VDA for nonlinear discrimination. By incorporating reproducing kernels, VDA can be generalized from linear discrimination to nonlinear discrimination. Our numerical experiments show that the new reproducing kernel-based method leads to accurate classification for both linear and nonlinear cases.
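
As a minimal illustration of the kernel ingredient only, the sketch below builds a Gaussian (RBF) kernel matrix, one common choice of reproducing kernel. The kernel choice, the bandwidth parameter `gamma`, and the function names are assumptions; the abstract does not say which kernels the authors use, and the VDA loss itself is not shown.

```python
import math


def rbf_kernel(x, y, gamma=1.0):
    """Gaussian kernel k(x, y) = exp(-gamma * ||x - y||^2)."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, y))
    return math.exp(-gamma * sq_dist)


def kernel_matrix(points, gamma=1.0):
    """Pairwise kernel (Gram) matrix for a list of points."""
    return [[rbf_kernel(p, q, gamma) for q in points] for p in points]


points = [(0.0, 0.0), (1.0, 0.0), (0.0, 2.0)]
K = kernel_matrix(points, gamma=0.5)
for row in K:
    print([round(v, 4) for v in row])
```

A kernel method replaces inner products between feature vectors with entries of such a matrix, which is how a linear discriminant rule can produce nonlinear decision boundaries in the original feature space.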

Citations: 0
