
Latest Publications in Statistical Analysis and Data Mining

Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2018-08-01 · Epub Date: 2018-06-19 · DOI: 10.1002/sam.11381
Tianwei Yu

We present a method of variable selection for the sparse generalized additive model. The method does not assume any specific functional form and can select from a large number of candidates. It takes the form of incremental forward stagewise regression. Because no functional form is assumed, we devised an approach termed "roughening" to adjust the residuals in the iterations. In simulations, we show that the new method is competitive with popular machine learning approaches. We also demonstrate its performance using some real datasets. The method is available as part of the nlnet package on CRAN (https://cran.r-project.org/package=nlnet).
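The abstract names the generic scaffold, incremental forward stagewise regression, without code. As a point of reference only, here is a minimal linear version of that scaffold in Python; the paper's actual contribution, the nonparametric "roughening" of residuals, is not reproduced here, and the function name and toy data are illustrative:

```python
import numpy as np

def forward_stagewise(X, y, step=0.01, n_iter=500):
    """Incremental forward stagewise regression (plain linear sketch).

    Each iteration finds the predictor most correlated with the
    current residual and nudges its coefficient by a small step.
    """
    n, p = X.shape
    beta = np.zeros(p)
    resid = y - y.mean()
    for _ in range(n_iter):
        corr = X.T @ resid              # alignment with the residual
        j = np.argmax(np.abs(corr))     # best predictor this round
        delta = step * np.sign(corr[j])
        beta[j] += delta
        resid -= delta * X[:, j]        # update residuals incrementally
    return beta

# Toy usage: only the first three of twenty predictors matter.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 20))
y = X[:, 0] - 2 * X[:, 1] + 0.5 * X[:, 2] + 0.1 * rng.standard_normal(200)
print(np.round(forward_stagewise(X, y), 2))
```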

Citations: 0
The next-generation K-means algorithm.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2018-08-01 · Epub Date: 2018-05-11 · DOI: 10.1002/sam.11379
Eugene Demidenko

Typically, "model-based classification" is understood to mean the mixture distribution approach. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993), under which K-means is equivalent to maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end once the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how should multilevel data be classified? The statistical model-based approach to the K-means algorithm is the key, because it allows statistical simulations and the study of the properties of classification along the track of classical statistics. This paper illustrates the application of the ML classification to testing the no-clusters hypothesis, to studying various methods for selecting the number of clusters using simulations, to robust clustering using the Laplace distribution, to studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.
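To make the "K-means as ML estimation" connection concrete, the sketch below scores K-means fits with a BIC derived from a hard-classification spherical Gaussian likelihood with one shared variance. This is one textbook-style criterion for choosing the number of clusters, not necessarily the tests and criteria developed in the paper:

```python
import numpy as np
from sklearn.cluster import KMeans

def kmeans_bic(X, k):
    """BIC for K-means under a hard-classification spherical Gaussian
    model with a common variance (an illustrative choice only)."""
    n, d = X.shape
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    wss = km.inertia_                       # within-cluster sum of squares
    sigma2 = wss / (n * d)                  # ML estimate of the shared variance
    loglik = -0.5 * n * d * (np.log(2 * np.pi * sigma2) + 1)
    n_params = k * d + 1                    # k mean vectors plus one variance
    return -2 * loglik + n_params * np.log(n)

# Two well-separated Gaussian blobs: BIC should bottom out at k = 2.
rng = np.random.default_rng(1)
X = np.vstack([rng.normal(0, 1, (100, 2)), rng.normal(5, 1, (100, 2))])
for k in range(1, 5):
    print(k, round(kmeans_bic(X, k), 1))
```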

Citations: 0
Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2018-02-01 · Epub Date: 2017-12-06 · DOI: 10.1002/sam.11366
Hien D Nguyen, Jeremy F P Ullmann, Geoffrey J McLachlan, Venkatakaushik Voleti, Wenze Li, Elizabeth M C Hillman, David C Reutens, Andrew L Janke

Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques enable visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology via Gaussian mixtures is proposed for clustering data from such visualizations. The methodology is theoretically justified, and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.
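As a rough illustration of mixture-model clustering of time series (not the paper's functional-data formulation or its efficient estimator), one can fit a plain Gaussian mixture directly to toy traces; the diagonal covariance is an assumption made here to keep the sketch small:

```python
import numpy as np
from sklearn.mixture import GaussianMixture

# Toy stand-in for voxel-wise calcium traces: rows are voxels,
# columns are time points (the real volumes are far larger).
rng = np.random.default_rng(2)
t = np.linspace(0, 2 * np.pi, 50)
slow = np.sin(t) + 0.3 * rng.standard_normal((150, 50))      # slow oscillators
fast = np.sin(4 * t) + 0.3 * rng.standard_normal((150, 50))  # fast oscillators
traces = np.vstack([slow, fast])

gmm = GaussianMixture(n_components=2, covariance_type="diag",
                      random_state=0).fit(traces)
labels = gmm.predict(traces)
print(np.bincount(labels))   # roughly 150 voxels per recovered cluster
```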

Citations: 6
Random Forest Missing Data Algorithms.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2017-12-01 · Epub Date: 2017-06-13 · DOI: 10.1002/sam.11348
Fei Tang, Hemant Ishwaran

Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance on their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting; the latter class represents a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
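A missForest-style loop, the last of the algorithm families mentioned above, is easy to sketch: initialize missing cells, then cycle through columns, regressing each on the others with a random forest. The function below is a simplified numeric-only sketch under a fixed iteration count, not the exact algorithms benchmarked in the paper:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor

def rf_impute(X, n_iter=5):
    """missForest-style imputation sketch for a numeric matrix:
    fill missing cells with column means, then repeatedly regress
    each incomplete column on all the other (current) columns."""
    X = X.copy()
    miss = np.isnan(X)
    col_means = np.nanmean(X, axis=0)
    for j in range(X.shape[1]):
        X[miss[:, j], j] = col_means[j]          # crude initialization
    for _ in range(n_iter):
        for j in np.where(miss.any(axis=0))[0]:
            obs, mis = ~miss[:, j], miss[:, j]
            rf = RandomForestRegressor(n_estimators=100, random_state=0)
            rf.fit(np.delete(X[obs], j, axis=1), X[obs, j])
            X[mis, j] = rf.predict(np.delete(X[mis], j, axis=1))
    return X

# Toy usage: correlated columns, 10% of cells missing completely at random.
rng = np.random.default_rng(3)
Z = rng.standard_normal((300, 4))
Z[:, 3] = Z[:, 0] + 0.1 * rng.standard_normal(300)
Z[rng.random((300, 4)) < 0.1] = np.nan
print(np.isnan(rf_impute(Z)).sum())   # 0: every cell has been filled
```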

Citations: 377
Use and Communication of Probabilistic Forecasts.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2016-12-01 · Epub Date: 2016-02-23 · DOI: 10.1002/sam.11302
Adrian E Raftery

Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? I review experience with five problems where probabilistic forecasting played an important role. This leads me to identify five types of potential users: Low Stakes Users, who don't need probabilistic forecasts; General Assessors, who need an overall idea of the uncertainty in the forecast; Change Assessors, who need to know if a change is out of line with expectations; Risk Avoiders, who wish to limit the risk of an adverse outcome; and Decision Theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and to consider their goals. The cognitive research tells us that calibration is important for trust in probability forecasts, and that it is important to match the verbal expression with the task. The cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest often seem to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role, but in a limited range of applications.
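The two summaries recommended at the end, probabilities of adverse events and percentiles of the predictive distribution, are one-liners once a forecast is represented as an ensemble of equally likely draws. A toy illustration (the numbers and the freezing-point threshold are invented):

```python
import numpy as np

# An ensemble of 1000 equally likely temperature forecasts (toy data).
rng = np.random.default_rng(4)
ensemble = rng.normal(loc=2.0, scale=1.5, size=1000)

# A single percentile, e.g. for General Assessors or Risk Avoiders:
p10 = np.percentile(ensemble, 10)        # lower bound at the 10% level
# Probability of an adverse event, here temperature below freezing:
p_freeze = (ensemble < 0.0).mean()
print(f"10th percentile: {p10:.2f} C, P(freeze) = {p_freeze:.2f}")
```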

Citations: 49
Hierarchical Models for Multiple, Rare Outcomes Using Massive Observational Healthcare Databases.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2016-08-01 · Epub Date: 2016-07-17 · DOI: 10.1002/sam.11324
Trevor R Shaddox, Patrick B Ryan, Martijn J Schuemie, David Madigan, Marc A Suchard

Clinical trials often lack power to identify rare adverse drug events (ADEs) and therefore cannot address the threat rare ADEs pose, motivating the need for new ADE detection techniques. Emerging national patient claims and electronic health record databases have inspired post-approval early detection methods like the Bayesian self-controlled case series (BSCCS) regression model. Existing BSCCS models do not account for multiple outcomes, where pathology may be shared across different ADEs. We integrate a pathology hierarchy into the BSCCS model by developing a novel informative hierarchical prior linking outcome-specific effects. Considering shared pathology drastically increases the dimensionality of the already massive models in this field. We develop an efficient method for coping with the dimensionality expansion by reducing the hierarchical model to a form amenable to existing tools. Through a synthetic study we demonstrate decreased bias in risk estimates for drugs when using conditions with different true risk and unequal prevalence. We also examine observational data from the MarketScan Lab Results dataset, exposing the bias that results from aggregating outcomes, as previously employed to estimate risk trends of warfarin and dabigatran for intracranial hemorrhage and gastrointestinal bleeding. We further investigate the limits of our approach by using extremely rare conditions. This research demonstrates that analyzing multiple outcomes simultaneously is feasible at scale and beneficial.
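The core intuition of the hierarchical prior, pulling a drug's outcome-specific estimates toward a shared mean, can be conveyed with a simple precision-weighted shrinkage formula. This toy version ignores the BSCCS likelihood and all of the scalability machinery the paper is actually about; the numbers and the prior variance are invented:

```python
import numpy as np

def shrink_effects(beta_hat, se, tau2):
    """Hierarchical shrinkage sketch: per-outcome effect estimates for
    one drug are pulled toward their precision-weighted common mean,
    with weights set by sampling variance versus prior variance tau2."""
    mu = np.average(beta_hat, weights=1.0 / (se**2 + tau2))
    w = tau2 / (tau2 + se**2)          # how much to trust each outcome
    return w * beta_hat + (1 - w) * mu

# Toy log-rate-ratios of one drug for three related bleeding outcomes;
# rarer outcomes carry larger standard errors and get shrunk harder.
beta_hat = np.array([0.9, 0.2, 1.5])
se = np.array([0.3, 0.8, 1.0])
print(shrink_effects(beta_hat, se, tau2=0.25).round(2))
```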

Citations: 0
Nonlinear Joint Latent Variable Models and Integrative Tumor Subtype Discovery.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2016-04-01 · Epub Date: 2016-03-28 · DOI: 10.1002/sam.11306
Binghui Liu, Xiaotong Shen, Wei Pan

Integrative analysis has been used to identify clusters by integrating data of disparate types, such as deoxyribonucleic acid (DNA) copy number alterations and DNA methylation changes, for discovering novel subtypes of tumors. Most existing integrative analysis methods are based on joint latent variable models, which are generally divided into two classes: joint factor analysis and joint mixture modeling, with continuous and discrete parameterizations of the latent variables, respectively. Despite recent progress, many issues remain. In particular, existing integration methods based on joint factor analysis may be inadequate to model multiple clusters due to the unimodality of the assumed Gaussian distribution, while those based on joint mixture modeling may lack the ability to perform dimension reduction and/or feature selection. In this paper, we employ a nonlinear joint latent variable model to allow for flexible modeling that can account for multiple clusters as well as conduct dimension reduction and feature selection. We propose a method, called integrative and regularized generative topographic mapping (irGTM), to perform simultaneous dimension reduction across multiple types of data while achieving feature selection separately for each data type. Simulations are performed to examine the operating characteristics of the methods, in which the proposed method compares favorably against the popular iCluster, which is based on a linear joint latent variable model. Finally, a glioblastoma multiforme (GBM) dataset is examined.
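As a loose illustration of the joint-latent-variable idea (emphatically not irGTM, which is nonlinear and performs per-data-type feature selection), one can standardize two data types measured on the same samples, embed their concatenation in a shared low-dimensional space, and cluster that embedding:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.mixture import GaussianMixture
from sklearn.preprocessing import StandardScaler

# Two toy data types measured on the same 200 tumors, driven by a
# hidden two-level subtype z (all values invented for illustration).
rng = np.random.default_rng(5)
z = rng.integers(0, 2, 200)
copy_number = rng.normal(z[:, None], 1.0, (200, 30))
methylation = rng.normal(2 * z[:, None], 1.5, (200, 40))

# Crude joint latent variable: standardize each type, concatenate,
# map to a shared low-dimensional space, and cluster with a mixture.
joint = np.hstack([StandardScaler().fit_transform(copy_number),
                   StandardScaler().fit_transform(methylation)])
latent = PCA(n_components=2, random_state=0).fit_transform(joint)
labels = GaussianMixture(n_components=2, random_state=0).fit_predict(latent)
acc = (labels == z).mean()
print(max(acc, 1 - acc))   # agreement with z, up to label switching
```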

Citations: 1
Composite large margin classifiers with latent subclasses for heterogeneous biomedical data.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2016-04-01 · Epub Date: 2016-01-08 · DOI: 10.1002/sam.11300
Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R Kosorok

High dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite the large number of candidate classification techniques available, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have a higher tendency to overfit. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the Composite Large Margin Classifier (CLM), to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part being classified by a different linear classifier. Our method has prediction accuracy comparable to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers in Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.
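The composite structure, one linear gate plus two part-specific linear classifiers, can be conveyed with a one-pass heuristic: cluster, fit a linear gate to the clusters, then fit a linear classifier on each side of the gate. The paper instead learns all three functions jointly under a large-margin objective; everything below, including the function names and data, is an illustrative stand-in:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.linear_model import LogisticRegression

def clm_sketch(X, y):
    """One-pass sketch of the composite structure: a linear gate
    splits the data into two parts, and each part gets its own
    linear classifier (assumes each part contains both classes)."""
    side = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
    gate = LogisticRegression().fit(X, side)   # linear split function
    side = gate.predict(X)                     # commit to the linear split
    experts = {s: LogisticRegression().fit(X[side == s], y[side == s])
               for s in (0, 1)}
    return gate, experts

def clm_predict(gate, experts, X):
    side = gate.predict(X)
    out = np.empty(len(X), dtype=int)
    for s in (0, 1):
        if (side == s).any():
            out[side == s] = experts[s].predict(X[side == s])
    return out

# Two latent subgroups with opposite class boundaries: a single linear
# classifier does no better than chance here, but the composite does.
rng = np.random.default_rng(6)
X0 = rng.standard_normal((200, 2)) + [3, 0]    # subgroup A
X1 = rng.standard_normal((200, 2)) - [3, 0]    # subgroup B
y = np.concatenate([X0[:, 1] > 0, X1[:, 1] < 0]).astype(int)
X = np.vstack([X0, X1])
gate, experts = clm_sketch(X, y)
print((clm_predict(gate, experts, X) == y).mean())   # close to 1.0
```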

Citations: 0
Feature Import Vector Machine: A General Classifier with Flexible Feature Selection.
IF 1.3 · Zone 4 (Mathematics) · Q2 (Mathematics) · Pub Date: 2015-02-01 · Epub Date: 2015-01-26 · DOI: 10.1002/sam.11259
Samiran Ghosh, Yazhen Wang

The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems have drawn much attention recently due to their robustness and generalization capability. The general theme is to construct classifiers based on the training data in a high dimensional space by using all available dimensions. The SVM achieves huge data compression by selecting only the few observations that lie close to the boundary of the classifier function. However, when the number of observations is not very large (small n) but the number of dimensions/features is large (large p), it is not necessarily the case that all available features are of equal importance in the classification context. Selecting a useful fraction of the available features may result in huge data compression. In this paper we propose an algorithmic approach by means of which such an optimal set of features can be selected. In short, we reverse the traditional sequential observation selection strategy of the SVM to one of sequential feature selection. To achieve this, we modify the solution proposed by Zhu and Hastie (2005) in the context of the import vector machine (IVM) to select an optimal sub-dimensional model that builds the final classifier with sufficient accuracy.
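The reversal described here, selecting features sequentially rather than observations, reduces in its simplest form to greedy forward selection wrapped around a kernel classifier. The sketch below uses cross-validated SVM accuracy as the selection criterion, a generic choice rather than the paper's IVM-based one; the toy data are invented:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score

def forward_select(X, y, max_features=5, cv=5):
    """Greedy sequential feature selection around a kernel classifier:
    repeatedly add the feature that most improves CV accuracy."""
    selected, remaining = [], list(range(X.shape[1]))
    best_score = -np.inf
    while remaining and len(selected) < max_features:
        scores = [(cross_val_score(SVC(kernel="rbf"),
                                   X[:, selected + [j]], y, cv=cv).mean(), j)
                  for j in remaining]
        score, j = max(scores)
        if score <= best_score:        # stop when no feature helps
            break
        best_score = score
        selected.append(j)
        remaining.remove(j)
    return selected

rng = np.random.default_rng(7)
X = rng.standard_normal((150, 12))
y = ((X[:, 0] ** 2 + X[:, 3] ** 2) > 2).astype(int)  # only features 0 and 3 matter
print(forward_select(X, y))   # typically picks features 0 and 3 first
```

For production use, scikit-learn's SequentialFeatureSelector automates the same greedy loop.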

Citations: 3
Survival Analysis with Electronic Health Record Data: Experiments with Chronic Kidney Disease.
IF 2.1 · Zone 4 (Mathematics) · Q3 (Computer Science, Artificial Intelligence) · Pub Date: 2014-10-01 · Epub Date: 2014-08-19 · DOI: 10.1002/sam.11236
Yolanda Hagar, David Albers, Rimma Pivovarov, Herbert Chase, Vanja Dukic, Noémie Elhadad

This paper presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on EHR data comprising almost two decades of clinical observations collected at New York-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers on Bayesian multiresolution hazard modeling, with the objective of capturing the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and the potential for joint survival and longitudinal modeling, all of which are discussed both alone and within the EHR CKD context.
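A multiresolution hazard model refines piecewise-constant hazards across resolutions with a Bayesian prior; the raw ingredient, a piecewise-constant hazard computed as events divided by person-time per interval, can be computed directly. The interval edges and data below are invented, and no prior is applied:

```python
import numpy as np

def piecewise_hazard(time, event, edges):
    """Piecewise-constant hazard estimate: events divided by
    person-time within each interval (raw frequency version only;
    the paper places a Bayesian multiresolution prior over these)."""
    hazards = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        at_risk = time > lo                              # in the risk set at lo
        exposure = np.clip(time[at_risk], lo, hi) - lo   # person-time in (lo, hi]
        events = ((time > lo) & (time <= hi) & (event == 1)).sum()
        hazards.append(events / exposure.sum())
    return np.array(hazards)

# Toy right-censored data: exponential event times, rate 1/5 = 0.2.
rng = np.random.default_rng(8)
t_true = rng.exponential(5.0, 500)      # true event times
c = rng.exponential(8.0, 500)           # independent censoring times
time = np.minimum(t_true, c)
event = (t_true <= c).astype(int)
edges = np.array([0.0, 2.0, 4.0, 8.0, 16.0])
print(piecewise_hazard(time, event, edges).round(3))  # all near 0.2
```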

Citations: 0