Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach
Pub Date: 2018-08-01 | Epub Date: 2018-06-19 | DOI: 10.1002/sam.11381
Tianwei Yu
We present a method of variable selection for the sparse generalized additive model. The method does not assume any specific functional form and can select from a large number of candidate variables. It takes the form of incremental forward stagewise regression. Because no functional form is assumed, we devised an approach termed "roughening" to adjust the residuals across iterations. In simulations, we show that the new method is competitive against popular machine learning approaches. We also demonstrate its performance on several real datasets. The method is available as part of the nlnet package on CRAN (https://cran.r-project.org/package=nlnet).
{"title":"Nonlinear variable selection with continuous outcome: a fully nonparametric incremental forward stagewise approach.","authors":"Tianwei Yu","doi":"10.1002/sam.11381","DOIUrl":"https://doi.org/10.1002/sam.11381","url":null,"abstract":"<p><p>We present a method of variable selection for the sparse generalized additive model. The method doesn't assume any specific functional form, and can select from a large number of candidates. It takes the form of incremental forward stagewise regression. Given no functional form is assumed, we devised an approach termed \"roughening\" to adjust the residuals in the iterations. In simulations, we show the new method is competitive against popular machine learning approaches. We also demonstrate its performance using some real datasets. The method is available as a part of the nlnet package on CRAN (https://cran.r-project.org/package=nlnet).</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"11 4","pages":"188-197"},"PeriodicalIF":1.3,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11381","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36866356","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The next-generation K-means algorithm
Pub Date: 2018-08-01 | Epub Date: 2018-05-11 | DOI: 10.1002/sam.11379
Eugene Demidenko
Typically, model-based classification is understood to mean the mixture-distribution approach. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993), for which K-means is equivalent to maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end once the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how should multilevel data be classified? The statistical model-based approach to the K-means algorithm is the key, because it allows statistical simulation and the study of classification properties along the lines of classical statistics. This paper illustrates the application of ML classification to testing the no-clusters hypothesis, to studying various methods for selecting the number of clusters using simulations, to robust clustering using the Laplace distribution, to studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.
{"title":"The next-generation K-means algorithm.","authors":"Eugene Demidenko","doi":"10.1002/sam.11379","DOIUrl":"10.1002/sam.11379","url":null,"abstract":"<p><p>Typically, when referring to a model-based classification, the mixture distribution approach is understood. In contrast, we revive the hard-classification model-based approach developed by Banfield and Raftery (1993) for which K-means is equivalent to the maximum likelihood (ML) estimation. The next-generation K-means algorithm does not end after the classification is achieved, but moves forward to answer the following fundamental questions: Are there clusters, how many clusters are there, what are the statistical properties of the estimated means and index sets, what is the distribution of the coefficients in the clusterwise regression, and how to classify multilevel data? The statistical model-based approach for the K-means algorithm is the key, because it allows statistical simulations and studying the properties of classification following the track of the classical statistics. This paper illustrates the application of the ML classification to testing the no-clusters hypothesis, to studying various methods for selection of the number of clusters using simulations, robust clustering using Laplace distribution, studying properties of the coefficients in clusterwise regression, and finally to multilevel data by marrying the variance components model with K-means.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"11 4","pages":"153-166"},"PeriodicalIF":1.3,"publicationDate":"2018-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6062903/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36368001","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling
Pub Date: 2018-02-01 | Epub Date: 2017-12-06 | DOI: 10.1002/sam.11366
Hien D Nguyen, Jeremy F P Ullmann, Geoffrey J McLachlan, Venkatakaushik Voleti, Wenze Li, Elizabeth M C Hillman, David C Reutens, Andrew L Janke
Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques enable visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology, via Gaussian mixtures, is proposed for clustering data from such visualizations. The methodology is theoretically justified, and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.
{"title":"Whole-Volume Clustering of Time Series Data from Zebrafish Brain Calcium Images via Mixture Modeling.","authors":"Hien D Nguyen, Jeremy F P Ullmann, Geoffrey J McLachlan, Venkatakaushik Voleti, Wenze Li, Elizabeth M C Hillman, David C Reutens, Andrew L Janke","doi":"10.1002/sam.11366","DOIUrl":"https://doi.org/10.1002/sam.11366","url":null,"abstract":"<p><p>Calcium is a ubiquitous messenger in neural signaling events. An increasing number of techniques are enabling visualization of neurological activity in animal models via luminescent proteins that bind to calcium ions. These techniques generate large volumes of spatially correlated time series. A model-based functional data analysis methodology via Gaussian mixtures is suggested for the clustering of data from such visualizations is proposed. The methodology is theoretically justified and a computationally efficient approach to estimation is suggested. An example analysis of a zebrafish imaging experiment is presented.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"11 1","pages":"5-16"},"PeriodicalIF":1.3,"publicationDate":"2018-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11366","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"36069012","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Random Forest Missing Data Algorithms
Pub Date: 2017-12-01 | Epub Date: 2017-06-13 | DOI: 10.1002/sam.11348
Fei Tang, Hemant Ishwaran
Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of handling mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance on their efficacy. Using a large, diverse collection of data sets, the imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on-the-fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting, the latter class representing a generalization of a promising new imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust, with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data were missing not at random.
{"title":"Random Forest Missing Data Algorithms.","authors":"Fei Tang, Hemant Ishwaran","doi":"10.1002/sam.11348","DOIUrl":"https://doi.org/10.1002/sam.11348","url":null,"abstract":"<p><p>Random forest (RF) missing data algorithms are an attractive approach for imputing missing data. They have the desirable properties of being able to handle mixed types of missing data, they are adaptive to interactions and nonlinearity, and they have the potential to scale to big data settings. Currently there are many different RF imputation algorithms, but relatively little guidance about their efficacy. Using a large, diverse collection of data sets, imputation performance of various RF algorithms was assessed under different missing data mechanisms. Algorithms included proximity imputation, on the fly imputation, and imputation utilizing multivariate unsupervised and supervised splitting-the latter class representing a generalization of a new promising imputation algorithm called missForest. Our findings reveal RF imputation to be generally robust with performance improving with increasing correlation. Performance was good under moderate to high missingness, and even (in certain cases) when data was missing not at random.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"10 6","pages":"363-377"},"PeriodicalIF":1.3,"publicationDate":"2017-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11348","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35796889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Use and Communication of Probabilistic Forecasts
Pub Date: 2016-12-01 | Epub Date: 2016-02-23 | DOI: 10.1002/sam.11302
Adrian E Raftery
Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? I review experience with five problems where probabilistic forecasting played an important role. This leads me to identify five types of potential users: Low Stakes Users, who don't need probabilistic forecasts; General Assessors, who need an overall idea of the uncertainty in the forecast; Change Assessors, who need to know if a change is out of line with expectations; Risk Avoiders, who wish to limit the risk of an adverse outcome; and Decision Theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and to consider their goals. Cognitive research tells us that calibration is important for trust in probability forecasts, and that it is important to match the verbal expression with the task. Cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest often seem to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role, but in a limited range of applications.
{"title":"Use and Communication of Probabilistic Forecasts.","authors":"Adrian E Raftery","doi":"10.1002/sam.11302","DOIUrl":"https://doi.org/10.1002/sam.11302","url":null,"abstract":"<p><p>Probabilistic forecasts are becoming more and more available. How should they be used and communicated? What are the obstacles to their use in practice? I review experience with five problems where probabilistic forecasting played an important role. This leads me to identify five types of potential users: Low Stakes Users, who don't need probabilistic forecasts; General Assessors, who need an overall idea of the uncertainty in the forecast; Change Assessors, who need to know if a change is out of line with expectatations; Risk Avoiders, who wish to limit the risk of an adverse outcome; and Decision Theorists, who quantify their loss function and perform the decision-theoretic calculations. This suggests that it is important to interact with users and to consider their goals. The cognitive research tells us that calibration is important for trust in probability forecasts, and that it is important to match the verbal expression with the task. The cognitive load should be minimized, reducing the probabilistic forecast to a single percentile if appropriate. Probabilities of adverse events and percentiles of the predictive distribution of quantities of interest seem often to be the best way to summarize probabilistic forecasts. Formal decision theory has an important role, but in a limited range of applications.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"9 6","pages":"397-410"},"PeriodicalIF":1.3,"publicationDate":"2016-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11302","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34944896","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical Models for Multiple, Rare Outcomes Using Massive Observational Healthcare Databases
Pub Date: 2016-08-01 | Epub Date: 2016-07-17 | DOI: 10.1002/sam.11324
Trevor R Shaddox, Patrick B Ryan, Martijn J Schuemie, David Madigan, Marc A Suchard
Clinical trials often lack power to identify rare adverse drug events (ADEs) and therefore cannot address the threat rare ADEs pose, motivating the need for new ADE detection techniques. Emerging national patient claims and electronic health record databases have inspired post-approval early detection methods like the Bayesian self-controlled case series (BSCCS) regression model. Existing BSCCS models do not account for multiple outcomes, where pathology may be shared across different ADEs. We integrate a pathology hierarchy into the BSCCS model by developing a novel informative hierarchical prior linking outcome-specific effects. Considering shared pathology drastically increases the dimensionality of the already massive models in this field. We develop an efficient method for coping with the dimensionality expansion by reducing the hierarchical model to a form amenable to existing tools. Through a synthetic study we demonstrate decreased bias in risk estimates for drugs when using conditions with different true risk and unequal prevalence. We also examine observational data from the MarketScan Lab Results dataset, exposing the bias that results from aggregating outcomes, as previously employed to estimate risk trends of warfarin and dabigatran for intracranial hemorrhage and gastrointestinal bleeding. We further investigate the limits of our approach by using extremely rare conditions. This research demonstrates that analyzing multiple outcomes simultaneously is feasible at scale and beneficial.
{"title":"Hierarchical Models for Multiple, Rare Outcomes Using Massive Observational Healthcare Databases.","authors":"Trevor R Shaddox, Patrick B Ryan, Martijn J Schuemie, David Madigan, Marc A Suchard","doi":"10.1002/sam.11324","DOIUrl":"10.1002/sam.11324","url":null,"abstract":"<p><p>Clinical trials often lack power to identify rare adverse drug events (ADEs) and therefore cannot address the threat rare ADEs pose, motivating the need for new ADE detection techniques. Emerging national patient claims and electronic health record databases have inspired post-approval early detection methods like the Bayesian self-controlled case series (BSCCS) regression model. Existing BSCCS models do not account for multiple outcomes, where pathology may be shared across different ADEs. We integrate a pathology hierarchy into the BSCCS model by developing a novel informative hierarchical prior linking outcome-specific effects. Considering shared pathology drastically increases the dimensionality of the already massive models in this field. We develop an efficient method for coping with the dimensionality expansion by reducing the hierarchical model to a form amenable to existing tools. Through a synthetic study we demonstrate decreased bias in risk estimates for drugs when using conditions with different true risk and unequal prevalence. We also examine observational data from the MarketScan Lab Results dataset, exposing the bias that results from aggregating outcomes, as previously employed to estimate risk trends of warfarin and dabigatran for intracranial hemorrhage and gastrointestinal bleeding. We further investigate the limits of our approach by using extremely rare conditions. This research demonstrates that analyzing multiple outcomes simultaneously is feasible at scale and beneficial.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"9 4","pages":"260-268"},"PeriodicalIF":1.3,"publicationDate":"2016-08-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5423675/pdf/nihms799155.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34993872","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Nonlinear Joint Latent Variable Models and Integrative Tumor Subtype Discovery
Pub Date: 2016-04-01 | Epub Date: 2016-03-28 | DOI: 10.1002/sam.11306
Binghui Liu, Xiaotong Shen, Wei Pan
Integrative analysis has been used to identify clusters by integrating data of disparate types, such as deoxyribonucleic acid (DNA) copy number alterations and DNA methylation changes, for discovering novel subtypes of tumors. Most existing integrative analysis methods are based on joint latent variable models, which are generally divided into two classes: joint factor analysis and joint mixture modeling, with continuous and discrete parameterizations of the latent variables, respectively. Despite recent progress, many issues remain. In particular, existing integration methods based on joint factor analysis may be inadequate for modeling multiple clusters, due to the unimodality of the assumed Gaussian distribution, while those based on joint mixture modeling may lack the ability to perform dimension reduction and/or feature selection. In this paper, we employ a nonlinear joint latent variable model to allow for flexible modeling that can account for multiple clusters as well as perform dimension reduction and feature selection. We propose a method, called integrative and regularized generative topographic mapping (irGTM), to perform simultaneous dimension reduction across multiple types of data while achieving feature selection separately for each data type. Simulations are performed to examine the operating characteristics of the methods, in which the proposed method compares favorably against the popular iCluster, which is based on a linear joint latent variable model. Finally, a glioblastoma multiforme (GBM) dataset is examined.
{"title":"Nonlinear Joint Latent Variable Models and Integrative Tumor Subtype Discovery.","authors":"Binghui Liu, Xiaotong Shen, Wei Pan","doi":"10.1002/sam.11306","DOIUrl":"https://doi.org/10.1002/sam.11306","url":null,"abstract":"<p><p>Integrative analysis has been used to identify clusters by integrating data of disparate types, such as deoxyribonucleic acid (DNA) copy number alterations and DNA methylation changes for discovering novel subtypes of tumors. Most existing integrative analysis methods are based on joint latent variable models, which are generally divided into two classes: joint factor analysis and joint mixture modeling, with continuous and discrete parameterizations of the latent variables respectively. Despite recent progresses, many issues remain. In particular, existing integration methods based on joint factor analysis may be inadequate to model multiple clusters due to the unimodality of the assumed Gaussian distribution, while those based on joint mixture modeling may not have the ability for dimension reduction and/or feature selection. In this paper, we employ a nonlinear joint latent variable model to allow for flexible modeling that can account for multiple clusters as well as conduct dimension reduction and feature selection. We propose a method, called integrative and regularized generative topographic mapping (irGTM), to perform simultaneous dimension reduction across multiple types of data while achieving feature selection separately for each data type. Simulations are performed to examine the operating characteristics of the methods, in which the proposed method compares favorably against the popular iCluster that is based on a linear joint latent variable model. Finally, a glioblastoma multiforme (GBM) dataset is examined.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"9 2","pages":"106-116"},"PeriodicalIF":1.3,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11306","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"35736330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Composite large margin classifiers with latent subclasses for heterogeneous biomedical data
Pub Date: 2016-04-01 | Epub Date: 2016-01-08 | DOI: 10.1002/sam.11300
Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R Kosorok
High-dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite the large number of candidate classification techniques available, practitioners often face a dilemma in choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability but may be limited in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible but may lose interpretability and have a higher tendency to overfit. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the Composite Large Margin Classifier (CLM), to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part then classified by its own linear classifier. Our method has prediction accuracy comparable to a general nonlinear classifier while maintaining the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers in Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using the CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses of controls that are more likely to develop the disease in the future.
{"title":"Composite large margin classifiers with latent subclasses for heterogeneous biomedical data.","authors":"Guanhua Chen, Yufeng Liu, Dinggang Shen, Michael R Kosorok","doi":"10.1002/sam.11300","DOIUrl":"10.1002/sam.11300","url":null,"abstract":"<p><p>High dimensional classification problems are prevalent in a wide range of modern scientific applications. Despite a large number of candidate classification techniques available to use, practitioners often face a dilemma of choosing between linear and general nonlinear classifiers. Specifically, simple linear classifiers have good interpretability, but may have limitations in handling data with complex structures. In contrast, general nonlinear classifiers are more flexible, but may lose interpretability and have higher tendency for overfitting. In this paper, we consider data with potential latent subgroups in the classes of interest. We propose a new method, namely the Composite Large Margin Classifier (CLM), to address the issue of classification with latent subclasses. The CLM aims to find three linear functions simultaneously: one linear function to split the data into two parts, with each part being classified by a different linear classifier. Our method has comparable prediction accuracy to a general nonlinear classifier, and it maintains the interpretability of traditional linear classifiers. We demonstrate the competitive performance of the CLM through comparisons with several existing linear and nonlinear classifiers by Monte Carlo experiments. Analysis of the Alzheimer's disease classification problem using CLM not only provides a lower classification error in discriminating cases and controls, but also identifies subclasses in controls that are more likely to develop the disease in the future.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"9 2","pages":"75-88"},"PeriodicalIF":1.3,"publicationDate":"2016-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4912001/pdf/nihms737408.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34597836","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Feature Import Vector Machine: A General Classifier with Flexible Feature Selection
Pub Date: 2015-02-01 | Epub Date: 2015-01-26 | DOI: 10.1002/sam.11259
Samiran Ghosh, Yazhen Wang
The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems have drawn much attention recently due to their robustness and generalization capability. The general theme is to construct classifiers based on the training data in a high-dimensional space using all available dimensions. The SVM achieves substantial data compression by selecting only the few observations that lie close to the boundary of the classifier function. However, when the number of observations is not very large (small n) but the number of dimensions/features is large (large p), not all available features are necessarily of equal importance in the classification context. Selecting a useful fraction of the available features may result in substantial data compression. In this paper, we propose an algorithmic approach by which such an optimal set of features can be selected. In short, we reverse the traditional sequential observation selection strategy of the SVM to one of sequential feature selection. To achieve this, we modify the solution proposed by Zhu and Hastie (2005) in the context of the import vector machine (IVM) to select an optimal sub-dimensional model that builds the final classifier with sufficient accuracy.
{"title":"Feature Import Vector Machine: A General Classifier with Flexible Feature Selection.","authors":"Samiran Ghosh, Yazhen Wang","doi":"10.1002/sam.11259","DOIUrl":"https://doi.org/10.1002/sam.11259","url":null,"abstract":"<p><p>The support vector machine (SVM) and other reproducing kernel Hilbert space (RKHS) based classifier systems are drawing much attention recently due to its robustness and generalization capability. General theme here is to construct classifiers based on the training data in a high dimensional space by using all available dimensions. The SVM achieves huge data compression by selecting only few observations which lie close to the boundary of the classifier function. However when the number of observations are not very large (small <i>n</i>) but the number of dimensions/features are large (large <i>p</i>), then it is not necessary that all available features are of equal importance in the classification context. Possible selection of an useful fraction of the available features may result in huge data compression. In this paper we propose an algorithmic approach by means of which such an <i>optimal</i> set of features could be selected. In short, we reverse the traditional sequential observation selection strategy of SVM to that of sequential feature selection. To achieve this we have modified the solution proposed by Zhu and Hastie (2005) in the context of import vector machine (IVM), to select an <i>optimal</i> sub-dimensional model to build the final classifier with sufficient accuracy.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"8 1","pages":"49-63"},"PeriodicalIF":1.3,"publicationDate":"2015-02-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://sci-hub-pdf.com/10.1002/sam.11259","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"34463560","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Survival Analysis with Electronic Health Record Data: Experiments with Chronic Kidney Disease
Pub Date: 2014-10-01 | Epub Date: 2014-08-19 | DOI: 10.1002/sam.11236
Yolanda Hagar, David Albers, Rimma Pivovarov, Herbert Chase, Vanja Dukic, Noémie Elhadad
This paper presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on EHR data comprising almost two decades of clinical observations collected at New York-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers on Bayesian multiresolution hazard modeling, with the objective of capturing the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and the potential for joint survival and longitudinal modeling, all of which are discussed both on their own and within the EHR CKD context.
{"title":"Survival Analysis with Electronic Health Record Data: Experiments with Chronic Kidney Disease.","authors":"Yolanda Hagar, David Albers, Rimma Pivovarov, Herbert Chase, Vanja Dukic, Noémie Elhadad","doi":"10.1002/sam.11236","DOIUrl":"10.1002/sam.11236","url":null,"abstract":"<p><p>This paper presents a detailed survival analysis for chronic kidney disease (CKD). The analysis is based on the EHR data comprising almost two decades of clinical observations collected at New York-Presbyterian, a large hospital in New York City with one of the oldest electronic health records in the United States. Our survival analysis approach centers around Bayesian multiresolution hazard modeling, with an objective to capture the changing hazard of CKD over time, adjusted for patient clinical covariates and kidney-related laboratory tests. Special attention is paid to statistical issues common to all EHR data, such as cohort definition, missing data and censoring, variable selection, and potential for joint survival and longitudinal modeling, all of which are discussed alone and within the EHR CKD context.</p>","PeriodicalId":48684,"journal":{"name":"Statistical Analysis and Data Mining","volume":"7 5","pages":"385-403"},"PeriodicalIF":2.1,"publicationDate":"2014-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8112603/pdf/nihms-1697574.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"38975743","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}