Evan L Reynolds, Brian C Callaghan, Michael Gaies, Mousumi Banerjee
Title: Regression Trees and Ensemble for Multivariate Outcomes
Journal: Sankhya-Series B-Applied and Interdisciplinary Statistics, volume 85, issue 1, pages 77-109
Published: 2023-05-01 (Epub 2023-02-16)
DOI: https://doi.org/10.1007/s13571-023-00301-z
PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12711322/pdf/
Citation count: 0
Abstract
Tree-based methods have become one of the most flexible, intuitive, and powerful analytic tools for exploring complex data structures. The best-documented, and arguably most popular, uses of tree-based methods are in biomedical research, where multivariate outcomes occur commonly (e.g., diastolic and systolic blood pressure, or nerve conduction measures in studies of neuropathy). Existing tree-based methods for multivariate outcomes do not appropriately take into account the correlation that exists in such data. In this paper, we develop goodness-of-split measures for building multivariate regression trees for continuous multivariate outcomes. We propose two general approaches: maximizing within-node homogeneity and maximizing between-node separation. Within-node homogeneity is assessed using the average Mahalanobis distance and the determinant of the variance-covariance matrix, where smaller values indicate a more homogeneous node. Between-node separation is measured using the Mahalanobis distance, Euclidean distance, and standardized Euclidean distance. To enhance prediction accuracy, we extend the single multivariate regression tree to an ensemble of multivariate trees. Extensive simulations are presented to examine the properties of our goodness-of-split measures. Finally, the proposed methods are illustrated using two clinical datasets, one on neuropathy and one on pediatric cardiac surgery.
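The abstract names the split measures but does not give their formulas, so the following is only a minimal sketch of how two of them might be computed for one candidate split, not the authors' implementation. The function names (`mahalanobis_separation`, `determinant_impurity`, `best_split`), the pooled-covariance estimate, the size weighting of the determinants, and the `min_leaf` parameter are all assumptions made for illustration. It assumes at least two continuous outcomes and enough observations per child node for the covariance matrices to be invertible.

```python
import numpy as np


def mahalanobis_separation(left, right):
    """Squared Mahalanobis distance between the two child mean vectors.

    Uses a pooled within-child covariance (an assumed choice).
    Larger values mean better-separated children.
    """
    n_l, n_r = len(left), len(right)
    diff = left.mean(axis=0) - right.mean(axis=0)
    pooled = ((n_l - 1) * np.cov(left, rowvar=False)
              + (n_r - 1) * np.cov(right, rowvar=False)) / (n_l + n_r - 2)
    # Solve pooled @ z = diff instead of forming the inverse explicitly.
    return float(diff @ np.linalg.solve(pooled, diff))


def determinant_impurity(left, right):
    """Size-weighted sum of child covariance determinants (assumed weighting).

    Smaller values mean more homogeneous children.
    """
    return float(sum(len(child) * np.linalg.det(np.cov(child, rowvar=False))
                     for child in (left, right)))


def best_split(x, Y, min_leaf=5):
    """Scan thresholds on a single predictor x and return the
    (threshold, score) pair that maximizes between-node separation."""
    order = np.argsort(x)
    x_sorted, Y_sorted = x[order], Y[order]
    best_threshold, best_score = None, -np.inf
    for i in range(min_leaf, len(x) - min_leaf + 1):
        if x_sorted[i] == x_sorted[i - 1]:
            continue  # no threshold can separate tied predictor values
        score = mahalanobis_separation(Y_sorted[:i], Y_sorted[i:])
        if score > best_score:
            best_threshold = (x_sorted[i - 1] + x_sorted[i]) / 2.0
            best_score = score
    return best_threshold, best_score


# Simulated example: two correlated outcomes whose means shift when x > 0.5.
rng = np.random.default_rng(0)
x = rng.uniform(0, 1, size=200)
cov = np.array([[1.0, 0.6], [0.6, 1.0]])
Y = rng.multivariate_normal([0.0, 0.0], cov, size=200)
Y[x > 0.5] += np.array([3.0, 3.0])

threshold, score = best_split(x, Y)
print(f"chosen threshold: {threshold:.3f}, separation: {score:.2f}")
```

Because the Mahalanobis distance rescales the mean difference by the outcome covariance, correlated outcomes are not double-counted the way they would be under a plain Euclidean distance, which is the motivation the abstract gives for these measures. A full tree would apply `best_split` over every predictor and recurse on the resulting children.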
About the journal:
Sankhya, Series A, publishes original, high quality research articles in various areas of modern statistics, such as probability, theoretical statistics, mathematical statistics and machine learning. The areas are interpreted in a broad sense. Articles are judged on the basis of their novelty and technical correctness.
Sankhya, Series B, primarily covers applied and interdisciplinary statistics including data sciences. Applied articles should preferably include analysis of original data of broad interest, novel applications of methodology and development of methods and techniques of immediate practical use. Authoritative reviews and comprehensive discussion articles in areas of vigorous current research are also welcome.