Methods for building a staff workforce of quantitative scientists in academic health care
Sarah Peskoe, Emily Slade, Lacey Rende, Mary Boulos, Manisha Desai, Mihir Gandhi, Jonathan A. L. Gelfond, Shokoufeh Khalatbari, Phillip J. Schulte, Denise C. Snyder, Sandra L. Taylor, Jesse D. Troy, Roger Vaughan, Gina‐Maria Pomann
Stat, doi:10.1002/sta4.683, published 2024-05-06
Collaborative quantitative scientists, including biostatisticians, epidemiologists, bioinformaticists, and data‐related professionals, play vital roles in research, from study design to data analysis and dissemination. It is imperative that academic health care centers (AHCs) establish an environment that provides opportunities for the quantitative scientists who are hired as staff to develop and advance their careers. With the rapid growth of clinical and translational research, AHCs are charged with establishing organizational methods, training tools, best practices, and guidelines to accelerate and support hiring, training, and retaining this staff workforce. This paper describes three essential elements for building and maintaining a successful unit of collaborative staff quantitative scientists in academic health care centers: (1) organizational infrastructure and management, (2) recruitment, and (3) career development and retention. Specific strategies are provided as examples of how AHCs can excel in these areas.
Considerations in developing a financial model for an academic statistical consulting centre
Christy Brown, Yanming Di, Stacey Slone
Stat, doi:10.1002/sta4.688, published 2024-05-02
In operating an academic statistical consulting centre, it is essential to develop a strategy for covering the anticipated costs incurred, such as personnel, facilities, third‐party data, professional development and marketing, and for handling the revenues generated from sources such as university commitments, extramural grants, fees for service, internal memorandums of understanding and consulting courses. As such, this article describes each of these costs and revenue sources in turn, discusses how they vary over phases of a project and life cycles of a centre, provides a review of both historical and modern perspectives in the literature and includes illustrative examples of financial models from three different institutions. These points of consideration are meant to inform consulting groups who are interested in becoming either more or less centrally structured.
Maximum a posteriori estimation in graphical models using local linear approximation
Ksheera Sagar, Jyotishka Datta, Sayantan Banerjee, Anindya Bhadra
Stat, doi:10.1002/sta4.682, published 2024-05-01
Sparse structure learning in high‐dimensional Gaussian graphical models is an important problem in multivariate statistical inference, since the sparsity pattern naturally encodes the conditional independence relationship among variables. However, maximum a posteriori (MAP) estimation is challenging under hierarchical prior models, and traditional numerical optimization routines or expectation–maximization algorithms are difficult to implement. To this end, our contribution is a novel local linear approximation scheme that circumvents this issue using a very simple computational algorithm. Most importantly, the condition under which our algorithm is guaranteed to converge to the MAP estimate is explicitly stated and is shown to cover a broad class of completely monotone priors, including the graphical horseshoe. Further, the resulting MAP estimate is shown to be sparse and consistent in the ‐norm. Numerical results validate the speed, scalability and statistical performance of the proposed method.
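For readers unfamiliar with the device, the local linear approximation (LLA) idea is easy to state in generic form. The sketch below follows the standard LLA construction (majorize the nonconvex log-prior penalty by its tangent line at the current iterate), which reduces each step to a weighted graphical lasso; the paper's exact weights, algorithm and convergence condition are its own contribution, so treat this as background rather than as the authors' scheme.

```latex
% Generic LLA step for MAP precision-matrix estimation (background sketch,
% not necessarily the paper's exact scheme). With off-diagonal prior
% density g, write the penalty as -log g(|omega_jk|); concavity in
% |omega_jk| gives the tangent-line majorizer at the current iterate:
\[
  -\log g\bigl(|\omega_{jk}|\bigr) \;\le\;
  -\log g\bigl(|\omega_{jk}^{(t)}|\bigr)
  + w_{jk}^{(t)}\bigl(|\omega_{jk}| - |\omega_{jk}^{(t)}|\bigr),
  \qquad
  w_{jk}^{(t)} \;=\; -\,\frac{\mathrm{d}}{\mathrm{d}u}\,\log g(u)\Big|_{u=|\omega_{jk}^{(t)}|},
\]
% so each LLA update is a weighted graphical lasso with sample covariance S:
\[
  \Omega^{(t+1)} \;=\; \operatorname*{arg\,max}_{\Omega \succ 0}\;
  \log\det\Omega \;-\; \operatorname{tr}(S\Omega)
  \;-\; \sum_{j<k} w_{jk}^{(t)}\,\bigl|\omega_{jk}\bigr|.
\]
```

Complete monotonicity of the prior keeps the weights nonnegative and the surrogate a valid majorizer, which is why the class of completely monotone priors appears in the convergence statement.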
Generalised minimum moment aberration for designs with both qualitative and quantitative factors
Yao Xiao, Na Zou, Hong Qin, Kang Wang
Stat, doi:10.1002/sta4.684, published 2024-05-01
The minimum moment aberration and the minimum Lee‐moment aberration criteria are two popular, conceptually simple and computationally cheap criteria for selecting good designs. However, the minimum moment aberration criterion is suited to qualitative factors, and the minimum Lee‐moment aberration criterion cannot distinguish some designs with high‐level quantitative factors. In this paper, the minimum absolute‐moment aberration criterion is proposed to compare and select designs with multi‐level quantitative factors. We establish the statistical justification of this criterion from both theoretical and numerical perspectives. Furthermore, we extend the minimum absolute‐moment aberration criterion to screening designs with both qualitative and quantitative factors, naming the new criterion the minimum mixed‐moment aberration criterion. A numerical study then compares and evaluates the performance of some popular designs with both qualitative and quantitative factors in computer experiments.
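As background for this abstract, Xu's minimum moment aberration — the criterion being generalised — is cheap to compute directly from row-pair coincidences. The sketch below implements those power moments, plus a hypothetical absolute-difference analogue for quantitative factors; the latter is an illustrative assumption of what an "absolute-moment" criterion might look like, not the paper's definition.

```python
import numpy as np
from itertools import combinations

def power_moments(D, t_max=4):
    """Power moments K_t of Xu's minimum moment aberration: K_t is the
    average of delta_ij**t over all row pairs, where delta_ij counts the
    columns in which rows i and j take the same level. Good designs
    sequentially minimise K_1, K_2, ... (suited to qualitative factors)."""
    D = np.asarray(D)
    pairs = list(combinations(range(D.shape[0]), 2))
    delta = np.array([np.sum(D[i] == D[j]) for i, j in pairs], dtype=float)
    return [float(np.mean(delta**t)) for t in range(1, t_max + 1)]

def abs_moments(D, t_max=4):
    """Hypothetical absolute-difference analogue for quantitative factors:
    replace the coincidence count with the L1 distance between rows, so the
    measure is sensitive to how far apart levels are, not merely whether
    they coincide. This is an assumed form, for illustration only."""
    D = np.asarray(D, dtype=float)
    pairs = list(combinations(range(D.shape[0]), 2))
    dist = np.array([np.sum(np.abs(D[i] - D[j])) for i, j in pairs])
    return [float(np.mean(dist**t)) for t in range(1, t_max + 1)]

# Example: compare two 4-run, 3-column two-level designs.
D1 = [[0, 0, 0], [0, 1, 1], [1, 0, 1], [1, 1, 0]]  # regular orthogonal fraction
D2 = [[0, 0, 0], [0, 0, 1], [1, 1, 0], [1, 1, 1]]
print(power_moments(D1), power_moments(D2))  # equal K_1; D1 has the smaller K_2
```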
A sparse empirical Bayes approach to high‐dimensional Gaussian process‐based varying coefficient models
Myungjin Kim, Gyuhyeong Goh
Stat, doi:10.1002/sta4.678, published 2024-04-20
Despite the increasing importance of high‐dimensional varying coefficient models, the study of their Bayesian versions is still in its infancy. This paper contributes to the literature by developing a sparse empirical Bayes formulation that addresses the problem of high‐dimensional model selection in the framework of Bayesian varying coefficient modelling under Gaussian process (GP) priors. To break the computational bottleneck of GP‐based varying coefficient modelling, we introduce a low‐cost computation strategy that incorporates linear algebra techniques and the Laplace approximation into the evaluation of the high‐dimensional posterior model distribution. A simulation study is conducted to demonstrate the superiority of the proposed Bayesian method compared to an existing high‐dimensional varying coefficient modelling approach. In addition, its applicability to real data analysis is illustrated using yeast cell cycle data.
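The Laplace approximation mentioned in the abstract is a standard device: expand the unnormalised log posterior around its mode to approximate the marginal likelihood of each candidate model. In generic form (the paper's low-cost linear algebra for evaluating this at scale is its own contribution):

```latex
% Standard Laplace approximation to the marginal likelihood of model M,
% expanded around the posterior mode \hat{\theta}_M (generic form).
\[
  \log p(y \mid M) \;\approx\;
  \log p\bigl(y \mid \hat{\theta}_M, M\bigr)
  + \log \pi\bigl(\hat{\theta}_M \mid M\bigr)
  + \frac{d_M}{2}\log(2\pi)
  - \frac{1}{2}\log\bigl|H_M\bigr|,
\]
% where d_M = dim(theta_M) and H_M is the negative Hessian of the
% unnormalised log posterior at the mode:
\[
  H_M \;=\; -\nabla^2 \log\bigl\{p(y \mid \theta, M)\,\pi(\theta \mid M)\bigr\}\Big|_{\theta = \hat{\theta}_M},
  \qquad
  p(M \mid y) \;\propto\; p(y \mid M)\,p(M).
\]
```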
The value of flexible funding for collaborative biostatistics units in universities and academic medical centres
Emily Slade, Sarah Jane K. Robbins, Kristen J. McQuerry, Anthony A. Mangino
Stat, doi:10.1002/sta4.679, published 2024-04-20
Collaborative biostatistics units within universities and academic medical centres operate under a wide range of different funding models; common to many of these models is the challenge of allocating time to activities that are not linked to a specific research project, such as professional development, mentorship and administrative tasks. The purpose of this paper is to describe a proposed model for ‘flexible funding’, that is, funding that is not linked to a specific research project, within a collaborative biostatistics unit and to detail the benefits and challenges associated with the proposed model. We present results from a qualitative study representing the perspectives of collaborative biostatisticians working under the proposed flexible funding model. In addition to providing examples of activities undertaken as part of time allocated to flexible funding, the qualitative results reveal several benefits of flexible funding both for a collaborative biostatistician (e.g., job satisfaction and professional development) and for the collaborative biostatistics unit as a whole (e.g., retention, process improvement, and leadership).
Using extreme value theory to evaluate the leading pedestrian interval road safety intervention
Nicola Hewett, Lee Fawcett, Andrew Golightly, Neil Thorpe
Stat, doi:10.1002/sta4.676, published 2024-04-18
Improving road safety is hugely important, with the number of deaths on the world's roads remaining unacceptably high: an estimated 1.3 million people die each year as a result of road traffic collisions. Current practice for treating collision hotspots is almost always reactive: once a threshold number of collisions has been exceeded during some pre‐determined observation period, treatment is applied (e.g., road safety cameras). Traffic collisions are rare, so prolonged observation periods are necessary. However, traffic conflicts are more frequent and incur only a fraction of the social cost; hence, traffic conflict before/after studies can be conducted over shorter time periods. We investigate the effect of implementing the leading pedestrian interval treatment at signalised intersections as a safety intervention in a city in North America. Pedestrian‐vehicle traffic conflict data were collected from treatment and control sites during the before and after periods. We implement a before/after study on post‐encroachment times (PETs), where small PET values denote 'near‐misses'. Hence, extreme value theory is employed to model extremes of our PET processes, with adjustments to the usual modelling framework to account for temporal dependence and treatment effects.
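Because dangerously small PETs sit in the lower tail, a peaks-over-threshold analysis can be run on the negated series. Below is a minimal i.i.d. generalized-Pareto sketch using scipy; the 1-second threshold and the synthetic data are illustrative assumptions, and the sketch deliberately omits the temporal-dependence and treatment-effect adjustments that are the substance of the paper.

```python
import numpy as np
from scipy import stats

def fit_pet_extremes(pet_seconds, threshold=1.0):
    """Peaks-over-threshold fit for post-encroachment times (PETs).
    Small PETs are the dangerous near-misses, so model exceedances of the
    negated series: X = -PET exceeds u = -threshold exactly when
    PET < threshold. Returns the fitted GPD shape and scale and the
    number of exceedances."""
    x = -np.asarray(pet_seconds, dtype=float)
    u = -threshold
    exceedances = x[x > u] - u
    # Fix loc=0 so only the shape (xi) and scale (sigma) are estimated.
    shape, _, scale = stats.genpareto.fit(exceedances, floc=0)
    return shape, scale, exceedances.size

# Synthetic PETs (seconds), purely for illustration.
pets = np.random.default_rng(1).gamma(shape=4.0, scale=1.5, size=5000)
xi, sigma, n_exc = fit_pet_extremes(pets, threshold=1.0)
# Conditional tail probability P(PET < 0.5 | PET < 1.0): the exceedance
# corresponding to PET = 0.5 is threshold - 0.5 = 0.5 on the negated scale.
p_tail = stats.genpareto.sf(0.5, c=xi, scale=sigma)
```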
The data science discovery program: A model for data science consulting in higher education
C. Taylor Brown, Megan Mehta, Mahathi Ryali, Xiaoran Dong, Iliya Shadfar, Jacqueline Dominquez Davalos, Aaron Culich, Anthony Suen
Stat, doi:10.1002/sta4.677, published 2024-04-18
As one of the largest data science research incubator initiatives in the country, the University of California, Berkeley's Data Science Discovery Program serves as a case study for a scalable and sustainable model of data science consulting in higher education. This case contributes to the broader literature on data science consulting in higher education by analysing the programme's development; institutional influences; staffing and structural model; and defining features, which may prove instructive to similar programmes at other institutions. The programme is characterised by a unique structure of undergraduate consultations led by graduate student mentorship and governance; a streamlined, multidepartmental model that facilitates scalability and sustainability; and diverse modes of undergraduate consulting, including one‐on‐one ad hoc data science consultations, extended data science project development and management, peer mentorship and data science workshop instruction. This case demonstrates that universities may be able to initiate a low‐stakes, small‐scale data science consulting initiative and then progressively scale up the project in collaboration with multiple departments and organisations across campus.
Utilizing latent connectivity among mediators in high-dimensional mediation analysis
Jia Yuan Hu, Marley DeSimone, Qing Wang
Stat, doi:10.1002/sta4.675, published 2024-04-16
Mediation analysis aims to unveil the underlying relationship between an outcome variable and an exposure variable through one or more intermediate variables called mediators. In recent decades, research on mediation analysis has focused on multivariate mediation models, where the number of mediating variables is possibly of high dimension. This paper concerns high-dimensional mediation analysis and proposes a three-step algorithm that extracts and utilizes inter-connectivity among candidate mediators. More specifically, the proposed methodology starts with a screening procedure to reduce the dimensionality of the initial set of candidate mediators, followed by a penalized regression model that incorporates both parameter- and group-wise regularization, and ends with fitting a multivariate mediation model and identifying active mediating variables through a joint significance test. To showcase the performance of the proposed algorithm, we conducted two simulation studies in high-dimensional and ultra-high-dimensional settings, respectively. Furthermore, we demonstrate the practical applications of the proposal using a real data set that uncovers the possible impact of environmental toxicants on women's gestational age at delivery through 61 biomarkers belonging to 7 biological pathways.
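A minimal sketch of the three-step shape described in the abstract, under simplifying assumptions: marginal-correlation screening in step 1, a plain lasso in step 2 (the paper's penalty additionally has a group-wise component exploiting mediator inter-connectivity, which this sketch does not capture) and the max-p joint significance test in step 3. All names and thresholds here are illustrative.

```python
import numpy as np
import statsmodels.api as sm
from sklearn.linear_model import LassoCV

def mediation_screen(X, M, y, n_keep=50, alpha_level=0.05):
    """Simplified three-step high-dimensional mediation analysis.
    X: (n,) exposure; M: (n, p) candidate mediators; y: (n,) outcome.
    Returns indices of mediators declared active."""
    n, p = M.shape
    # Step 1: screen -- rank mediators by |corr(M_j, y)|, keep the top n_keep.
    cors = np.abs([np.corrcoef(M[:, j], y)[0, 1] for j in range(p)])
    keep = np.argsort(cors)[::-1][:n_keep]
    # Step 2: penalised outcome regression of y on (X, retained mediators);
    # mediators with nonzero lasso coefficients survive. (Plain lasso here;
    # the paper combines parameter- and group-wise penalties.)
    Z = np.column_stack([X, M[:, keep]])
    lasso = LassoCV(cv=5).fit(Z, y)
    nonzero = keep[np.nonzero(lasso.coef_[1:])[0]]
    # Step 3: joint significance test -- a mediator is active only if both
    # the X -> M_j and M_j -> y | X paths are significant: p = max(p_a, p_b).
    active = []
    for j in nonzero:
        p_a = sm.OLS(M[:, j], sm.add_constant(X)).fit().pvalues[1]
        design = sm.add_constant(np.column_stack([X, M[:, j]]))
        p_b = sm.OLS(y, design).fit().pvalues[2]
        if max(p_a, p_b) < alpha_level:
            active.append(int(j))
    return active
```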
High‐dimensional feature screening for nonlinear associations with survival outcome using restricted mean survival time
Yaxian Chen, Kwok Fai Lam, Zhonghua Liu
Stat, doi:10.1002/sta4.673, published 2024-04-07
Feature screening is an important tool in analysing ultrahigh‐dimensional data, particularly in omics and oncology studies. However, most attention has been focused on identifying features that have a linear or monotonic impact on the response variable. Detecting a sparse set of variables that have a nonlinear or nonmonotonic relationship with the response variable remains a challenging task. To fill this gap, this paper proposes a robust model‐free screening approach for right‐censored survival data by providing a new perspective: quantifying the covariate effect on the restricted mean survival time, rather than the routinely used hazard function. The proposed measure, based on the difference between the restricted mean survival time of covariate‐stratified and overall data, is able to identify comprehensive types of associations, including linear, nonlinear, nonmonotone and even local dependencies such as change points. The sure screening property is established, and a more flexible iterative screening procedure is developed to increase the accuracy of variable screening. Simulation studies demonstrate the superiority of the proposed method in selecting important features with a complex association with the response variable. The potential of applying the proposed method to interval‐censored failure time data has also been explored in simulations, with promising results. The method is applied to a breast cancer dataset to identify potential prognostic factors, revealing potential associations between breast cancer and lymphoma.
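The screening measure contrasts restricted mean survival time (RMST) across covariate strata. As a simplified stand-in for the proposed statistic, the sketch below computes the RMST as the area under a Kaplan-Meier curve up to a horizon tau (assuming no tied event times) and scores each covariate by the RMST gap between its median-split strata; the paper's actual contrast is against the overall-data RMST and comes with a sure screening guarantee.

```python
import numpy as np

def km_rmst(time, event, tau):
    """RMST: area under the Kaplan-Meier curve up to tau.
    Assumes no tied event times (each event drops the at-risk count by one)."""
    time, event = np.asarray(time, dtype=float), np.asarray(event, dtype=int)
    order = np.argsort(time)
    t, d = time[order], event[order]
    n = len(t)
    times, surv, s = [0.0], [1.0], 1.0
    for i in range(n):
        if t[i] > tau:
            break
        if d[i] == 1:                      # event: KM step down
            s *= 1.0 - 1.0 / (n - i)       # n - i subjects still at risk
            times.append(t[i])
            surv.append(s)
    times.append(tau)
    # Integrate the step function: sum of surv * width of each interval.
    return sum(surv[k] * (times[k + 1] - times[k]) for k in range(len(surv)))

def rmst_screen(X, time, event, tau):
    """Score each covariate by the absolute RMST gap between its median-split
    strata; larger gaps suggest stronger (possibly nonlinear) association."""
    X = np.asarray(X, dtype=float)
    time, event = np.asarray(time, dtype=float), np.asarray(event, dtype=int)
    scores = np.empty(X.shape[1])
    for j in range(X.shape[1]):
        hi = X[:, j] > np.median(X[:, j])
        scores[j] = abs(km_rmst(time[hi], event[hi], tau)
                        - km_rmst(time[~hi], event[~hi], tau))
    return np.argsort(scores)[::-1]        # covariate indices, strongest first
```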