Pub Date: 2024-12-31. DOI: 10.1093/biostatistics/kxae054
Jiasheng Shi, Yizhao Zhou, Jing Huang
Time-since-infection (TSI) models, which use disease surveillance data to model infectious diseases, have become increasingly popular due to their flexibility and capacity to address complex disease control questions. However, a notable limitation of TSI models is their primary reliance on incidence data. Even when hospitalization data are available, existing TSI models have not been crafted to improve the estimation of disease transmission or to estimate hospitalization-related parameters, metrics crucial for understanding a pandemic and planning hospital resources. Moreover, their dependence on reported infection data makes them vulnerable to variations in data quality. In this study, we advance TSI models by integrating hospitalization data, a significant step forward for this class of models. We introduce hospitalization propensity parameters to jointly model incidence and hospitalization data. We use a composite likelihood function to accommodate the complex data structure and a Monte Carlo expectation-maximization algorithm to estimate model parameters. We analyze COVID-19 data to estimate disease transmission, assess risk factor impacts, and calculate hospitalization propensity. Our model improves the accuracy of estimating the instantaneous reproduction number in TSI models, particularly when hospitalization data are of higher quality than incidence data. It enables the estimation of key infectious disease parameters without relying on contact tracing data and provides a foundation for integrating TSI models with other infectious disease models.
{"title":"Unlocking the power of time-since-infection models: data augmentation for improved instantaneous reproduction number estimation.","authors":"Jiasheng Shi, Yizhao Zhou, Jing Huang","doi":"10.1093/biostatistics/kxae054","DOIUrl":"10.1093/biostatistics/kxae054","url":null,"abstract":"<p><p>The time-since-infection (TSI) models, which use disease surveillance data to model infectious diseases, have become increasingly popular due to their flexibility and capacity to address complex disease control questions. However, a notable limitation of TSI models is their primary reliance on incidence data. Even when hospitalization data are available, existing TSI models have not been crafted to improve the estimation of disease transmission or to estimate hospitalization-related parameters-metrics crucial for understanding a pandemic and planning hospital resources. Moreover, their dependence on reported infection data makes them vulnerable to variations in data quality. In this study, we advance TSI models by integrating hospitalization data, marking a significant step forward in modeling with TSI models. We introduce hospitalization propensity parameters to jointly model incidence and hospitalization data. We use a composite likelihood function to accommodate complex data structure and a Monte Carlo expectation-maximization algorithm to estimate model parameters. We analyze COVID-19 data to estimate disease transmission, assess risk factor impacts, and calculate hospitalization propensity. Our model improves the accuracy of estimating the instantaneous reproduction number in TSI models, particularly when hospitalization data is of higher quality than incidence data. 
It enables the estimation of key infectious disease parameters without relying on contact tracing data and provides a foundation for integrating TSI models with other infectious disease models.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11878408/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143558889","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-12-31. DOI: 10.1093/biostatistics/kxaf006
Lucas Etourneau, Laura Fancello, Samuel Wieczorek, Nelle Varoquaux, Thomas Burger
Label-free bottom-up proteomics using mass spectrometry and liquid chromatography has long been established as one of the most popular high-throughput analysis workflows for proteome characterization. However, it produces data hindered by complex and heterogeneous missing values, whose imputation has long remained problematic. To cope with this, we introduce Pirat, an algorithm that tackles this challenge using an original likelihood maximization strategy. Notably, it models the instrument limit by learning a global censoring mechanism from the available data. Moreover, it estimates the covariance matrix between enzymatic cleavage products (i.e. peptides or precursor ions), while offering a natural way to integrate complementary transcriptomic information when multi-omic assays are available. Our benchmarking on several datasets covering a variety of experimental designs (number of samples, acquisition mode, missingness patterns, etc.) and using a variety of metrics (differential analysis ground truth or imputation errors) shows that Pirat outperforms all pre-existing imputation methods. Beyond the interest of Pirat as an imputation tool, these results pinpoint the need for a paradigm change in proteomics imputation, as most pre-existing strategies could be boosted by incorporating similar models to account for instrument censorship or for correlation structures, whether grounded in the analytical pipeline or arising from a multi-omic approach.
{"title":"Penalized likelihood optimization for censored missing value imputation in proteomics.","authors":"Lucas Etourneau, Laura Fancello, Samuel Wieczorek, Nelle Varoquaux, Thomas Burger","doi":"10.1093/biostatistics/kxaf006","DOIUrl":"10.1093/biostatistics/kxaf006","url":null,"abstract":"<p><p>Label-free bottom-up proteomics using mass spectrometry and liquid chromatography has long been established as one of the most popular high-throughput analysis workflows for proteome characterization. However, it produces data hindered by complex and heterogeneous missing values, which imputation has long remained problematic. To cope with this, we introduce Pirat, an algorithm that harnesses this challenge using an original likelihood maximization strategy. Notably, it models the instrument limit by learning a global censoring mechanism from the data available. Moreover, it estimates the covariance matrix between enzymatic cleavage products (ie peptides or precursor ions), while offering a natural way to integrate complementary transcriptomic information when multi-omic assays are available. Our benchmarking on several datasets covering a variety of experimental designs (number of samples, acquisition mode, missingness patterns, etc.) and using a variety of metrics (differential analysis ground truth or imputation errors) shows that Pirat outperforms all pre-existing imputation methods. 
Beyond the interest of Pirat as an imputation tool, these results pinpoint the need for a paradigm change in proteomics imputation, as most pre-existing strategies could be boosted by incorporating similar models to account for the instrument censorship or for the correlation structures, either grounded to the analytical pipeline or arising from a multi-omic approach.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":1.8,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143694529","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
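As a rough illustration of the kind of censoring mechanism the abstract describes (not Pirat's actual model), a Gaussian likelihood with left-censoring at a known detection limit can be maximized directly; observed values contribute the density and censored ones contribute the normal CDF below the limit:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

def censored_mle(observed, n_censored, lod):
    """MLE of (mu, sigma) for Gaussian data left-censored at a known
    limit of detection (lod): observed values contribute log pdf terms,
    each censored value contributes log P(X < lod)."""
    def nll(theta):
        mu, log_sigma = theta
        sigma = np.exp(log_sigma)  # parameterize sigma on the log scale
        ll = norm.logpdf(observed, mu, sigma).sum()
        ll += n_censored * norm.logcdf(lod, mu, sigma)
        return -ll
    res = minimize(nll, x0=[np.mean(observed), 0.0], method="Nelder-Mead")
    mu, log_sigma = res.x
    return mu, np.exp(log_sigma)

# Simulate intensities, censor everything below the detection limit,
# then recover the original mean and spread from the truncated sample.
rng = np.random.default_rng(0)
x = rng.normal(20.0, 2.0, size=2000)
lod = 18.0
obs = x[x >= lod]
mu_hat, sigma_hat = censored_mle(obs, n_censored=(x < lod).sum(), lod=lod)
print(mu_hat, sigma_hat)
```

A naive mean of the observed values alone would be biased upward; the censored-likelihood fit recovers parameters close to the truth of (20, 2).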
Pub Date: 2024-12-31. DOI: 10.1093/biostatistics/kxae039
Richard J Cook, Jerald F Lawless
In life history analysis of data from cohort studies, it is important to address the process by which participants are identified and selected. Many health studies select or enrol individuals based on whether they have experienced certain health-related events, for example, disease diagnosis or some complication from disease. Standard methods of analysis rely on assumptions concerning the independence of selection and a person's prospective life history process, given their prior history. Violations of such assumptions are common, however, and can bias estimation of process features. This has implications for the internal and external validity of cohort studies, and for the transportability of results to a population. In this paper, we study failure time analysis by proposing a joint model for the cohort selection process and the failure process of interest. This allows us to address both independence assumptions and the transportability of study results. It is shown that transportability cannot be guaranteed in the absence of auxiliary information on the population. Conditions that produce dependent selection and types of auxiliary data are discussed and illustrated in numerical studies. The proposed framework is applied to a study of the risk of psoriatic arthritis in persons with psoriasis.
{"title":"Selection processes, transportability, and failure time analysis in life history studies.","authors":"Richard J Cook, Jerald F Lawless","doi":"10.1093/biostatistics/kxae039","DOIUrl":"10.1093/biostatistics/kxae039","url":null,"abstract":"<p><p>In life history analysis of data from cohort studies, it is important to address the process by which participants are identified and selected. Many health studies select or enrol individuals based on whether they have experienced certain health related events, for example, disease diagnosis or some complication from disease. Standard methods of analysis rely on assumptions concerning the independence of selection and a person's prospective life history process, given their prior history. Violations of such assumptions are common, however, and can bias estimation of process features. This has implications for the internal and external validity of cohort studies, and for the transportabilty of results to a population. In this paper, we study failure time analysis by proposing a joint model for the cohort selection process and the failure process of interest. This allows us to address both independence assumptions and the transportability of study results. It is shown that transportability cannot be guaranteed in the absence of auxiliary information on the population. Conditions that produce dependent selection and types of auxiliary data are discussed and illustrated in numerical studies. 
The proposed framework is applied to a study of the risk of psoriatic arthritis in persons with psoriasis.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11823244/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142513408","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
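A minimal numerical illustration of outcome-dependent selection, and of why auxiliary information on the population matters: when selection probabilities are known (here, exactly; in practice they would come from auxiliary data), inverse-probability weighting removes the bias. This is a toy sketch, not the authors' joint modeling framework:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
y = rng.normal(50.0, 10.0, size=n)          # outcome in the full population
p_select = 1 / (1 + np.exp(-(y - 50) / 5))  # selection depends on the outcome
selected = rng.random(n) < p_select

naive = y[selected].mean()                  # biased upward by selection
weights = 1.0 / p_select[selected]          # known selection probabilities
ipw = np.average(y[selected], weights=weights)

print(round(naive, 1), round(ipw, 1))       # naive overestimates; IPW is near 50
```

Without the `p_select` information (or some stand-in for it from the population), nothing in the selected sample alone reveals the size of the bias, which mirrors the paper's point that transportability cannot be guaranteed absent auxiliary information.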
Pub Date: 2024-12-31. DOI: 10.1093/biostatistics/kxaf037
Pierre-Emmanuel Sugier, Yazdan Asgari, Mohammed Sedki, Thérèse Truong, Benoit Liquet
Genome-wide association studies (GWASs) have highlighted the importance of pleiotropy in human diseases, where one gene can affect two or more unrelated traits. Examining shared genetic risk factors across multiple diseases can enhance our understanding of these conditions by pinpointing new genes and biological pathways involved. Furthermore, with an increasing wealth of GWAS summary statistics available to the scientific community, leveraging these findings across multiple phenotypes could unveil novel pleiotropic associations. Existing selection methods examine pleiotropic associations one by one, at the scale of either the genetic variant or the gene, and thus cannot consider all the genetic information at the same time. To address this limitation, we propose a new approach called MPSG (Meta-analysis model adapted for Pleiotropy Selection with Group structure). This method fits a penalized multivariate meta-analysis adapted for pleiotropy and takes into account the group structure nested in the data to select relevant variants and genes (or pathways) from all the genetic information. To do so, we implemented an alternating direction method of multipliers (ADMM) algorithm. We compared the performance of the method with other benchmark meta-analysis approaches, such as GCPBayes, PLACO, and ASSET, by considering different kinds of summary statistics as inputs. We provide an application of our method to the identification of potential pleiotropic genes between breast and thyroid cancers.
{"title":"Meta-analysis models with group structure for pleiotropy detection at gene and variant level using summary statistics from multiple datasets.","authors":"Pierre-Emmanuel Sugier, Yazdan Asgari, Mohammed Sedki, Thérèse Truong, Benoit Liquet","doi":"10.1093/biostatistics/kxaf037","DOIUrl":"10.1093/biostatistics/kxaf037","url":null,"abstract":"<p><p>Genome-wide association studies (GWASs) have highlighted the importance of pleiotropy in human diseases, where one gene can impact 2 or more unrelated traits. Examining shared genetic risk factors across multiple diseases can enhance our understanding of these conditions by pinpointing new genes and biological pathways involved. Furthermore, with an increasing wealth of GWAS summary statistics available to the scientific community, leveraging these findings across multiple phenotypes could unveil novel pleiotropic associations. Existing selection methods examine pleiotropic associations one by one at a scale of either the genetic variant or the gene, and thus cannot consider all the genetic information at the same time. To address this limitation, we propose a new approach called MPSG (Meta-analysis model adapted for Pleiotropy Selection with Group structure). This method performs a penalized multivariate meta-analysis method adapted for pleiotropy and takes into account the group structure information nested in the data to select relevant variants and genes (or pathways) from all the genetic information. To do so, we implemented an alternating direction method of multipliers algorithm. We compared the performance of the method with other benchmark meta-analysis approaches such as GCPBayes, PLACO, and ASSET by considering as inputs different kinds of summary statistics. 
We provide an application of our method to the identification of potential pleiotropic genes between breast and thyroid cancers.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":"26 1","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145535173","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
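The block-wise selection behind group-structured penalties can be illustrated with the proximal operator of the group lasso, the workhorse step inside a typical ADMM iteration. This is a generic sketch of the penalty's behavior (a whole gene's variants are kept or dropped together), not the MPSG implementation:

```python
import numpy as np

def group_soft_threshold(beta, groups, lam):
    """Proximal operator of the group-lasso penalty lam * sum_g ||beta_g||_2:
    each group of coefficients is shrunk toward zero as a block, so a whole
    group (e.g. all variants in a gene) is selected or zeroed out together."""
    out = np.zeros_like(beta, dtype=float)
    for idx in groups:
        block = beta[idx]
        nrm = np.linalg.norm(block)
        if nrm > lam:
            out[idx] = (1 - lam / nrm) * block  # shrink the surviving block
    return out

beta = np.array([3.0, 4.0, 0.1, -0.1])
groups = [np.array([0, 1]), np.array([2, 3])]
# First group (norm 5) is shrunk but kept; second group (norm ~0.14) is zeroed.
print(group_soft_threshold(beta, groups, lam=1.0))
```

In an ADMM loop, this operator alternates with a quadratic update of the meta-analysis loss, which is what lets the method consider all variants and groups simultaneously rather than one association at a time.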
Pub Date: 2024-12-31. DOI: 10.1093/biostatistics/kxae047
Ziren Jiang, Gen Li, Eric F Lock
Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, for scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints and is used to infer trends in species diversity for the population. Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation.
{"title":"BAMITA: Bayesian multiple imputation for tensor arrays.","authors":"Ziren Jiang, Gen Li, Eric F Lock","doi":"10.1093/biostatistics/kxae047","DOIUrl":"10.1093/biostatistics/kxae047","url":null,"abstract":"<p><p>Data increasingly take the form of a multi-way array, or tensor, in several biomedical domains. Such tensors are often incompletely observed. For example, we are motivated by longitudinal microbiome studies in which several timepoints are missing for several subjects. There is a growing literature on missing data imputation for tensors. However, existing methods give a point estimate for missing values without capturing uncertainty. We propose a multiple imputation approach for tensors in a flexible Bayesian framework, that yields realistic simulated values for missing entries and can propagate uncertainty through subsequent analyses. Our model uses efficient and widely applicable conjugate priors for a CANDECOMP/PARAFAC (CP) factorization, with a separable residual covariance structure. This approach is shown to perform well with respect to both imputation accuracy and uncertainty calibration, for scenarios in which either single entries or entire fibers of the tensor are missing. For two microbiome applications, it is shown to accurately capture uncertainty in the full microbiome profile at missing timepoints and used to infer trends in species diversity for the population. 
Documented R code to perform our multiple imputation approach is available at https://github.com/lockEF/MultiwayImputation.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":""},"PeriodicalIF":2.0,"publicationDate":"2024-12-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11823239/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142824682","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
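The idea of imputing tensor entries under a CP factorization can be sketched with a rank-1 EM/alternating-least-squares loop. This is a frequentist toy version for intuition only; BAMITA itself is Bayesian, handles higher ranks and a separable residual covariance, and yields multiple imputations with uncertainty:

```python
import numpy as np

def cp_rank1_impute(tensor, mask, n_iter=200):
    """EM-style imputation under a rank-1 CP model T ~ a (outer) b (outer) c:
    fill missing entries with the current reconstruction, then update each
    factor by its closed-form least-squares solution."""
    T = np.where(mask, tensor, tensor[mask].mean())  # initialize missing cells
    a = np.ones(T.shape[0])
    b = np.ones(T.shape[1])
    c = np.ones(T.shape[2])
    for _ in range(n_iter):
        a = np.einsum("ijk,j,k->i", T, b, c) / (b @ b * (c @ c))
        b = np.einsum("ijk,i,k->j", T, a, c) / (a @ a * (c @ c))
        c = np.einsum("ijk,i,j->k", T, a, b) / (a @ a * (b @ b))
        recon = np.einsum("i,j,k->ijk", a, b, c)
        T = np.where(mask, tensor, recon)            # E-step: re-impute
    return T

rng = np.random.default_rng(2)
truth = np.einsum("i,j,k->ijk", rng.uniform(1, 2, 4),
                  rng.uniform(1, 2, 5), rng.uniform(1, 2, 6))
mask = rng.random(truth.shape) > 0.2                 # ~20% of entries missing
imputed = cp_rank1_impute(truth, mask)
print(np.max(np.abs(imputed - truth)))               # small reconstruction error
```

Where this point-estimate loop returns a single filled-in tensor, the paper's Bayesian treatment instead draws many completions from the posterior, which is what allows uncertainty to propagate into downstream diversity analyses.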
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae017
Simone Tiberi, Joël Meili, Peiying Cai, Charlotte Soneson, Dongze He, Hirak Sarkar, Alejandra Avalos-Pacheco, Rob Patro, Mark D Robinson
Although transcriptomics data are typically used to analyze mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g. healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, i.e. reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks in which our approach shows good performance, in terms of sensitivity and error control, vs. state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. DifferentialRegulation is distributed as a Bioconductor R package.
{"title":"DifferentialRegulation: a Bayesian hierarchical approach to identify differentially regulated genes.","authors":"Simone Tiberi, Joël Meili, Peiying Cai, Charlotte Soneson, Dongze He, Hirak Sarkar, Alejandra Avalos-Pacheco, Rob Patro, Mark D Robinson","doi":"10.1093/biostatistics/kxae017","DOIUrl":"10.1093/biostatistics/kxae017","url":null,"abstract":"<p><p>Although transcriptomics data is typically used to analyze mature spliced mRNA, recent attention has focused on jointly investigating spliced and unspliced (or precursor-) mRNA, which can be used to study gene regulation and changes in gene expression production. Nonetheless, most methods for spliced/unspliced inference (such as RNA velocity tools) focus on individual samples, and rarely allow comparisons between groups of samples (e.g. healthy vs. diseased). Furthermore, this kind of inference is challenging, because spliced and unspliced mRNA abundance is characterized by a high degree of quantification uncertainty, due to the prevalence of multi-mapping reads, ie reads compatible with multiple transcripts (or genes), and/or with both their spliced and unspliced versions. Here, we present DifferentialRegulation, a Bayesian hierarchical method to discover changes between experimental conditions with respect to the relative abundance of unspliced mRNA (over the total mRNA). We model the quantification uncertainty via a latent variable approach, where reads are allocated to their gene/transcript of origin, and to the respective splice version. We designed several benchmarks where our approach shows good performance, in terms of sensitivity and error control, vs. state-of-the-art competitors. Importantly, our tool is flexible, and works with both bulk and single-cell RNA-sequencing data. 
DifferentialRegulation is distributed as a Bioconductor R package.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"1079-1093"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11639160/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141421995","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
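The latent allocation of ambiguous reads can be illustrated with a toy EM for the unspliced fraction of a single gene, where the probabilities `q_s` and `q_u` that a spliced or unspliced molecule yields an ambiguous read are assumed known. This is a drastic simplification of the paper's Bayesian hierarchical model, intended only to show the allocate-then-update logic:

```python
def em_unspliced_fraction(n_s, n_u, n_a, q_s=0.3, q_u=0.6, n_iter=100):
    """Toy EM for the unspliced fraction pi of one gene.
    n_s / n_u: reads uniquely compatible with spliced / unspliced mRNA;
    n_a: ambiguous reads compatible with both versions.
    E-step: responsibility of 'unspliced' for each ambiguous read.
    M-step: update pi from the expected unspliced count."""
    pi = 0.5
    for _ in range(n_iter):
        r = pi * q_u / (pi * q_u + (1 - pi) * q_s)   # E-step
        pi = (n_u + n_a * r) / (n_s + n_u + n_a)     # M-step
    return pi

# 700 spliced-only, 200 unspliced-only, 100 ambiguous reads:
print(em_unspliced_fraction(n_s=700, n_u=200, n_a=100))  # converges to ~0.24
```

The real model additionally allocates multi-mapping reads across genes and transcripts and places priors over these proportions, so that quantification uncertainty carries through to the differential test.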
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae004
Salil Koner, Sheng Luo
Modern longitudinal studies collect multiple outcomes as the primary endpoints to understand the complex dynamics of the diseases. Oftentimes, especially in clinical trials, the joint variation among the multidimensional responses plays a significant role in assessing the differential characteristics between two or more groups, rather than drawing inferences based on a single outcome. We develop a projection-based two-sample significance test to identify the population-level difference between multivariate profiles observed under a sparse longitudinal design. The methodology is built upon widely adopted multivariate functional principal component analysis to reduce the dimension of the infinite-dimensional multi-modal functions while preserving the dynamic correlation between the components. The test applies to a wide class of (non-stationary) covariance structures of the response, and it detects a significant group difference based on a single p-value, thereby overcoming the issue of adjusting for multiple p-values that arises when comparing the means of each component separately. Finite-sample numerical studies demonstrate that the test maintains the type-I error and is powerful in detecting significant group differences, compared to state-of-the-art testing procedures. The test is applied to two major longitudinal studies of Alzheimer's disease and Parkinson's disease (PD) patients: the TOMMORROW study of individuals at high risk of mild cognitive impairment, to detect differences in cognitive test scores between the pioglitazone and placebo groups, and the Azilect study, to assess the efficacy of rasagiline as a potential treatment to slow down the progression of PD.
{"title":"Projection-based two-sample inference for sparsely observed multivariate functional data.","authors":"Salil Koner, Sheng Luo","doi":"10.1093/biostatistics/kxae004","DOIUrl":"10.1093/biostatistics/kxae004","url":null,"abstract":"<p><p>Modern longitudinal studies collect multiple outcomes as the primary endpoints to understand the complex dynamics of the diseases. Oftentimes, especially in clinical trials, the joint variation among the multidimensional responses plays a significant role in assessing the differential characteristics between two or more groups, rather than drawing inferences based on a single outcome. We develop a projection-based two-sample significance test to identify the population-level difference between the multivariate profiles observed under a sparse longitudinal design. The methodology is built upon widely adopted multivariate functional principal component analysis to reduce the dimension of the infinite-dimensional multi-modal functions while preserving the dynamic correlation between the components. The test applies to a wide class of (non-stationary) covariance structures of the response, and it detects a significant group difference based on a single p-value, thereby overcoming the issue of adjusting for multiple p-values that arise due to comparing the means in each of components separately. Finite-sample numerical studies demonstrate that the test maintains the type-I error, and is powerful to detect significant group differences, compared to the state-of-the-art testing procedures. 
The test is carried out on two significant longitudinal studies for Alzheimer's disease and Parkinson's disease (PD) patients, namely, TOMMORROW study of individuals at high risk of mild cognitive impairment to detect differences in the cognitive test scores between the pioglitazone and the placebo groups, and Azillect study to assess the efficacy of rasagiline as a potential treatment to slow down the progression of PD.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"1156-1177"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11639128/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139984624","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
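The projection idea, with a single p-value for the whole multivariate comparison, can be sketched for densely observed curves: project both groups onto pooled principal components and permutation-test the difference of mean score vectors. The paper's method handles sparse designs and multivariate FPCA with asymptotic theory; this is only a schematic analogue:

```python
import numpy as np

def projection_permutation_test(X, Y, n_components=2, n_perm=999, seed=0):
    """Two-sample test for curves sampled on a common grid: project onto
    the leading principal components of the pooled data, then compare the
    mean score vectors with a permutation test (one p-value overall,
    instead of one test per component)."""
    rng = np.random.default_rng(seed)
    pooled = np.vstack([X, Y])
    centered = pooled - pooled.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    scores = centered @ vt[:n_components].T
    n_x = len(X)

    def stat(idx):
        return np.linalg.norm(scores[idx[:n_x]].mean(0) - scores[idx[n_x:]].mean(0))

    idx = np.arange(len(pooled))
    observed = stat(idx)
    exceed = sum(stat(rng.permutation(idx)) >= observed for _ in range(n_perm))
    return (1 + exceed) / (1 + n_perm)

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 20)
X = np.sin(2 * np.pi * grid) + rng.normal(0, 0.3, (30, 20))        # group 1
Y = np.sin(2 * np.pi * grid) + 0.8 + rng.normal(0, 0.3, (30, 20))  # shifted group
print(projection_permutation_test(X, Y))  # small p-value for the shifted group
```

Collapsing the comparison to one statistic on the score vectors is what avoids the multiplicity adjustment the abstract mentions.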
Pub Date: 2024-10-01. DOI: 10.1093/biostatistics/kxae007
Shanghong Xie, R Todd Ogden
Linear and generalized linear scalar-on-function modeling have been commonly used to understand the relationship between a scalar response variable (e.g. continuous or binary outcomes) and functional predictors. Such techniques are sensitive to model misspecification when the relationship between the response variable and the functional predictors is complex. On the other hand, support vector machines (SVMs) are among the most robust prediction models, but they do not take into account the high correlations between repeated measurements and cannot be used for irregular data. In this work, we propose a novel method that integrates functional principal component analysis with SVM techniques for classification and regression, to account for the continuous nature of functional data and the nonlinear relationship between the scalar response variable and the functional predictors. We demonstrate the performance of our method through extensive simulation experiments and two real data applications: the classification of alcoholics using electroencephalography signals and the prediction of glucobrassicin concentration using near-infrared reflectance spectroscopy. Our method is especially advantageous when the measurement errors in the functional predictors are relatively large.
{"title":"Functional support vector machine.","authors":"Shanghong Xie, R Todd Ogden","doi":"10.1093/biostatistics/kxae007","DOIUrl":"10.1093/biostatistics/kxae007","url":null,"abstract":"<p><p>Linear and generalized linear scalar-on-function modeling have been commonly used to understand the relationship between a scalar response variable (e.g. continuous, binary outcomes) and functional predictors. Such techniques are sensitive to model misspecification when the relationship between the response variable and the functional predictors is complex. On the other hand, support vector machines (SVMs) are among the most robust prediction models but do not take account of the high correlations between repeated measurements and cannot be used for irregular data. In this work, we propose a novel method to integrate functional principal component analysis with SVM techniques for classification and regression to account for the continuous nature of functional data and the nonlinear relationship between the scalar response variable and the functional predictors. We demonstrate the performance of our method through extensive simulation experiments and two real data applications: the classification of alcoholics using electroencephalography signals and the prediction of glucobrassicin concentration using near-infrared reflectance spectroscopy. 
Our methods especially have more advantages when the measurement errors in functional predictors are relatively large.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"1178-1194"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11639177/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"140112299","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
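A schematic of the FPCA-then-classify pipeline: compute principal-component scores of the curves, then fit a linear SVM on the scores. For self-containment the SVM here is a bare-bones hinge-loss subgradient solver on synthetic curves, not the estimation procedure the authors use:

```python
import numpy as np

def fpca_scores(curves, n_components=2):
    """Functional PCA via SVD of the centered data matrix
    (curves sampled on a common grid); returns component scores."""
    centered = curves - curves.mean(axis=0)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T

def train_linear_svm(scores, labels, lam=0.01, lr=0.1, n_epochs=200):
    """Minimal linear SVM trained by subgradient descent on the
    regularized hinge loss; labels must be +/-1."""
    w = np.zeros(scores.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for x, y in zip(scores, labels):
            if y * (w @ x + b) < 1:          # margin violation: hinge gradient
                w += lr * (y * x - lam * w)
                b += lr * y
            else:                            # only the penalty term shrinks w
                w -= lr * lam * w
    return w, b

rng = np.random.default_rng(3)
grid = np.linspace(0, 1, 25)
class_a = np.sin(2 * np.pi * grid) + rng.normal(0, 0.2, (40, 25))
class_b = np.cos(2 * np.pi * grid) + rng.normal(0, 0.2, (40, 25))
curves = np.vstack([class_a, class_b])
labels = np.array([1] * 40 + [-1] * 40)

scores = fpca_scores(curves)
w, b = train_linear_svm(scores, labels)
pred = np.sign(scores @ w + b)
print((pred == labels).mean())  # training accuracy on the two curve classes
```

Working on a handful of smooth FPC scores, rather than the raw noisy grid values, is what makes the SVM robust to measurement error in the predictors, the setting where the paper reports the largest gains.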
Pub Date : 2024-10-01 DOI: 10.1093/biostatistics/kxae022
{"title":"Correction to: Exponential family measurement error models for single-cell CRISPR screens.","authors":"","doi":"10.1093/biostatistics/kxae022","DOIUrl":"10.1093/biostatistics/kxae022","url":null,"abstract":"","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"1273"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12097883/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141319004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date : 2024-10-01 DOI: 10.1093/biostatistics/kxad035
Arman Oganisian, Kelly D Getz, Todd A Alonzo, Richard Aplenc, Jason A Roy
We develop a Bayesian semiparametric model for the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in a phase III clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT is not randomized, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semiparametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects' transition to subsequent treatment or death in continuous time. G-computation is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. Using our approach, we estimate the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.
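The estimand in the abstract above, survival probability under a hypothetical dynamic ACT rule obtained by g-computation, can be illustrated with a deliberately small forward simulation. Everything numeric below is a made-up placeholder: a two-course discrete-time toy replaces the paper's continuous-time Gamma Process model, and the transition probabilities are invented. Only the g-computation logic itself (simulate each patient forward under the rule, then average the survival indicator) mirrors the method described.

```python
import random

def g_computation_survival(rule, n_sim=20000, seed=1):
    """Monte Carlo g-computation for a toy two-course version of the
    problem: simulate each patient forward under the hypothetical dynamic
    rule and average the survival indicator. All probabilities below are
    illustrative placeholders, not the paper's fitted model."""
    rng = random.Random(seed)
    survived = 0
    for _ in range(n_sim):
        cardiac = rng.gauss(0.0, 1.0)   # baseline cardiac function (standardized)
        alive = True
        for course in range(2):
            act = rule(cardiac)          # rule decides ACT from current cardiac state
            # ACT lowers disease mortality but is cardiotoxic, and the
            # added risk is assumed worse when cardiac function is poor
            p_death = 0.15 - 0.07 * act + 0.05 * act * max(0.0, -cardiac)
            if rng.random() < p_death:
                alive = False
                break
            # cardiac function declines stochastically, more so under ACT
            cardiac -= 0.3 * act + rng.gauss(0.0, 0.2)
        survived += alive
    return survived / n_sim

always_act = lambda cardiac: 1
adaptive = lambda cardiac: 1 if cardiac > -0.5 else 0  # withhold ACT when cardiac function is poor
p_always = g_computation_survival(always_act)
p_adaptive = g_computation_survival(adaptive)
```

In the paper this forward simulation runs through a posterior draw of the Gamma Process transition model rather than fixed probabilities, which is what turns the averaged indicator into a posterior over the potential survival probability, adjusted for time-varying confounding.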
{"title":"Bayesian semiparametric model for sequential treatment decisions with informative timing.","authors":"Arman Oganisian, Kelly D Getz, Todd A Alonzo, Richard Aplenc, Jason A Roy","doi":"10.1093/biostatistics/kxad035","DOIUrl":"10.1093/biostatistics/kxad035","url":null,"abstract":"<p><p>We develop a Bayesian semiparametric model for the impact of dynamic treatment rules on survival among patients diagnosed with pediatric acute myeloid leukemia (AML). The data consist of a subset of patients enrolled in a phase III clinical trial in which patients move through a sequence of four treatment courses. At each course, they undergo treatment that may or may not include anthracyclines (ACT). While ACT is known to be effective at treating AML, it is also cardiotoxic and can lead to early death for some patients. Our task is to estimate the potential survival probability under hypothetical dynamic ACT treatment strategies, but there are several impediments. First, since ACT is not randomized, its effect on survival is confounded over time. Second, subjects initiate the next course depending on when they recover from the previous course, making timing potentially informative of subsequent treatment and survival. Third, patients may die or drop out before ever completing the full treatment sequence. We develop a generative Bayesian semiparametric model based on Gamma Process priors to address these complexities. At each treatment course, the model captures subjects' transition to subsequent treatment or death in continuous time. G-computation is used to compute a posterior over potential survival probability that is adjusted for time-varying confounding. 
Using our approach, we estimate the efficacy of hypothetical treatment rules that dynamically modify ACT based on evolving cardiac function.</p>","PeriodicalId":55357,"journal":{"name":"Biostatistics","volume":" ","pages":"947-961"},"PeriodicalIF":1.8,"publicationDate":"2024-10-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11471958/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"139479547","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}