Gene mutation estimations via mutual information and Ewens sampling based CNN & machine learning algorithms.
Pub Date: 2025-02-03 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2460076
Wanyang Dai
We estimate gene mutation rates by developing convolutional neural network (CNN) and machine learning algorithms based on mutual information and Ewens sampling. More precisely, we develop a systematic methodology by constructing a CNN, and we develop two machine learning algorithms to study protein production with target gene sequences and protein structures. The core of the CNN and machine learning approach is a two-stage optimization problem that balances gene mutation rates during protein production: we optimally coordinate the consistency between the given input DNA sequences and the given (or optimally computed) target sequences by controlling their intermediate gene mutation rates. The purpose of doing so is to support gene editing and protein structure prediction; for example, once the gene mutation rates are estimated, the computational complexity of protein structure prediction is reduced to a reasonable degree. Our CNN numerical optimization scheme consists of two newly designed machine learning algorithms. The stochastic gradients for the two algorithms are designed according to the Kuhn-Tucker conditions with boundary constraints, supported by Ewens sampling, multi-input multi-output (MIMO) mutual information, and codon optimization techniques. The associated learning-rate bounds are derived explicitly, the two algorithms are implemented numerically, and their convergence and optimality are proved mathematically. To illustrate the usage of our study, we also conduct a real-world data implementation.
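The sampling ingredient named in the abstract is the classical Ewens sampling formula, which gives the probability of an allele-frequency configuration as a function of the scaled mutation rate theta. As a point of reference only (this is not the paper's CNN pipeline, and the function name is illustrative), a minimal Python sketch of that formula:

```python
from math import factorial, prod

def ewens_probability(config, theta):
    """Probability of an allele configuration under the Ewens sampling formula.

    config[j-1] = a_j, the number of allele types represented exactly j times,
    so that sum_j j * a_j = n (the sample size); theta > 0 is the scaled
    mutation rate. P(a) = n! / theta^(n) * prod_j theta^(a_j) / (j^(a_j) a_j!),
    where theta^(n) is the rising factorial theta (theta+1) ... (theta+n-1).
    """
    n = sum(j * a for j, a in enumerate(config, start=1))
    rising = prod(theta + k for k in range(n))  # rising factorial theta^(n)
    num = prod(theta**a / (j**a * factorial(a))
               for j, a in enumerate(config, start=1))
    return factorial(n) / rising * num

# Example: n = 4 genes carrying two allele types, each observed twice.
print(ewens_probability([0, 2, 0, 0], theta=1.0))  # 0.125
```

A quick sanity check on such an implementation: for n = 4 and theta = 1 the probabilities of the five possible configurations sum to one.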
{"title":"Gene mutation estimations via mutual information and Ewens sampling based CNN & machine learning algorithms.","authors":"Wanyang Dai","doi":"10.1080/02664763.2025.2460076","DOIUrl":"https://doi.org/10.1080/02664763.2025.2460076","url":null,"abstract":"<p><p>We conduct gene mutation rate estimations via developing mutual information and Ewens sampling based convolutional neural network (CNN) and machine learning algorithms. More precisely, we develop a systematic methodology through constructing a CNN. Meanwhile, we develop two machine learning algorithms to study protein production with target gene sequences and protein structures. The core of the CNN and machine learning approach is to address a two-stage optimization problem to balance gene mutation rates during protein production. To wit, we try to optimally coordinate the consistency between the given input DNA sequences and the given (or optimally computed) target ones through controlling their intermediate gene mutation rates. The purposes in doing so are aimed to conduct gene editing and protein structure prediction. For example, after the gene mutation rates are estimated, the computing complexity of protein structure prediction will be reduced to a reasonable degree. Our developed CNN numerical optimization scheme consists of two newly designed machine learning algorithms. The stochastic gradients for the two algorithms are designed according to the Kuhn-Tucker conditions with boundary constraints and with the support of Ewens sampling, multi-input multi-output (MIMO) mutual information, and codon optimization techniques. The associated learning rate bounds are explicitly derived from the method and the two algorithms are numerically implemented. The convergence and optimality of the algorithms are mathematically proved. To illustrate the usage of our study, we also conduct a real-world data implementation.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2321-2353"},"PeriodicalIF":1.1,"publicationDate":"2025-02-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416021/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029916","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Change point detection to analyze air pollution and its economic effects: an exponentially weighted moving average perspective.
Pub Date: 2025-02-02 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2455636
Shabbir Ahmad, Muhammad Riaz, Tahir Mahmood, Nasir Abbas
Air pollution has a direct impact on every society, leading to consequential effects on the economy of a nation. Poor air quality adversely affects human health, resulting in various economic outcomes such as rising healthcare costs, diminished labor productivity, negative impacts on tourism and living standards, increased regulatory expenses for businesses, and heightened economic disparities. Effective control methods are essential to monitor factors influencing the economy, including air quality. The presence of toxic substances in the air reduces air quality, necessitating its monitoring through indices like PM10. Among statistical process control tools, control charts are the most prominent for efficient change point detection. This study introduces a new process monitoring tool that incorporates additional auxiliary information, if available, alongside the main variable of interest. The proposed methodology ensures detection ability remains robust, even under disturbances in the auxiliary variable. Furthermore, mathematical analyses reveal that many existing statistical quality control tools become special cases of the proposed structure for specific sensitivity parameter values. Evaluated through properties of run length distribution, the proposed chart allows control of the robustness-efficiency balance by adjusting its sensitivity parameter. A practical implementation demonstrates the effectiveness of the chart in monitoring air quality data.
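The proposed chart builds on the classical exponentially weighted moving average (EWMA) recursion. As a baseline for comparison (the paper's auxiliary-information statistic and its sensitivity parameter are not reproduced here), a minimal sketch of a standard EWMA chart with time-varying limits, assuming in-control parameters are given or estimated from phase-I data:

```python
import numpy as np

def ewma_chart(x, lam=0.2, L=3.0, mu0=None, sigma0=None):
    """EWMA control chart: returns the chart statistics, the time-varying
    control limits, and the first out-of-control index (or None)."""
    x = np.asarray(x, dtype=float)
    mu0 = x.mean() if mu0 is None else mu0        # in-control mean
    sigma0 = x.std(ddof=1) if sigma0 is None else sigma0
    z = np.empty_like(x)
    z_prev = mu0
    for i, xt in enumerate(x):
        z_prev = lam * xt + (1 - lam) * z_prev    # EWMA recursion
        z[i] = z_prev
    t = np.arange(1, len(x) + 1)
    half = L * sigma0 * np.sqrt(lam / (2 - lam) * (1 - (1 - lam) ** (2 * t)))
    ucl, lcl = mu0 + half, mu0 - half
    ooc = np.nonzero((z > ucl) | (z < lcl))[0]
    return z, lcl, ucl, (int(ooc[0]) if ooc.size else None)
```

The robustness-efficiency trade-off discussed in the abstract corresponds here to the choice of lam: smaller values react faster to small sustained shifts, larger values to abrupt ones.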
{"title":"Change point detection to analyze air pollution and its economic effects: an exponentially weighted moving average perspective.","authors":"Shabbir Ahmad, Muhammad Riaz, Tahir Mahmood, Nasir Abbas","doi":"10.1080/02664763.2025.2455636","DOIUrl":"10.1080/02664763.2025.2455636","url":null,"abstract":"<p><p>Air pollution has a direct impact on every society, leading to consequential effects on the economy of a nation. Poor air quality adversely affects human health, resulting in various economic outcomes such as rising healthcare costs, diminished labor productivity, negative impacts on tourism and living standards, increased regulatory expenses for businesses, and heightened economic disparities. Effective control methods are essential to monitor factors influencing the economy, including air quality. The presence of toxic substances in the air reduces air quality, necessitating its monitoring through indices like PM10. Among statistical process control tools, control charts are the most prominent for efficient change point detection. This study introduces a new process monitoring tool that incorporates additional auxiliary information, if available, alongside the main variable of interest. The proposed methodology ensures detection ability remains robust, even under disturbances in the auxiliary variable. Furthermore, mathematical analyses reveal that many existing statistical quality control tools become special cases of the proposed structure for specific sensitivity parameter values. Evaluated through properties of run length distribution, the proposed chart allows control of the robustness-efficiency balance by adjusting its sensitivity parameter. A practical implementation demonstrates the effectiveness of the chart in monitoring air quality data.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2113-2155"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404093/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992776","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the use and misuse of time-rescaling to assess the goodness-of-fit of self-exciting temporal point processes.
Pub Date: 2025-02-02 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2459245
M-A El-Aroui
The paper first highlights important drawbacks and biases related to the common use of time-rescaling to assess the goodness-of-fit (GoF) of self-exciting temporal point process (SETPP) models. It then presents a new predictive time-rescaling approach leading to an asymptotically unbiased GoF framework for general SETPPs in the case of single observed trajectories. The predictive approach focuses on forecasting accuracy and addresses the bias resulting from the plugged-in estimated parameters. Dawid's prequential approach is used, and model checking is based mainly on the forecasting accuracy of arrival times. These times are transformed, using sequentially estimated parameters, into random vectors that are proved to converge in probability, under the null hypothesis and standard regularity conditions, to vectors of iid Exponential(1) random variables. Numerical experiments compare the performance of the standard and predictive time-rescaling for GoF assessment of non-homogeneous Poisson and Hawkes self-exciting temporal processes. Data on Japanese seismic events are also used to illustrate the dynamic aspect of the proposed model-checking approach.
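The baseline being critiqued is the classical time-rescaling theorem: if events t_1 < t_2 < ... follow a point process with compensator Lambda(t), the rescaled gaps tau_i = Lambda(t_i) - Lambda(t_{i-1}) are iid Exponential(1). A minimal sketch of that standard, non-predictive check, assuming the compensator is supplied rather than estimated (so the plug-in bias the paper targets does not arise here):

```python
import numpy as np
from scipy import stats

def time_rescaling_gof(event_times, compensator):
    """Classical time-rescaling GoF check: transform event times through the
    cumulative intensity and KS-test the gaps against Exponential(1)."""
    L = np.array([compensator(t) for t in event_times])
    taus = np.diff(np.concatenate(([0.0], L)))
    return stats.kstest(taus, "expon")   # standard Exp(1) reference

# Example: homogeneous Poisson process of rate 2, so Lambda(t) = 2t.
rng = np.random.default_rng(0)
times = np.cumsum(rng.exponential(scale=0.5, size=500))
print(time_rescaling_gof(times, lambda t: 2.0 * t))
```

In practice the compensator involves estimated parameters, which is exactly where the biases discussed in the paper enter.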
{"title":"On the use and misuse of time-rescaling to assess the goodness-of-fit of self-exciting temporal point processes.","authors":"M-A El-Aroui","doi":"10.1080/02664763.2025.2459245","DOIUrl":"https://doi.org/10.1080/02664763.2025.2459245","url":null,"abstract":"<p><p>The paper first highlights important drawbacks and biases related to the common use of time-rescaling to assess the goodness-of-fit (Gof) of self-exciting temporal point process (SETPP) models. Then it presents a new predictive time-rescaling approach leading to an asymptotically unbiased Gof framework for general SETPPs in the case of single observed trajectories. The predictive approach focuses on forecasting accuracy and addresses the bias problem resulting from the plugged-in estimated parameters. Dawid's prequential approach is used and the models' checking is mainly based on the forecasting accuracy of arrival times. These times are transformed, using sequentially estimated parameters, into random vectors which are proved to converge in probability under the null hypothesis and standard regulatory conditions to vectors of iid Exponential(1) rv's. Numerical experiments are used to compare the performances of the standard and predictive time-rescaling for Gof assessment of non-homogeneous Poisson and Hawkes self-exciting temporal processes. Data of Japanese seismic events are also used to illustrate the dynamic aspect of the proposed model-checking approach.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2247-2270"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416029/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029909","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Gradient test to assess homogeneity of probabilities in discrete-time transition models with application in agricultural science data.
Pub Date: 2025-02-02 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2457008
Laura Vicuña Torres de Paula, Idemauro Antonio Rodrigues de Lara, Cesar Augusto Taconeli, Carolina Reigada, Rafael de Andrade Moral
Longitudinal studies in discrete or continuous time involving categorical data are common in the agricultural sciences. Transition models can be used to analyse the resulting data, especially when the aim is to describe category changes over time and to accommodate covariates arising from the experimental design. Here we focus on discrete-time models, for which it is critical to assess whether the underlying process is stationary or not. Tests based on likelihood procedures are very useful, and here we propose the gradient test to assess stationarity, that is, homogeneity of the transition probabilities. We carried out simulation studies to evaluate the proposed test, which indicated good performance regarding type-I error and power when compared to other classical tests available in the literature. As motivation we present two studies with agricultural data: the first, from entomology, with nominal responses, and the second concerning the degree of injury in pigs. Using our proposed test, stationarity and non-stationarity were verified in the first and second applications, respectively. Since the gradient test for stationarity has a simplified structure compared to other tests, it is a useful alternative when carrying out inference in these types of models.
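For reference, the gradient statistic pairs the score evaluated at the restricted (null) estimate with the unrestricted estimate, T_G = U(theta_null)' (theta_hat - theta_null), avoiding the information-matrix inversion that Wald and score tests require; that is the "simplified structure" the abstract mentions. A generic sketch plus a one-parameter worked example (not the paper's transition-model setting):

```python
import numpy as np

def gradient_statistic(score_at_null, theta_hat, theta_null):
    """Gradient test statistic T_G = U(theta_null)' (theta_hat - theta_null).

    score_at_null: score vector at the restricted (null) estimate;
    theta_hat: unrestricted MLE; theta_null: restricted estimate.
    Asymptotically chi-square under H0, with df = number of restrictions.
    """
    return float(np.dot(score_at_null, theta_hat - theta_null))

# Worked example: H0: p = 0.5 for x successes in n Bernoulli trials.
n, x, p0 = 100, 61, 0.5
score = (x - n * p0) / (p0 * (1 - p0))   # d logL / dp evaluated at p0
T = gradient_statistic(np.array([score]), np.array([x / n]), np.array([p0]))
print(T)  # 4.84, referred to a chi-square(1) distribution
```

In this scalar case T_G coincides with the classical score statistic (x - n p0)^2 / (n p0 (1 - p0)), which is a convenient check.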
{"title":"Gradient test to assess homogeneity of probabilities in discrete-time transition models with application in agricultural science data.","authors":"Laura Vicuña Torres de Paula, Idemauro Antonio Rodrigues de Lara, Cesar Auguto Taconeli, Carolina Reigada, Rafael de Andrade Moral","doi":"10.1080/02664763.2025.2457008","DOIUrl":"10.1080/02664763.2025.2457008","url":null,"abstract":"<p><p>Longitudinal studies in discrete or continuous time involving categorical data are common in agricultural sciences. Transition models can be used as a means to analyse the resulting data, especially when the aim is to describe category changes over time, as well as to accommodate covariates due to experimental design. Here we focus on discrete-time models, for which it is critical to assess whether the underlying process is stationary or not. Tests based on likelihood procedures are very useful, and here we propose the Gradient test to assess stationary, or homogeneity of transition probabilities. We carried out simulation studies to evaluate the performance of the proposed test, which indicated a good performance regarding type-I error and power when compared to other classical tests available in the literature. As motivation we present two studies with agricultural data, the first one applied to entomology with nominal responses and the second application refers to the degree of injury in pigs. Using our proposed test, stationarity and non-stationarity were verified respectively in the applications. Since the gradient test to assess stationarity has a simplified structure when compared to other tests, it is therefore a useful alternative when carrying out inference in these types of models.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2172-2190"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404091/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992721","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pathway-based genetic association analysis for overdispersed count data.
Pub Date: 2025-02-02 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2460073
Yang Liu
Overdispersion is a common phenomenon in genetic data, such as gene expression count data. In genetic association studies, it is important to investigate the association between the expression of a gene and a set of genetic variants from a pathway. However, existing approaches for pathway analysis are primarily designed for continuous and binary outcomes and are not applicable to overdispersed count data. In this paper, we propose a hierarchical approach to analyzing the association between an overdispersed count response and a set of low-frequency genetic variants in negative binomial regression. We derive score-type test statistics for both fixed and random effects of genetic variants, and we introduce a novel procedure for efficiently combining these two statistics into a global test. Through simulation studies, we demonstrate that the proposed method tends to be more powerful than existing methods under a wide range of scenarios. We also apply the proposed method to a colorectal cancer study, demonstrating its power in identifying associations between gene expression and somatic mutations.
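A rough Python sketch of the two ingredients being combined: fit the null negative binomial model with covariates only, then form an unnormalized burden-type (fixed-effect) statistic and a variance-component (random-effect) statistic from the per-variant scores. This is an illustrative analogue, not the paper's derivation; the dispersion parameter nb_alpha is fixed for simplicity (in practice it is estimated), and the null distributions and the combination step are omitted.

```python
import numpy as np
import statsmodels.api as sm

def pathway_score_statistics(y, X, G, nb_alpha=1.0):
    """Unnormalized score-type statistics for a variant set G (n x m) after
    fitting a null negative binomial model with covariates X only."""
    null = sm.GLM(y, sm.add_constant(X),
                  family=sm.families.NegativeBinomial(alpha=nb_alpha)).fit()
    r = y - null.fittedvalues        # residuals under the null model
    U = G.T @ r                      # per-variant score contributions
    burden = float(U.sum()) ** 2     # fixed effect: variants act in one direction
    skat = float(U @ U)              # random effects: sum of squared scores
    return burden, skat
```

Calibrating these statistics requires their null variances (a mixture-of-chi-squares distribution for the variance-component part), which is where the paper's derivations come in.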
{"title":"Pathway-based genetic association analysis for overdispersed count data.","authors":"Yang Liu","doi":"10.1080/02664763.2025.2460073","DOIUrl":"https://doi.org/10.1080/02664763.2025.2460073","url":null,"abstract":"<p><p>Overdispersion is a common phenomenon in genetic data, such as gene expression count data. In genetic association studies, it is important to investigate the association between a gene expression and a set of genetic variants from a pathway. However, existing approaches for pathway analysis are primarily designed for continuous and binary outcomes and are not applicable to overdispersed count data. In this paper, we propose a hierarchical approach to analyze the association between an overdispersed count response and a set of low-frequency genetic variants in negative binomial regression. We derive score-type test statistics for both fixed and random effects of genetic variants, and further introduce a novel procedure for efficiently combining these two statistics for global testing. Through simulation studies, we demonstrate that the proposed method tends to be more powerful than existing methods under a wide range of scenarios. Additionally, we apply the proposed method to a colorectal cancer study, demonstrating its power in identifying associations between gene expression and somatic mutations.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2306-2320"},"PeriodicalIF":1.1,"publicationDate":"2025-02-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416034/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029923","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Penalized functional regression using R package PFLR.
Pub Date: 2025-01-28 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2457011
Rob Cameron, Tianyu Guan, Haolun Shi, Zhenhua Lin
Penalized functional regression is a useful tool to estimate models for applications where the effect/coefficient function is assumed to be truncated. The truncated coefficient function occurs when the functional predictor does not influence the response after a certain cutoff point on the time domain. The R package PFLR offers an extensive suite of methods for advanced functional regression techniques with penalization. The package implements four distinct methods, each tailored to different models, effectively addressing a range of scenarios. This is demonstrated through simulations as well as an application to particulate matter emissions data. Generic S3 methods are also implemented for each model to help with summary, visualization and interpretation.
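To make "truncated coefficient function" concrete without reproducing the PFLR API, here is a language-neutral Python sketch (all simulation settings invented) of the scalar-on-function model y_i = integral of X_i(t) beta(t) dt + eps_i that such methods estimate, fit by penalized least squares with a second-difference roughness penalty. The true beta(t) vanishes past t = 0.5, which is the truncation structure the package's methods target.

```python
import numpy as np

rng = np.random.default_rng(1)
T, n = 100, 200                                  # grid size, sample size
t = np.linspace(0, 1, T)
beta = np.where(t < 0.5, np.sin(2 * np.pi * t), 0.0)   # truncated at t = 0.5

X = rng.standard_normal((n, T)).cumsum(axis=1) / np.sqrt(T)  # random curves
y = X @ beta / T + 0.05 * rng.standard_normal(n)             # y = int(X beta) + noise

# Penalized least squares: minimize ||y - X b / T||^2 + lam * ||D2 b||^2,
# where D2 is the second-difference operator enforcing smoothness of b.
D2 = np.diff(np.eye(T), n=2, axis=0)
lam = 1e-2
A = X.T @ X / T**2 + lam * D2.T @ D2
b_hat = np.linalg.solve(A, X.T @ y / T)
print(np.abs(b_hat[t > 0.6]).max())   # small: true coefficient vanishes there
```

A plain roughness penalty as above only shrinks the tail of the estimate toward zero; the point of truncation-oriented penalties like those PFLR implements is to estimate the cutoff itself.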
{"title":"Penalized functional regression using R package PFLR.","authors":"Rob Cameron, Tianyu Guan, Haolun Shi, Zhenhua Lin","doi":"10.1080/02664763.2025.2457011","DOIUrl":"10.1080/02664763.2025.2457011","url":null,"abstract":"<p><p>Penalized functional regression is a useful tool to estimate models for applications where the effect/coefficient function is assumed to be truncated. The truncated coefficient function occurs when the functional predictor does not influence the response after a certain cutoff point on the time domain. The R package <b>PFLR</b> offers an extensive suite of methods for advanced functional regression techniques with penalization. The package implements four distinct methods, each tailored to different models, effectively addressing a range of scenarios. This is demonstrated through simulations as well as an application to particulate matter emissions data. Generic S3 methods are also implemented for each model to help with summary, visualization and interpretation.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2191-2205"},"PeriodicalIF":1.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12424440/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145064707","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Clustering of recurrent events data.
Pub Date: 2025-01-28 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2452966
G Babykina, V Vandewalle, J Carretero-Bravo
Nowadays, data are often timestamped; thus, when analysing events that may occur several times for the same individual (recurrent events), it is desirable to model the whole dynamics of the counting process rather than to focus on the total number of events. Such data are encountered in hospital readmissions, disease recurrences, and repeated failures of industrial systems. Recurrent events can be analysed in the counting process framework, as in the Andersen-Gill model, assuming that the baseline intensity depends on time and on covariates, as in the Cox model. However, observed covariates are often insufficient to explain the heterogeneity in the data. We propose a mixture model for recurrent events that accounts for this unobserved heterogeneity and performs clustering of individuals, i.e. unsupervised classification that partitions the heterogeneous data according to unobserved (latent) variables. Within each cluster, the recurrent event process intensity is specified parametrically and is adjusted for covariates. Model parameters are estimated by maximum likelihood using the EM algorithm, and the BIC criterion is adopted to choose the optimal number of clusters. The feasibility of the model is checked on simulated data. Real data on hospital readmissions of elderly people, which motivated the development of the proposed clustering model, are analysed. The results allow a fine understanding of the recurrent event process in each cluster.
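Stripped to its simplest case, homogeneous intensities and no covariates, the estimation scheme described here reduces to an EM algorithm for a mixture of Poisson processes, clustering individuals by their event counts relative to exposure time. A minimal sketch under those simplifying assumptions (the paper's model is richer, with parametric time-varying intensities adjusted for covariates within each cluster):

```python
import numpy as np
from scipy.stats import poisson

def em_poisson_process_mixture(counts, exposure, K=2, n_iter=200, seed=0):
    """EM for a K-component mixture of homogeneous Poisson processes:
    individual i with exposure tau_i has N_i ~ Poisson(lambda_k * tau_i)
    in cluster k. Returns mixing weights, intensities, responsibilities."""
    counts = np.asarray(counts, float)
    exposure = np.asarray(exposure, float)
    rng = np.random.default_rng(seed)
    lam = rng.uniform(0.5, 2.0, K) * counts.sum() / exposure.sum()
    pi = np.full(K, 1.0 / K)
    for _ in range(n_iter):
        # E-step: posterior cluster probabilities (responsibilities)
        logp = (np.log(pi)[None, :]
                + poisson.logpmf(counts[:, None], lam[None, :] * exposure[:, None]))
        logp -= logp.max(axis=1, keepdims=True)
        r = np.exp(logp)
        r /= r.sum(axis=1, keepdims=True)
        # M-step: weighted maximum likelihood updates
        pi = r.mean(axis=0)
        lam = (r * counts[:, None]).sum(axis=0) / (r * exposure[:, None]).sum(axis=0)
    return pi, lam, r
```

Rerunning over a grid of K and keeping the fit that minimizes BIC mirrors the model-selection step mentioned in the abstract.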
{"title":"Clustering of recurrent events data.","authors":"G Babykina, V Vandewalle, J Carretero-Bravo","doi":"10.1080/02664763.2025.2452966","DOIUrl":"10.1080/02664763.2025.2452966","url":null,"abstract":"<p><p>Nowadays data are often timestamped, thus, when analysing the events which may occur several times (recurrent events), it is desirable to model the whole dynamics of the counting process rather than to focus on a total number of events. Such kind of data can be encountered in hospital readmissions, disease recurrences or repeated failures of industrial systems. Recurrent events can be analysed in the counting process framework, as in the Andersen-Gill model, assuming that the baseline intensity depends on time and on covariates, as in the Cox model. However, observed covariates are often insufficient to explain the observed heterogeneity in the data. We propose a mixture model for recurrent events, allowing to account for the unobserved heterogeneity and to perform clustering of individuals (unsupervised classification allowing to partition of the heterogeneous data according to unobserved, or latent, variables). Within each cluster, the recurrent event process intensity is specified parametrically and is adjusted for covariates. Model parameters are estimated by maximum likelihood using the EM algorithm; the BIC criterion is adopted to choose an optimal number of clusters. The model feasibility is checked on simulated data. Real data on hospital readmissions of elderly people, which motivated the development of the proposed clustering model, are analysed. The obtained results allow a fine understanding of the recurrent event process in each cluster.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2031-2059"},"PeriodicalIF":1.1,"publicationDate":"2025-01-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404095/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992763","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Upper quantile-based CUSUM-type control chart for detecting small changes in image data.
Pub Date: 2025-01-27 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2456614
Anik Roy, Partha Sarathi Mukherjee
Image monitoring is an important research problem with wide applications in various fields, including manufacturing, satellite imaging, and medical diagnostics. Traditional image monitoring control charts perform rather poorly when the changes occur in very small regions of the image and when the changes in image intensity values in those regions are small. Their performance gets worse if the images contain noise and the changes occur near the edges of image objects. In applications such as manufacturing, the changes in the images are often too small to be detected by the human eye. In this article, we propose a CUSUM-type control chart for online monitoring of grayscale images. Depending on what kind of changes we wish to detect, big or small, we propose to use a certain upper quantile of the local CUSUM statistics. We incorporate a state-of-the-art jump-preserving image smoothing technique in the proposed chart, which ensures good performance even in the presence of low to moderate noise. Theoretical justifications and superior performance in numerical comparisons indicate that the proposed control chart can be useful to many researchers and practitioners.
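The core recursion is easy to sketch: maintain a one-sided CUSUM statistic per pixel and plot an upper quantile of those statistics, so that a handful of drifting pixels can dominate the chart without being washed out by the image average. A minimal Python illustration with assumed in-control parameters and an illustrative threshold h (in practice h is calibrated by simulation to a target in-control run length; the paper's jump-preserving smoothing step is omitted):

```python
import numpy as np

def quantile_cusum_update(S, image, mu0, sigma0, k=0.5, q=0.999):
    """One monitoring step: update the pixelwise one-sided CUSUM statistics
    and return the upper q-quantile as the chart's plotting statistic."""
    z = (image - mu0) / sigma0              # standardized pixel deviations
    S = np.maximum(0.0, S + z - k)          # classic CUSUM recursion, per pixel
    return S, np.quantile(S, q)

# Simulated monitoring loop: a small 4x4 mean shift appears at frame 50.
rng = np.random.default_rng(2)
mu0, sigma0, h = 0.0, 1.0, 8.0
S = np.zeros((64, 64))
for frame in range(100):
    img = rng.normal(mu0, sigma0, (64, 64))
    if frame >= 50:
        img[20:24, 30:34] += 1.0            # change confined to a tiny region
    S, stat = quantile_cusum_update(S, img, mu0, sigma0)
    if stat > h:
        print("signal at frame", frame)
        break
```

Choosing q trades robustness for sensitivity, echoing the abstract: a more extreme quantile reacts to smaller affected regions but is more exposed to isolated noisy pixels.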
{"title":"Upper quantile-based CUSUM-type control chart for detecting small changes in image data.","authors":"Anik Roy, Partha Sarathi Mukherjee","doi":"10.1080/02664763.2025.2456614","DOIUrl":"10.1080/02664763.2025.2456614","url":null,"abstract":"<p><p>Image monitoring is an important research problem that has wide applications in various fields, including manufacturing industries, satellite imaging, medical diagnostics, and so forth. Traditional image monitoring control charts perform rather poorly when the changes occur at very small regions of the image, and when the changes of image intensity values are small in those regions. Their performances get worse if the images contain noise, and the changes occur near the edges of image objects. In applications such as manufacturing industries, the changes in the images are often too small to be detected by human eyes. In this article, we propose a CUSUM-type control chart for online monitoring of grayscale images. Depending on what kind of changes we wish to detect, big or small, we propose to use a certain upper quantile of the local CUSUM statistics. We incorporate a state-of-the-art jump preserving image smoothing technique in the proposed chart that ensures good performance even in presence of low to moderate noise. Theoretical justifications, and superior performance in numerical comparisons ensure that the proposed control chart can be useful to many researchers and practitioners.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2156-2171"},"PeriodicalIF":1.1,"publicationDate":"2025-01-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404064/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992685","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Derivation of a multivariate longitudinal causal effects model.
Pub Date: 2025-01-24 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2457013
Halima S Twabi, Samuel O M Manda, Dylan S Small, Hans-Peter Kohler
This paper presents a causal inference estimation method for longitudinal observational studies with multiple outcomes. The method uses marginal structural models with inverse probability of treatment weights (MSM-IPTWs). In developing the proposed method, we re-define the weights as a product of inverse weights at each time point, accounting for time-varying confounders and treatment exposures and for possible correlation between the multiple outcomes and within each outcome over time (serial correlation). The proposed method is evaluated by simulation studies and with an application estimating the effect of HIV positivity awareness on condom use and multiple sexual partnerships using the Malawi Longitudinal Study of Families and Health (MLSFH) data. The simulation study shows that the joint MSM-IPTW performs well, with coverage close to the nominal 95% level, for a large sample size (n = 1000) and moderate to strong between- and within-outcome correlation (ρ_j = 0.3, 0.75; ρ_k = 0.4, 0.8) when the effects are similar. The joint MSM-IPTW performed much the same as the adjusted standard joint model when the treatment effect estimate was the same for the outcomes. In the application, HIV positivity awareness increased condom use and did not affect the number of sexual partners. We recommend the proposed MSM-IPTWs to correctly control for time-varying treatments and confounders when estimating causal effects in longitudinal observational studies with multiple outcomes.
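The weight construction the abstract redefines builds on the standard stabilized IPTW for time-varying treatments: a product over time points of the ratio of treatment probabilities given treatment history only to probabilities given history plus confounders. A minimal sketch of that standard construction, assuming logistic propensity models, binary treatment observed at both levels at every time point, and illustrative array shapes (the joint outcome model across the multiple outcomes is omitted):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def stabilized_iptw(A, L):
    """Stabilized IPTW for a longitudinal study: A is (n, T) binary treatment,
    L is (n, T, p) time-varying confounders. The weight for subject i is
    prod_t P(A_t | past A) / P(A_t | past A, L_t)."""
    n, T = A.shape
    w = np.ones(n)
    for t in range(T):
        hist = A[:, :t] if t > 0 else np.zeros((n, 0))
        den_X = np.hstack([hist, L[:, t, :]])        # adds confounders
        if hist.shape[1]:                            # numerator: history only
            num_p = LogisticRegression().fit(hist, A[:, t]).predict_proba(hist)[:, 1]
        else:
            num_p = np.full(n, A[:, t].mean())       # marginal P(A_0 = 1)
        den_p = LogisticRegression().fit(den_X, A[:, t]).predict_proba(den_X)[:, 1]
        pA = np.where(A[:, t] == 1, num_p, 1 - num_p)
        pAL = np.where(A[:, t] == 1, den_p, 1 - den_p)
        w *= pA / pAL                                # accumulate over time points
    return w
```

With multiple correlated outcomes, these weights would then enter a joint weighted outcome model, which is where the paper's contribution lies.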
{"title":"Derivation of a multivariate longitudinal causal effects model.","authors":"Halima S Twabi, Samuel O M Manda, Dylan S Small, Hans-Peter Kohler","doi":"10.1080/02664763.2025.2457013","DOIUrl":"10.1080/02664763.2025.2457013","url":null,"abstract":"<p><p>This paper presents a causal inference estimation method for longitudinal observational studies with multiple outcomes. The method uses marginal structural models with inverse probability treatment weights (MSM-IPTWs). In developing the proposed method, we re-define the weights as a product of inverse weights at each time point, accounting for time-varying confounders and treatment exposures and possible correlation between and within (serial) the multiple outcomes. The proposed method is evaluated by simulation studies and with an application to estimate the effect of HIV positivity awareness on condom use and multiple sexual partners using the Malawi Longitudinal Study of Families and Health (MLSFH) data. The simulation study shows that the joint MSM-IPTW performs well with coverage within the expected 95% level for a large sample size (<i>n</i> = 1000) and moderate to strong between and within outcome correlation strength ( <math><msub><mi>ρ</mi> <mi>j</mi></msub> <mo>=</mo> <mn>0.3</mn></math> , 0.75, <math><msub><mi>ρ</mi> <mi>k</mi></msub> <mo>=</mo> <mn>0.4</mn></math> , 0.8) when the effects are similar. The joint MSM-IPTW performed relatively the same as the adjusted standard joint model when the treatment effect estimate was the same for the outcomes. In the application, HIV positivity awareness increased the usage of condoms and did not affect the number of sexual partners. We recommend using the proposed MSM-IPTWs to correctly control for time-varying treatment and confounders when estimating causal effects for longitudinal observational studies with multiple outcomes.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 12","pages":"2207-2225"},"PeriodicalIF":1.1,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12416008/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145029929","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Causal effect estimation for competing risk data in randomized trial: adjusting covariates to gain efficiency.
Pub Date: 2025-01-24 | eCollection Date: 2025-01-01 | DOI: 10.1080/02664763.2025.2455626
Youngjoo Cho, Cheng Zheng, Lihong Qi, Ross L Prentice, Mei-Jie Zhang
The double-blinded randomized trial is considered the gold standard for estimating the average causal effect (ACE). The naive estimator, which adjusts for no covariates, is consistent. However, incorporating covariates that are strong predictors of the outcome can reduce the imbalance in covariate distributions between the treated and control groups and improve efficiency. Recent work has shown that, thanks to randomization, in linear regression an estimator satisfying risk consistency (e.g. random forests) maintains the convergence rate for the regression coefficients even when a nonparametric model is assumed for the effect of covariates, and such an adjusted estimator always yields an efficiency gain over the naive unadjusted estimator. In this paper, we extend this result to the competing risks data setting and show that, under similar assumptions, the adjusted estimator based on augmented inverse probability of censoring weighting (AIPCW) attains the same convergence rate and efficiency gain. Extensive simulations demonstrate the efficiency gain in finite samples. To illustrate the proposed method, we apply it to the Women's Health Initiative (WHI) dietary modification trial, studying the effect of a low-fat diet on cardiovascular disease (CVD) related mortality among participants with prior CVD.
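The censoring-weighting ingredient can be sketched independently of the augmentation: estimate the censoring survival function G(t) = P(C > t) by Kaplan-Meier, treating censorings as the "events", then weight observed cause-1 events before a horizon tau by 1/G(T). A minimal Python sketch of this plain IPCW cumulative-incidence estimate, glossing over ties and left limits and omitting the augmentation term and covariate adjustment that give AIPCW its efficiency gain:

```python
import numpy as np

def km_censoring_survival(time, event):
    """Kaplan-Meier estimate of the censoring survival G(t) = P(C > t);
    event == 0 marks a censored subject (an 'event' of the censoring process)."""
    order = np.argsort(time)
    t_sorted, is_cens = time[order], (event[order] == 0)
    at_risk = np.arange(len(time), 0, -1)
    G_vals = np.cumprod(np.where(is_cens, 1.0 - 1.0 / at_risk, 1.0))
    def G(s):
        idx = np.searchsorted(t_sorted, s, side="right") - 1
        return np.where(idx < 0, 1.0, G_vals[np.maximum(idx, 0)])
    return G

def ipcw_cumulative_incidence(time, event, cause, tau):
    """IPCW estimate of the cause-specific cumulative incidence F_cause(tau):
    observed events of the target cause by tau, weighted by 1 / G(T)."""
    G = km_censoring_survival(time, event)
    hit = (time <= tau) & (event == cause)
    return float(np.mean(np.where(hit, 1.0 / np.maximum(G(time), 1e-12), 0.0)))

# ACE on the cumulative-incidence scale in a randomized trial with arms Z:
# ipcw_cumulative_incidence(time[Z == 1], event[Z == 1], 1, tau)
#   - ipcw_cumulative_incidence(time[Z == 0], event[Z == 0], 1, tau)
```

The AIPCW estimator studied in the paper adds an augmentation term built from working models for the at-risk process, which is what delivers the efficiency gain over this plain weighted estimate.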
{"title":"Causal effect estimation for competing risk data in randomized trial: adjusting covariates to gain efficiency.","authors":"Youngjoo Cho, Cheng Zheng, Lihong Qi, Ross L Prentice, Mei-Jie Zhang","doi":"10.1080/02664763.2025.2455626","DOIUrl":"10.1080/02664763.2025.2455626","url":null,"abstract":"<p><p>The double-blinded randomized trial is considered the gold standard to estimate the average causal effect (ACE). The naive estimator without adjusting any covariate is consistent. However, incorporating the covariates that are strong predictors of the outcome could reduce the issue of unbalanced covariate distribution between the treated and controlled groups and can improve efficiency. Recent work has shown that thanks to randomization, for linear regression, an estimator under risk consistency (e.g. Random Forest) for the regression coefficients could maintain the convergence rate even when a nonparametric model is assumed for the effect of covariates. Also, such an adjusted estimator will always lead to efficiency gain compared to the naive unadjusted estimator. In this paper, we extend this result to the competing risk data setting and show that under similar assumptions, the augmented inverse probability censoring weighting (AIPCW) based adjusted estimator has the same convergence rate and efficiency gain. Extensive simulations were performed to show the efficiency gain in the finite sample setting. To illustrate our proposed method, we apply it to the Women's Health Initiative (WHI) dietary modification trial studying the effect of a low-fat diet on cardiovascular disease (CVD) related mortality among those who have prior CVD.</p>","PeriodicalId":15239,"journal":{"name":"Journal of Applied Statistics","volume":"52 11","pages":"2094-2112"},"PeriodicalIF":1.1,"publicationDate":"2025-01-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12404078/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144992709","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}