DNA methylation (DNAm) is a key epigenetic modification, and datasets capturing DNAm are typically high-dimensional. Although dimension reduction (DR) techniques are commonly applied, it remains unclear how different DR methods perform specifically on DNAm data. In this study, we evaluate several DR techniques for reducing the dimensionality of DNAm datasets. We leveraged the DNAm dataset from the STRiDE (STratification of Risk of Diabetes in Early pregnancy) prospective study, which consists of 862,927 CpG sites from each of 258 pregnant women, categorized as Normal (n = 146) or GDM (n = 112). Epigenome-wide DNA methylation profiles from peripheral blood were quantified using the Infinium MethylationEPIC array. All stated DR techniques were applied, and the amount of information retained, local neighborhood preservation, and global structure preservation were assessed and compared using various statistical measures. Across Shannon entropy, local-neighborhood metrics (König's measure, Spearman's ρ, trustworthiness/continuity) and global-structure metrics (Kruskal stress, Sammon's stress, residual variance), MDS and PCA consistently achieved the best performance. PLS-DA trailed closely, while ISOMAP showed moderate results and UMAP performed worst, exhibiting higher entropy, lower correlation preservation, and greater distortion of both local and global structures.
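The local-neighborhood comparison described above can be sketched with scikit-learn's trustworthiness metric; this is an illustrative toy, not the study's pipeline, and the data, dimensions, and neighborhood size are invented:

```python
# Compare how well PCA, MDS and Isomap preserve local neighborhoods,
# using scikit-learn's trustworthiness metric on synthetic
# "methylation-like" data (beta-values in [0, 1]).
import numpy as np
from sklearn.decomposition import PCA
from sklearn.manifold import MDS, Isomap, trustworthiness

rng = np.random.default_rng(0)
X = rng.beta(2, 5, size=(60, 500))  # 60 samples x 500 simulated CpG beta-values

embeddings = {
    "PCA": PCA(n_components=2, random_state=0).fit_transform(X),
    "MDS": MDS(n_components=2, random_state=0).fit_transform(X),
    "Isomap": Isomap(n_components=2, n_neighbors=10).fit_transform(X),
}
# trustworthiness is in [0, 1]; higher means the 2-D map keeps original neighbors
scores = {name: trustworthiness(X, Y, n_neighbors=10) for name, Y in embeddings.items()}
for name, t in scores.items():
    print(f"{name}: trustworthiness = {t:.3f}")
```

On real EPIC-array data one would first filter or aggregate probes, since most DR implementations cannot ingest 862,927 features directly without substantial memory.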
{"title":"Performance evaluation of dimensionality reduction techniques on high-dimensional DNA methylation data.","authors":"Kuldeep Kumar Sharma, Kuppan Gokulakrishnan, Binu V S, Binukumar Bhaskarapillai, Chinnasamy Thirumoorthy, Ponnusamy Saravanan","doi":"10.1515/ijb-2025-0071","DOIUrl":"https://doi.org/10.1515/ijb-2025-0071","url":null,"abstract":"<p><p>DNA methylation (DNAm) is a key epigenetic modification, and datasets capturing DNAm are typically high-dimensional. Although dimension reduction (DR) techniques are commonly applied, it remains unclear how different DR methods perform specifically on the DNAm dataset. In this study, we aim to evaluate the performance of several DR techniques to determine their performance for reducing the dimensionality of DNAm datasets. We leveraged the DNAm dataset from the STRiDE (STratification of Risk of Diabetes in Early pregnancy) prospective study, which consists of 8,62,927 CpG sites each from 258 pregnant women. Women were categorized as Normal (<i>n</i> = 146); and GDM (<i>n</i> = 112). Epigenome-wide DNA methylation profiles from peripheral blood were quantified using an Infinium Methylation EPIC array. All stated DR techniques were performed, and the retained amount of information, local neighborhood preservation criteria, and global structure-holding approaches were assessed using various statistical measures and compared. Across Shannon entropy, local-neighborhood metrics (König's measure, Spearman's ρ, trustworthiness/continuity) and global-structure metrics (Kruskal stress, Sammon's stress, residual variance), MDS and PCA consistently achieved the best performance. 
PLS-DA trailed closely, while ISOMAP showed moderate results and UMAP performed worst, exhibiting higher entropy, lower correlation preservation, and greater distortion of both local and global structures.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":""},"PeriodicalIF":1.2,"publicationDate":"2026-02-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146214785","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Reuben Adatorwovor, Aurelien Latouche, Jason P Fine
Quantifying disease-specific survival in patients with competing risks is generally done when reliable cause of death (CoD) information is available. With known CoD, cause-specific hazard and cumulative incidence functions for competing risks data can be used to estimate disease-specific survival. When CoD is unreliable, unknown, or subject to misclassification, relative survival methods are used instead. This estimator, under the independent competing risks assumption, is the ratio of all-cause survival in the disease-specific cohort to the known expected survival of a general reference population. Disease-specific death competes with other causes of mortality, potentially creating interdependence among the CoD. The standard ratio estimate is valid only when death from disease and death from competing causes are independent. We relax this assumption by formulating the dependence between the times to disease-specific death and competing causes of mortality using a copula. We fit a nonparametric copula-based model for the distribution of disease-specific death which reduces to the ratio estimator under independence. This nonparametric method is robust compared with the previously proposed copula-based parametric method. We demonstrate the utility of our method through simulation studies and an application to French breast cancer registry data.
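A minimal sketch of the independence-based ratio estimator that this paper generalizes: divide the cohort's all-cause Kaplan-Meier curve by the expected survival of a matched general population. The follow-up data and the population survival curve below are entirely hypothetical:

```python
# Ratio (net survival) estimator under independent competing risks:
# S_net(t) = S_all(t) / S_pop(t).
import numpy as np

def kaplan_meier(time, event):
    """Return distinct event times and the Kaplan-Meier survival curve."""
    t_uniq = np.unique(time[event == 1])
    surv, s = [], 1.0
    for t in t_uniq:
        at_risk = np.sum(time >= t)
        deaths = np.sum((time == t) & (event == 1))
        s *= 1.0 - deaths / at_risk
        surv.append(s)
    return t_uniq, np.array(surv)

rng = np.random.default_rng(1)
t_obs = rng.exponential(5.0, 200)            # observed follow-up (years)
d_obs = (rng.random(200) < 0.7).astype(int)  # 1 = death from any cause

grid, S_all = kaplan_meier(t_obs, d_obs)
S_pop = np.exp(-0.05 * grid)                 # assumed known population survival
S_net = S_all / S_pop                        # ratio estimator of net survival
```

The paper's contribution is to replace the implicit independence assumption behind this ratio with an explicit copula linking the two latent failure times.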
{"title":"A nonparametric dependent competing risk method for net survival analysis.","authors":"Reuben Adatorwovor, Aurelien Latouche, Jason P Fine","doi":"10.1515/ijb-2024-0035","DOIUrl":"https://doi.org/10.1515/ijb-2024-0035","url":null,"abstract":"<p><p>Quantifying disease-specific survival in patients with competing risks is generally done when reliable cause of death (CoD) information is available. With known CoD, cause-specific and cumulative incidence functions for competing risk data are applicable in estimating disease-specific survival. When CoD is unreliable, unknown, or subject to misspecifications, relative survival methods are used for estimating disease-specific survival. This estimator, under the independent competing risks assumption, is the ratio of all-cause survival in the disease-specific cohort group to the known expected survival from a general reference population. The disease-specific death competes with other causes of mortality, potentially creating interdependence among the CoD. The standard ratio estimate is only valid when death from disease and death from competing causes are independent. We relaxed this assumption by formulating the dependence between the times to disease-specific death and competing causes of mortality using a copula. We fit a nonparametric copula-based approach to the distribution of disease-specific death which reduces to the ratio estimator under independence. This nonparametric method is robust compared to the previously proposed copula-based parametric method. 
We demonstrate the utility of our method through simulation studies and an application to French breast cancer registry data.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":""},"PeriodicalIF":1.2,"publicationDate":"2026-02-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"146203616","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In recent years, the growing availability of biomedical datasets featuring numerous longitudinal covariates has motivated the development of several multi-step methods for the dynamic prediction of survival outcomes. These methods employ either mixed-effects models or multivariate functional principal component analysis to model and summarize the longitudinal covariates' evolution over time. They then use Cox models or random survival forests to predict survival probabilities, with both baseline variables and the summaries of the longitudinal variables obtained in the previous modelling step as covariates. Because these multi-step methods are still quite new, little is yet known about their applicability, limitations, and predictive performance when applied to real-world data. To better understand these aspects, we benchmarked these multi-step methods (and two simpler prediction approaches) on three datasets that differ in sample size, number of longitudinal covariates and length of follow-up. We discuss the different modelling choices made by these methods, and some adjustments one may need to make in order to apply them to real-world data. Furthermore, we compare their predictive performance using multiple performance measures and landmark times, assess their computing time, and discuss their strengths and limitations.
{"title":"Benchmarking multi-step methods for the dynamic prediction of survival with numerous longitudinal predictors.","authors":"Signorelli Mirko, Sophie Retif","doi":"10.1515/ijb-2025-0049","DOIUrl":"https://doi.org/10.1515/ijb-2025-0049","url":null,"abstract":"<p><p>In recent years, the growing availability of biomedical datasets featuring numerous longitudinal covariates has motivated the development of several multi-step methods for the dynamic prediction of survival outcomes. These methods employ either mixed-effects models or multivariate functional principal component analysis to model and summarize the longitudinal covariates' evolution over time. Then, they use Cox models or random survival forests to predict survival probabilities, using as covariates both baseline variables and the summaries of the longitudinal variables obtained in the previous modelling step. Because these multi-step methods are still quite new, to date little is known about their applicability, limitations, and predictive performance when applied to real-world data. To gain a better understanding of these aspects, we performed a benchmarking of these multi-step methods (and two simpler prediction approaches) using three datasets that differ in sample size, number of longitudinal covariates and length of follow-up. We discuss the different modelling choices made by these methods, and some adjustments that one may need to do in order to be able to apply them to real-world data. 
Furthermore, we compare their predictive performance using multiple performance measures and landmark times, assess their computing time, and discuss their strengths and limitations.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":""},"PeriodicalIF":1.2,"publicationDate":"2025-12-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145794943","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
This paper considers semiparametric estimation strategies for the nonlinear semiparametric regression model (NSRM) under the sparsity assumption by modifying the Gauss-Newton method for both low- and high-dimensional data scenarios. In the low-dimensional case, coefficients are partitioned into two parts that represent nonzero (strong signals) and sparse coefficients. In the high-dimensional case, a weighted-ridge approach is employed, and coefficients are partitioned into three parts, adding weak signals as well. Shrinkage estimators are then obtained in both cases. More importantly, in this paper, we assume that a nonlinear structure is present in the parametric component of the model, which makes the direct application of penalized least squares to the NSRM impossible. To solve this problem, we employ the iterative Gauss-Newton method to obtain the final NSRM estimators. We provide both theoretical and practical details for the suggested estimators. Asymptotic results are derived for both low- and high-dimensional cases. We conduct an extensive simulation study to evaluate the performance of the estimators in a practical setting. Moreover, we substantiate our findings with data examples from two distinct breast cancer datasets: the Breast Cancer in the United States (BCUS) and Wisconsin datasets. By demonstrating the effectiveness of our introduced estimators in these particular biostatistical contexts, our numerical study provides support for the theoretical efficacy of shrinkage estimators, suggesting their potential relevance to breast cancer research and biostatistical methodologies.
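The Gauss-Newton backbone of the estimation strategy can be illustrated in a few lines: linearize the nonlinear mean function, solve a least-squares step, and iterate. The exponential mean function, starting values, and data are invented for this sketch:

```python
# Gauss-Newton iteration for nonlinear least squares with
# f(x; b) = b0 * exp(b1 * x)  (illustrative model, not the paper's NSRM).
import numpy as np

def gauss_newton(x, y, beta, n_iter=50):
    for _ in range(n_iter):
        f = beta[0] * np.exp(beta[1] * x)             # model prediction
        # Jacobian of f with respect to (b0, b1)
        J = np.column_stack([np.exp(beta[1] * x),
                             beta[0] * x * np.exp(beta[1] * x)])
        r = y - f                                     # residuals
        step, *_ = np.linalg.lstsq(J, r, rcond=None)  # linearized LS update
        beta = beta + step
        if np.linalg.norm(step) < 1e-10:              # converged
            break
    return beta

rng = np.random.default_rng(2)
x = np.linspace(0.0, 2.0, 100)
y = 1.5 * np.exp(0.8 * x) + rng.normal(0.0, 0.05, x.size)  # truth: (1.5, 0.8)
beta_hat = gauss_newton(x, y, beta=np.array([1.0, 0.5]))
```

The paper wraps such iterations around penalized and shrinkage estimators; this sketch only shows why a nonlinear parametric component rules out a direct penalized-least-squares solution.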
{"title":"Post-shrinkage strategies for nonlinear semiparametric regression models in low and high-dimensional settings.","authors":"S Ejaz Ahmed, Dursun Aydın, Ersin Yılmaz","doi":"10.1515/ijb-2024-0011","DOIUrl":"https://doi.org/10.1515/ijb-2024-0011","url":null,"abstract":"<p><p>This paper considers semiparametric estimation strategies for the nonlinear semiparametric regression model (NSRM) under the sparsity assumption by modifying the Gauss-Newton method for both low- and high-dimensional data scenarios. In the low-dimensional case, coefficients are partitioned into two parts that represent nonzero (strong signals) and sparse coefficients. In the high-dimensional case, a weighted-ridge approach is employed, and coefficients are partitioned into three parts, adding weak signals as well. Shrinkage estimators are then obtained in both cases. More importantly, in this paper, we assume that a nonlinear structure is present in the parametric component of the model, which makes the direct application of penalized least squares to the NSRM impossible. To solve this problem, we employ the iterative Gauss-Newton method to obtain the final NSRM estimators. We provide both theoretical and practical details for the suggested estimators. Asymptotic results are derived for both low- and high-dimensional cases. We conduct an extensive simulation study to evaluate the performance of the estimators in a practical setting. Moreover, we substantiate our findings with data examples from two distinct breast cancer datasets: the Breast Cancer in the United States (BCUS) and Wisconsin datasets. 
By demonstrating the effectiveness of our introduced estimators in these particular biostatistical contexts, our numerical study provides support for the theoretical efficacy of shrinkage estimators, suggesting their potential relevance to breast cancer research and biostatistical methodologies.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":""},"PeriodicalIF":1.2,"publicationDate":"2025-11-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145558461","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information borrowing from historical data is gaining increasing attention in clinical trials for rare and pediatric diseases, where small sample sizes may lead to insufficient statistical power for confirming efficacy. While Bayesian information borrowing methods are well established, recent frequentist approaches, such as the test-then-pool and equivalence-based test-then-pool methods, have been proposed to determine whether historical data should be incorporated into statistical hypothesis testing. Depending on the outcome of these hypothesis tests, historical data may or may not be utilized. This paper introduces a dynamic borrowing method that leverages historical information based on the similarity between current and historical data. Similar to Bayesian dynamic borrowing, our proposed method adjusts the degree of information borrowing dynamically, ranging from 0 % to 100 %. We present two approaches to measuring similarity: one using the density function of the t-distribution and the other employing a logistic function. The performance of the proposed methods is evaluated through Monte Carlo simulations. Additionally, we demonstrate the utility of dynamic information borrowing by reanalyzing data from an actual clinical trial.
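The logistic-similarity variant can be sketched in miniature: down-weight the historical data as the standardized difference between current and historical estimates grows. The slope and intercept constants and the standardized-difference form below are hypothetical choices, not the paper's calibration:

```python
# Dynamic borrowing fraction from a logistic similarity function:
# 1 = pool history fully, 0 = ignore it.
import math

def borrow_weight(mean_cur, mean_hist, se_diff, a=4.0, b=8.0):
    """Borrowing fraction in (0, 1), decreasing in the standardized difference."""
    z = abs(mean_cur - mean_hist) / se_diff      # standardized difference
    return 1.0 / (1.0 + math.exp(-(a - b * z)))  # logistic in z (a, b hypothetical)

# Similar data -> weight near 1; discrepant data -> weight near 0.
w_similar = borrow_weight(0.52, 0.50, se_diff=0.10)
w_far = borrow_weight(0.80, 0.50, se_diff=0.10)
```

The resulting weight would then scale the historical contribution in the pooled test statistic, mirroring how Bayesian dynamic borrowing discounts a prior.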
{"title":"DBMS: dynamic borrowing method for frequentist hybrid control designs based on historical-current data similarity.","authors":"Masahiro Kojima","doi":"10.1515/ijb-2024-0051","DOIUrl":"https://doi.org/10.1515/ijb-2024-0051","url":null,"abstract":"<p><p>Information borrowing from historical data is gaining increasing attention in clinical trials for rare and pediatric diseases, where small sample sizes may lead to insufficient statistical power for confirming efficacy. While Bayesian information borrowing methods are well established, recent frequentist approaches, such as the test-then-pool and equivalence-based test-then-pool methods, have been proposed to determine whether historical data should be incorporated into statistical hypothesis testing. Depending on the outcome of these hypothesis tests, historical data may or may not be utilized. This paper introduces a dynamic borrowing method for leveraging historical information based on the similarity between current and historical data. Similar to Bayesian dynamic borrowing, our proposed method adjusts the degree of information borrowing dynamically, ranging from 0 to 100 %. We present two approaches to measure similarity: one using the density function of the t-distribution and the other employing a logistic function. The performance of the proposed methods is evaluated through Monte Carlo simulations. 
Additionally, we demonstrate the utility of dynamic information borrowing by reanalyzing data from an actual clinical trial.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":""},"PeriodicalIF":1.2,"publicationDate":"2025-11-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145459898","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-03. eCollection Date: 2025-11-01. DOI: 10.1515/ijb-2025-0011
Jesús Gutiérrez-Botella, Carmen Armero, Thomas Kneib, María P Pata, Javier García-Seara
Competing risks models are survival models with several events of interest acting in competition and whose occurrence is only observed for the event that occurs first in time. This paper presents a Bayesian approach to these models in which the issue of model selection is treated in a special way by proposing generalizations of some of the Bayesian procedures used in univariate survival analysis. This research is motivated by a study on the survival of patients with heart failure undergoing cardiac resynchronization therapy, a procedure which involves the implant of a device to stabilize the heartbeat. Two different causes of death have been considered: cardiovascular and non-cardiovascular, and a set of baseline covariates are examined in order to better understand their relationship with both causes of death. Model selection, model checking, and model comparison procedures have been implemented and assessed. The posterior distribution of some relevant outputs such as the overall survival function, cumulative incidence functions, and transition probabilities have been computed and discussed.
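As background for the cumulative incidence functions discussed above, a nonparametric (Aalen-Johansen-type) estimate for two competing causes can be computed as follows; the data are simulated, not the heart-failure cohort, and the paper's Bayesian machinery is not shown:

```python
# Nonparametric cumulative incidence for two competing causes of death.
import numpy as np

def cumulative_incidence(time, cause, grid):
    """cause: 0 = censored, 1 or 2 = competing causes; returns CIF_1, CIF_2 on grid."""
    t_uniq = np.unique(time[cause > 0])
    s_prev, c1, c2, steps = 1.0, 0.0, 0.0, []
    for t in t_uniq:
        at_risk = np.sum(time >= t)
        d1 = np.sum((time == t) & (cause == 1))
        d2 = np.sum((time == t) & (cause == 2))
        c1 += s_prev * d1 / at_risk          # probability mass for cause 1 at t
        c2 += s_prev * d2 / at_risk
        s_prev *= 1.0 - (d1 + d2) / at_risk  # overall survival just after t
        steps.append((t, c1, c2))
    # step-function lookup on the requested grid
    cif1 = np.array([next((a for u, a, _ in reversed(steps) if u <= g), 0.0) for g in grid])
    cif2 = np.array([next((b for u, _, b in reversed(steps) if u <= g), 0.0) for g in grid])
    return cif1, cif2

rng = np.random.default_rng(3)
t = rng.exponential(4.0, 300)                             # follow-up times
cause = rng.choice([0, 1, 2], size=300, p=[0.2, 0.5, 0.3])  # 0 = censored
grid = np.linspace(0.5, 10.0, 20)
cif1, cif2 = cumulative_incidence(t, cause, grid)
```

The Bayesian approach in the paper instead yields posterior distributions of these curves, with covariate effects per cause; the frequentist point estimate above is only the quantity being modeled.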
{"title":"Bayesian competing risks survival modeling for assessing the cause of death of patients with heart failure.","authors":"Jesús Gutiérrez-Botella, Carmen Armero, Thomas Kneib, María P Pata, Javier García-Seara","doi":"10.1515/ijb-2025-0011","DOIUrl":"10.1515/ijb-2025-0011","url":null,"abstract":"<p><p>Competing risks models are survival models with several events of interest acting in competition and whose occurrence is only observed for the event that occurs first in time. This paper presents a Bayesian approach to these models in which the issue of model selection is treated in a special way by proposing generalizations of some of the Bayesian procedures used in univariate survival analysis. This research is motivated by a study on the survival of patients with heart failure undergoing cardiac resynchronization therapy, a procedure which involves the implant of a device to stabilize the heartbeat. Two different causes of death have been considered: cardiovascular and non-cardiovascular, and a set of baseline covariates are examined in order to better understand their relationship with both causes of death. Model selection, model checking, and model comparison procedures have been implemented and assessed. 
The posterior distribution of some relevant outputs such as the overall survival function, cumulative incidence functions, and transition probabilities have been computed and discussed.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":"439-461"},"PeriodicalIF":1.2,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145423494","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-11-03. eCollection Date: 2025-11-01. DOI: 10.1515/ijb-2025-0016
Pablo Martínez-Camblor, Sonia Pérez-Fernández
The binary classification problem (BCP) aims to correctly allocate subjects to one of two possible groups, frequently defined by the presence or absence of a characteristic of interest. With this goal, we are allowed to use different types of information. A huge number of methods deal with this problem, including standard binary regression models and complex machine learning techniques such as support vector machines, boosting, or the perceptron, among others. When this information is summarized in a continuous score, we have to define classification regions (or subsets) which determine whether subjects are classified as positive, with the characteristic under study, or as negative, otherwise. The standard (or regular) receiver-operating characteristic (ROC) curve assumes that higher values of the marker are associated with higher probabilities of being positive, considers as positive those patients with values within the intervals [c, ∞) (c ∈ ℝ), and plots the true-positive against the false-positive rates (sensitivity against one minus specificity) for all potential c. The so-called generalized ROC curve, gROC, allows both higher and lower values of the score to be associated with higher probabilities of being positive. The efficient ROC curve, eROC, considers the best ROC curve based on a transformation of the score. In this manuscript, we are interested in studying, comparing and approximating the transformations leading to the eROC and the gROC curves. We prove that, when the optimal transformation has no relative maximum, both curves are equivalent. Besides, we investigate the use of the gROC curve on some theoretical models, explore the relationship between the gROC and eROC curves, and propose two non-parametric procedures for approximating the transformation leading to the gROC curve. The finite-sample behavior of the proposed estimators is explored through Monte Carlo simulations. Two real-data sets illustrate the practical use of the proposed methods.
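The contrast between the regular one-sided rule and a two-sided gROC-style rule can be made concrete on simulated scores; the symmetric cut-off and the mixture model for the cases are illustrative, not from the paper:

```python
# When cases sit in BOTH tails of the marker, the one-sided ROC rule
# (positive if score >= c) is useless, while a two-sided rule
# (positive if |score| >= c) classifies well.
import numpy as np

rng = np.random.default_rng(4)
neg = rng.normal(0.0, 1.0, 500)                    # controls around 0
pos = np.concatenate([rng.normal(-3.0, 1.0, 250),  # cases in the lower tail
                      rng.normal(3.0, 1.0, 250)])  # and in the upper tail

def auc_one_sided(neg, pos):
    """Empirical AUC of the standard rule, in its Mann-Whitney form."""
    return np.mean(pos[:, None] > neg[None, :])

def two_sided_rule(neg, pos, c=1.5):
    """Sensitivity and specificity of the symmetric rule |score| >= c."""
    sens = np.mean(np.abs(pos) >= c)
    spec = np.mean(np.abs(neg) < c)
    return sens, spec

auc_std = auc_one_sided(neg, pos)    # near 0.5 here: one-sided rule fails
sens, spec = two_sided_rule(neg, pos)  # two-sided rule recovers accuracy
```

The gROC curve generalizes this by optimizing over such interval-based classification regions; the eROC instead transforms the score before applying the usual one-sided rule.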
{"title":"The gROC curve and the optimal classification.","authors":"Pablo Martínez-Camblor, Sonia Pérez-Fernández","doi":"10.1515/ijb-2025-0016","DOIUrl":"10.1515/ijb-2025-0016","url":null,"abstract":"<p><p>The binary classification problem (BCP) aims to correctly allocate subjects in one of two possible groups. The groups are frequently defined as having or not one characteristic of interest. With this goal, we are allowed to use different types of information. There is a huge number of methods dealing with this problem; including standard binary regression models, or complex machine learning techniques such as support vector machine, boosting, or perceptron, among others. When this information is summarized in a continuous score, we have to define classification regions (or subsets) which will determine whether the subjects are classified as positive, with the characteristic under study, or as negative, otherwise. The standard (or regular) receiver-operating characteristic (ROC) curve assumes that higher values of the marker are associated with higher probabilities of being positive and considers as positive those patients with values within the intervals [<i>c</i>, ∞) <math><mrow><mo>(</mo> <mrow><mi>c</mi> <mo>∈</mo> <mi>R</mi></mrow> <mo>)</mo></mrow> </math> , and plots the true- against the false- positive rates (sensitivity against one minus specificity) for all potential <i>c</i>. The so-called generalized ROC curve, gROC, allows that both higher and lower values of the score are associated with higher probabilities of being positive. The efficient ROC curve, eROC, considers the best ROC curve based on a transformation of the score. In this manuscript, we are interested in studying, comparing and approximating the transformations leading to the eROC and to the gROC curves. We will prove that, when the optimal transformation does not have relative maximum, both curves are equivalent. 
Besides, we investigate the use of the gROC curve on some theoretical models, explore the relationship between the gROC and the eROC curves, and propose two non-parametric procedures for approximating the transformation leading to the gROC curve. The finite-sample behavior of the proposed estimators is explored through Monte Carlo simulations. Two real-data sets illustrate the practical use of the proposed methods.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":"255-270"},"PeriodicalIF":1.2,"publicationDate":"2025-11-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145423453","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-31. eCollection Date: 2025-11-01. DOI: 10.1515/ijb-2023-0131
Tianmin Wu, Ao Yuan, Ming Tan
In observational studies, the treatment assignment is typically not random. Even in randomized clinical trials, the randomization may be imperfect given the limitation of sample size. In these cases, traditional statistical methods may lead to biased estimates of treatment effects, and causal inference methods are needed to obtain unbiased estimates. The doubly robust estimator (DRE) is a recent development in causal inference, but the literature on DRE for survival data is very limited, and existing methods tend to have complicated forms and may not have double robustness in the original sense. Some are constructed based on the Nelson-Aalen estimator, and to our knowledge no DRE is constructed based on the Kaplan-Meier estimator. Furthermore, in these methods, the propensity score model is often subjectively specified with a logistic model. DRE can be seriously biased if the propensity score and outcome models are slightly misspecified. Here we propose a new semiparametric robust estimator that utilizes the Kaplan-Meier estimator and Stute weighted empirical form to address these issues. Our proposed estimator is not only doubly robust in the original sense but also enhances robustness with the use of semiparametric specification. The asymptotic properties of the proposed estimator are derived, and extensive simulation studies are conducted to evaluate its finite sample performance and compare it with existing methods. Finally, we apply our proposed method to a real clinical study.
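For intuition only, the double-robustness idea can be shown with the standard AIPW estimator for an uncensored outcome; the survival version the paper develops is more involved, and the working models below are hardcoded to their true values rather than fitted:

```python
# Augmented inverse-probability-weighted (AIPW) estimate of an average
# treatment effect: consistent if EITHER the outcome models (m1, m0)
# OR the propensity model (e) is correctly specified.
import numpy as np

rng = np.random.default_rng(5)
n = 5000
x = rng.normal(size=n)                       # confounder
p = 1.0 / (1.0 + np.exp(-x))                 # true propensity score
a = (rng.random(n) < p).astype(float)        # treatment assignment
y = 2.0 * a + x + rng.normal(size=n)         # outcome; true effect = 2

# Working models (set to the truth for this sketch; normally fitted by regression)
m1 = 2.0 + x   # outcome model under treatment
m0 = x         # outcome model under control
e = p          # propensity model

ate_aipw = (np.mean(a * (y - m1) / e + m1)
            - np.mean((1 - a) * (y - m0) / (1 - e) + m0))
```

The paper's estimator transplants this augmentation structure to right-censored data via the Kaplan-Meier estimator and Stute weights, and leaves the propensity model semiparametric instead of assuming a logistic form.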
{"title":"Enhanced doubly robust estimate with semiparametric models for causal inference of survival outcome.","authors":"Tianmin Wu, Ao Yuan, Ming Tan","doi":"10.1515/ijb-2023-0131","DOIUrl":"10.1515/ijb-2023-0131","url":null,"abstract":"<p><p>In observational studies, the treatment assignment is typically not random. Even in randomized clinical trials, the randomization may be imperfect given the limitation of sample size. In these cases, traditional statistical methods may lead to biased estimates of treatment effects, and causal inference methods are needed to obtain unbiased estimates. The doubly robust estimator (DRE) is a recent development in causal inference, but the literature on DRE for survival data is very limited, and existing methods tend to have complicated forms and may not have double robustness in the original sense. Some are constructed based on the Nelson-Aalen estimator, and to our knowledge no DRE is constructed based on the Kaplan-Meier estimator. Furthermore, in these methods, the propensity score model is often subjectively specified with a logistic model. DRE can be seriously biased if the propensity score and outcome models are slightly misspecified. Here we propose a new semiparametric robust estimator that utilizes the Kaplan-Meier estimator and Stute weighted empirical form to address these issues. Our proposed estimator is not only doubly robust in the original sense but also enhances robustness with the use of semiparametric specification. The asymptotic properties of the proposed estimator are derived, and extensive simulation studies are conducted to evaluate its finite sample performance and compare it with existing methods. 
Finally, we apply our proposed method to a real clinical study.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":"285-298"},"PeriodicalIF":1.2,"publicationDate":"2025-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145410580","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2025-10-09. eCollection Date: 2025-11-01. DOI: 10.1515/ijb-2025-0038
Shuying Wang, Danping Zhou, Yunfei Yang, Bo Zhao
Traditional survival analysis typically assumes that all subjects will eventually experience the event of interest given a sufficiently long follow-up period. Nevertheless, due to advancements in medical technology, researchers now frequently observe that some subjects never experience the event and are considered cured. Furthermore, traditional survival analysis assumes independence between failure time and censoring time. However, practical applications often reveal dependence between them. Ignoring both the cured subgroup and this dependence structure can introduce bias in model estimates. Among the methods for handling dependent censoring data, the numerical integration process of frailty models is complex and sensitive to the assumptions about the latent variable distribution. In contrast, the copula method, by flexibly modeling the dependence between variables, avoids strong assumptions about the latent variable structure, offering greater robustness and computational feasibility. Therefore, this paper proposes a copula-based method to handle dependent current status data involving a cure fraction. In the modeling process, we establish a logistic model to describe the susceptible rate and a Cox proportional hazards model to describe the failure time and censoring time. In the estimation process, we employ a sieve maximum likelihood estimation method based on Bernstein polynomials for parameter estimation. Extensive simulation experiments show that the proposed method demonstrates consistency and asymptotic efficiency under various settings. Finally, this paper applies the method to lymph follicle cell data, verifying its effectiveness in practical data analysis.
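The Bernstein-polynomial building block of the sieve estimator can be sketched as follows; the degree and coefficients are illustrative, the point being that nonnegative, ordered coefficients yield a monotone approximant suitable for a distribution or cumulative hazard function:

```python
# Bernstein basis of degree m on [0, 1]; combinations with nondecreasing
# nonnegative coefficients are themselves nondecreasing functions.
import numpy as np
from math import comb

def bernstein_basis(t, m):
    """Return the (m+1) Bernstein basis polynomials of degree m evaluated at t."""
    t = np.asarray(t, dtype=float)
    return np.column_stack([comb(m, k) * t**k * (1 - t)**(m - k)
                            for k in range(m + 1)])

t = np.linspace(0.0, 1.0, 101)
B = bernstein_basis(t, m=5)          # 101 x 6 design matrix

rng = np.random.default_rng(6)
coef = np.sort(np.abs(rng.normal(size=6)))  # nonnegative, increasing coefficients
approx = B @ coef                           # nondecreasing curve on [0, 1]
```

In the sieve MLE, such coefficients become parameters estimated jointly with the regression effects, with the monotonicity constraint keeping the estimated cumulative functions valid.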
{"title":"Copula-based Cox models for dependent current status data with a cure fraction.","authors":"Shuying Wang, Danping Zhou, Yunfei Yang, Bo Zhao","doi":"10.1515/ijb-2025-0038","DOIUrl":"10.1515/ijb-2025-0038","url":null,"abstract":"<p><p>Traditional survival analysis typically assumes that all subjects will eventually experience the event of interest given a sufficiently long follow-up period. Nevertheless, due to advancements in medical technology, researchers now frequently observe that some subjects never experience the event and are considered cured. Furthermore, traditional survival analysis assumes independence between failure time and censoring time. However, practical applications often reveal dependence between them. Ignoring both the cured subgroup and this dependence structure can introduce bias in model estimates. Among the methods for handling dependent censoring data, the numerical integration process of frailty models is complex and sensitive to the assumptions about the latent variable distribution. In contrast, the copula method, by flexibly modeling the dependence between variables, avoids strong assumptions about the latent variable structure, offering greater robustness and computational feasibility. Therefore, this paper proposes a copula-based method to handle dependent current status data involving a cure fraction. In the modeling process, we establish a logistic model to describe the susceptible rate and a Cox proportional hazards model to describe the failure time and censoring time. In the estimation process, we employ a sieve maximum likelihood estimation method based on Bernstein polynomials for parameter estimation. Extensive simulation experiments show that the proposed method demonstrates consistency and asymptotic efficiency under various settings. 
Finally, this paper applies the method to lymph follicle cell data, verifying its effectiveness in practical data analysis.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":"385-409"},"PeriodicalIF":1.2,"publicationDate":"2025-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145253416","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
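The data structure described in the abstract above can be illustrated with a small simulation. This is a minimal, hypothetical sketch (not the authors' code): it generates dependent current status data with a cure fraction, substituting a Clayton copula for the paper's general copula and exponential margins for the Cox models; the rates 0.5 and 0.4, the dependence parameter theta, and the cure probability are all illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, theta, cure_prob = 1000, 2.0, 0.3  # theta > 0: Clayton dependence strength

# Sample (U, V) from a Clayton copula by conditional inversion:
# given U = u and W ~ Uniform(0,1), invert the conditional CDF of V.
u = rng.uniform(size=n)
w = rng.uniform(size=n)
v = ((w ** (-theta / (1.0 + theta)) - 1.0) * u ** (-theta) + 1.0) ** (-1.0 / theta)

# Exponential margins: failure time T and censoring (observation) time C,
# now dependent through the copula rather than independent.
T = -np.log(1.0 - u) / 0.5
C = -np.log(1.0 - v) / 0.4

# Cure fraction: cured subjects never experience the event.
cured = rng.uniform(size=n) < cure_prob
# Current status data: at time C we only learn whether failure has occurred.
delta = (~cured) & (T <= C)
```

Because only the indicator `delta` and the observation time `C` are recorded, each subject contributes a single inspection, which is exactly the current status setting the paper addresses.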
Pub Date : 2025-09-23eCollection Date: 2025-11-01DOI: 10.1515/ijb-2025-0065
Hoa Pham, Huong T T Pham, Kai Siong Yow
Multi-stage models for cohort data are widely used in various fields, including disease progression, the biological development of plants and animals, and laboratory studies of life cycle development. However, the likelihood functions of these models are often intractable and complex. These complexities in the likelihood functions frequently result in significant biases and high computational costs when estimating parameters using current Bayesian methods. This paper aims to address these challenges by applying the enhanced Sequential Monte Carlo approximate Bayesian computation (ABC-SMC) method, which does not rely on explicit likelihood functions, to stage-structured development models with non-hazard rates and stage-wise constant hazard rates. Instead of using a likelihood function, the proposed method determines parameter estimates by matching vector summary statistics. It incorporates stage-wise parameter estimation and retains accepted parameters across stages. This approach not only reduces model biases but also improves the computational efficiency of parameter estimation, despite the computational intractability of the likelihood functions. The proposed ABC-SMC method is validated through simulation studies on stage-structured development models and applied to a case study of breast development in New Zealand schoolgirls. The results demonstrate that the proposed methods effectively reduce biases in later-stage estimates for stage-structured models, enhance computational efficiency, and maintain accuracy and reliability in parameter estimation compared with current methods.
{"title":"An enhanced approximate Bayesian computation method for stage-structured development models.","authors":"Hoa Pham, Huong T T Pham, Kai Siong Yow","doi":"10.1515/ijb-2025-0065","DOIUrl":"10.1515/ijb-2025-0065","url":null,"abstract":"<p><p>Multi-stage models for cohort data are widely used in various fields, including disease progression, the biological development of plants and animals, and laboratory studies of life cycle development. However, the likelihood functions of these models are often intractable and complex. These complexities in the likelihood functions frequently result in significant biases and high computational costs when estimating parameters using current Bayesian methods. This paper aims to address these challenges by applying the enhanced Sequential Monte Carlo approximate Bayesian computation (ABC-SMC) method, which does not rely on explicit likelihood functions, to stage-structured development models with non-hazard rates and stage-wise constant hazard rates. Instead of using a likelihood function, the proposed method determines parameter estimates based on matching vector summary statistics. It incorporates stage-wise parameter estimations and retains accepted parameters across stages. This approach not only reduces model biases but also improves the computational efficiency of parameter estimations, despite the computational intractability of the likelihood functions. The proposed ABC-SMC method is validated through simulation studies on stage-structured development models and applied to a case study of breast development in New Zealand schoolgirls. 
The results demonstrate that the proposed methods effectively reduce biases in later-stage estimates for stage-structured models, enhance computational efficiency, and maintain accuracy and reliability in parameter estimations compared to the current methods.</p>","PeriodicalId":50333,"journal":{"name":"International Journal of Biostatistics","volume":" ","pages":"423-437"},"PeriodicalIF":1.2,"publicationDate":"2025-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145126234","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":4,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
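The likelihood-free idea at the core of ABC-SMC — accepting parameter draws whose simulated summary statistics match the observed ones — can be sketched with plain rejection ABC. This is a simplified, hypothetical one-stage example (a single stage duration with a constant hazard rate; not the authors' stage-wise ABC-SMC implementation), and the prior range, sample sizes, and 1 % acceptance quantile are illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(1)

# "Observed" cohort: 200 stage durations from a constant hazard rate of 1.5.
observed = rng.exponential(1.0 / 1.5, size=200)
s_obs = np.array([observed.mean(), observed.std()])  # vector summary statistics

def simulate(rate, n=200):
    """Simulate a cohort under a candidate hazard rate; return its summaries."""
    x = rng.exponential(1.0 / rate, size=n)
    return np.array([x.mean(), x.std()])

# Rejection ABC: draw rates from the prior, keep those whose simulated
# summaries fall closest to the observed summaries (no likelihood evaluated).
draws = rng.uniform(0.1, 5.0, size=20000)  # uniform prior over the hazard rate
dist = np.array([np.linalg.norm(simulate(r) - s_obs) for r in draws])
accepted = draws[dist < np.quantile(dist, 0.01)]  # keep the closest ~1 %

print(accepted.mean())  # approximate posterior mean of the hazard rate
```

ABC-SMC refines this scheme by propagating the accepted draws through a sequence of shrinking tolerances instead of a single cutoff, which is what makes the stage-wise retention of accepted parameters described in the abstract computationally practical.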