In this anniversary issue I briefly review some work on the notion of collapsibility and indicate some lingering questions.
High-dimensional data with left-censored responses are increasingly common in modern applications, yet existing methods for analyzing them are limited. Classical Tobit models cannot capture nonlinear relationships or perform high-dimensional variable selection, whereas deep learning approaches often prioritize prediction performance but lack selection and interpretation capabilities. To address this gap, we propose an integrated deep learning framework, the Deep Tobit model, which employs the negative Tobit log-likelihood as its loss function to properly account for data censoring. A two-stage feature selection algorithm is further developed, with theoretical guarantees on convergence rate and selection consistency. Extensive simulation studies and real-data applications to left-censored aero-engine casing vibration data and HIV viral load data demonstrate that the proposed framework outperforms several state-of-the-art baselines in both variable selection and prediction accuracy.
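For concreteness, the Tobit likelihood separates uncensored observations, which contribute a Gaussian density term, from left-censored ones, which contribute the probability that the latent response falls below the threshold. The sketch below is a minimal illustration of such a loss, assuming a fixed censoring threshold and Gaussian errors; in the Deep Tobit model the mean would be produced by a neural network, and all names here are illustrative rather than the paper's.

```python
# Minimal sketch of a negative Tobit log-likelihood for left-censored data,
# assuming a fixed censoring threshold `c` and Gaussian errors.  In the Deep
# Tobit model the mean `mu` would be the output of a neural network; here it
# is just an array.  Names (tobit_nll, c, sigma) are illustrative.
import numpy as np
from scipy.stats import norm

def tobit_nll(y, mu, sigma, c):
    """Negative Tobit log-likelihood.

    y     : observed responses, equal to c where left-censored
    mu    : model predictions (e.g. network outputs)
    sigma : error scale (> 0)
    c     : left-censoring threshold
    """
    y, mu = np.asarray(y, float), np.asarray(mu, float)
    censored = y <= c                                    # left-censored indicator
    # Uncensored part: Gaussian log-density of the observed response
    ll_obs = norm.logpdf(y[~censored], loc=mu[~censored], scale=sigma)
    # Censored part: log-probability that the latent response falls below c
    ll_cens = norm.logcdf((c - mu[censored]) / sigma)
    return -(ll_obs.sum() + ll_cens.sum())

# Example: predictions from any regression model can be plugged in as `mu`
rng = np.random.default_rng(0)
y_latent = 1.0 + rng.normal(size=200)
y = np.maximum(y_latent, 0.0)                            # left-censoring at c = 0
print(tobit_nll(y, mu=np.ones(200), sigma=1.0, c=0.0))
```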
When comparing multiple groups in clinical trials, we are not only interested in whether there is a difference between any groups but rather in where the difference lies. Such research questions lead to testing multiple individual hypotheses. To control the familywise error rate (FWER), we must apply corrections or use tests that control the FWER by design. For time-to-event data, a Bonferroni-corrected log-rank test is commonly used. This approach has two significant drawbacks: (i) it loses power when the proportional hazards assumption is violated, and (ii) the correction generally leads to lower power, especially when the test statistics are not independent. We propose two new tests based on combined weighted log-rank tests. One is a simple multiple contrast test of weighted log-rank tests; the other extends the so-called CASANOVA test, originally introduced for factorial designs, into a new multiple contrast test. Our tests show promise of being more powerful under crossing hazards and eliminate the need for additional p-value correction. We assess their performance through extensive Monte Carlo simulation studies covering both proportional and non-proportional hazard scenarios. Finally, we apply the new and reference methods to a real-world data example. The new approaches control the FWER and show reasonable power in all scenarios, outperforming the adjusted approaches in terms of power in some non-proportional settings.
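As background, each component of such a combined test is a weighted log-rank statistic; a standard two-sample building block is the Fleming-Harrington G(rho, gamma) statistic, sketched below. This is a generic illustration of that building block, not the paper's multiple contrast construction, and the function and variable names are hypothetical.

```python
# Minimal sketch of a two-sample Fleming-Harrington weighted log-rank statistic.
# Ties are handled with the usual hypergeometric variance; names are illustrative.
import numpy as np

def weighted_logrank(time, event, group, rho=0.0, gamma=0.0):
    """Standardized G(rho, gamma) weighted log-rank statistic (group 1 vs 0)."""
    time, event, group = map(np.asarray, (time, event, group))
    order = np.argsort(time)
    time, event, group = time[order], event[order], group[order]

    s_prev = 1.0          # left-continuous pooled Kaplan-Meier estimate S(t-)
    num, var = 0.0, 0.0
    for t in np.unique(time[event == 1]):                # distinct event times
        at_risk = time >= t
        n, n1 = at_risk.sum(), (at_risk & (group == 1)).sum()
        d = ((time == t) & (event == 1)).sum()
        d1 = ((time == t) & (event == 1) & (group == 1)).sum()

        w = (s_prev ** rho) * ((1.0 - s_prev) ** gamma)  # FH weight at t
        num += w * (d1 - d * n1 / n)
        if n > 1:
            var += w**2 * d * (n1 / n) * (1 - n1 / n) * (n - d) / (n - 1)
        s_prev *= 1.0 - d / n                            # update pooled KM

    return num / np.sqrt(var)

# Example: rho = gamma = 0 gives the standard log-rank test; gamma > 0 up-weights
# late differences, which helps under crossing or late-separating hazards.
rng = np.random.default_rng(1)
t0, t1 = rng.exponential(1.0, 100), rng.exponential(1.5, 100)
c = rng.exponential(2.0, 200)
time = np.minimum(np.r_[t0, t1], c)
event = (np.r_[t0, t1] <= c).astype(int)
group = np.r_[np.zeros(100), np.ones(100)]
print(weighted_logrank(time, event, group, rho=0, gamma=1))
```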
In high-dimensional survival analysis, sparse learning is critically important, as evidenced by applications in molecular biology, economics, and climate science. Despite rapid advances in sparse modeling of survival data, achieving valid statistical inference under measurement error remains largely unexplored. In this article, we introduce a new method, the double debiased Lasso (DDL), for constructing confidence intervals in high-dimensional errors-in-variables accelerated failure time (AFT) models. The DDL not only corrects the bias of an initial weighted least squares Lasso estimate by inverting the Karush-Kuhn-Tucker (KKT) conditions, but also alleviates the impact of measurement errors on both the initial estimator and the estimated inverse covariance matrix by using the nearest positive semi-definite projection technique. Furthermore, we establish comprehensive theoretical properties, including the asymptotic normality of the proposed DDL estimator and estimation consistency for the initial estimator. The effectiveness of our method is demonstrated through numerical studies and real-data analysis.
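For orientation, a generic debiased Lasso corrects the initial estimate with a one-step, KKT-based update; the DDL of the paper modifies each ingredient to accommodate censoring and errors in variables. The display below is a sketch with illustrative notation, not the paper's exact estimator.

```latex
% Generic one-step debiasing of a Lasso estimate by inverting the KKT conditions
% (illustrative notation; not the paper's exact DDL construction):
\[
  \widehat{\beta}^{\mathrm{deb}}
  = \widehat{\beta}^{\mathrm{Lasso}}
  + \frac{1}{n}\,\widehat{\Theta}\, X^{\top}\bigl(y - X \widehat{\beta}^{\mathrm{Lasso}}\bigr),
\]
% where \widehat{\Theta} estimates the inverse covariance matrix of the covariates.
% The DDL additionally starts from a weighted least squares Lasso (to handle
% censoring in the AFT model) and replaces the error-contaminated Gram matrix by
% its nearest positive semi-definite projection before computing both pieces.
```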
We consider semiparametric random-effects models for recurrent events in the presence of a terminal event. The recurrent events follow either a proportional marginal rate model (Cox in J Roy Stat Soc Ser B 34:187-220, 1972) or a proportional marginal mean model (Ghosh and Lin in Stat Sin 12:663-688, 2002), while the marginal rate of the terminal event is given by a proportional model. The dependence between the recurrent events and the terminal event is described by two variants of random-effects models that allow the processes to share the random effect, either fully or partly. The models are formulated as two-stage models: the marginals are fitted in an initial stage, and the random-effects parameters are then estimated. The estimation of parameters does not require the choice of any tuning parameters, in contrast to procedures based on numerical integration, and the numerical procedure works well. Standard errors are computed by bootstrapping. The methods are applied to the Taichung Peritoneal Dialysis Study (Chen et al. in Biom J 57(2):215-233, 2015), which considered recurrent inflammations in dialysis patients.
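As a rough illustration of how a shared random effect can link the two processes, a common formulation (not necessarily the exact specification used in the paper) lets an unobserved frailty multiply both intensities:

```latex
% Illustrative shared random-effects (frailty) formulation; the paper's exact
% specification of the fully and partly shared variants may differ.
\[
  \lambda^{R}(t \mid Z, u) = u\,\lambda_{0}(t)\, e^{\beta^{\top} Z},
  \qquad
  \lambda^{D}(t \mid Z, u) = u^{\alpha}\, h_{0}(t)\, e^{\gamma^{\top} Z},
\]
% where u is an unobserved random effect shared by the recurrent-event process
% (rate lambda^R) and the terminal event (hazard lambda^D); "fully shared"
% corresponds to the same u entering both, while "partly shared" lets the
% terminal event depend on u only partially, e.g. through alpha or a component of u.
```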
Motivated by the need to analyze continuously updated data sets in the context of time-to-event modeling, we propose a practically feasible nonparametric approach to estimate the conditional hazard function given a set of continuous and discrete predictors. The method is based on a representation of the conditional hazard as a ratio between a joint density and a conditional expectation determined by the distribution of the observed variables. It is shown that such ratio representations are available for uni- and bivariate times-to-event, in the presence of common types of random censoring, truncation, and possibly cured individuals, as well as for competing risks. This opens the door to nonparametric approaches in many time-to-event predictive models. To estimate the joint densities and conditional expectations, we propose recursive kernel smoothing, which is well suited for online estimation. Asymptotic results for these estimators are derived, and it is shown that they achieve optimal convergence rates. Simulation experiments show the good finite-sample performance of our recursive estimator under right censoring. The method is applied to a real dataset on primary breast cancer.
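To illustrate the online flavor of recursive kernel smoothing, the sketch below implements a Wolverton-Wagner-type recursive density estimator in one dimension, where each new observation updates the estimate without revisiting past data. The bandwidth schedule, grid, and class name are illustrative choices, not the paper's.

```python
# Minimal sketch of a recursive (online) kernel density estimator: each new
# observation updates the running estimate in O(grid size) time.  The paper
# applies such recursions to the joint densities and conditional expectations
# entering its hazard ratio representation; this is only a 1-D illustration.
import numpy as np

class RecursiveKDE:
    def __init__(self, grid, c=1.0, alpha=0.2):
        self.grid = np.asarray(grid, float)   # evaluation points
        self.f = np.zeros_like(self.grid)     # running density estimate
        self.n = 0
        self.c, self.alpha = c, alpha         # bandwidth schedule h_n = c * n^(-alpha)

    def update(self, x):
        """Fold one new observation x into the running estimate."""
        self.n += 1
        h = self.c * self.n ** (-self.alpha)
        # Gaussian kernel contribution (1/h) * phi((grid - x) / h)
        kern = np.exp(-0.5 * ((self.grid - x) / h) ** 2) / (h * np.sqrt(2 * np.pi))
        self.f = ((self.n - 1) * self.f + kern) / self.n
        return self.f

# Example: stream standard-normal data and compare with the true density at 0
rng = np.random.default_rng(2)
kde = RecursiveKDE(grid=np.linspace(-3, 3, 61))
for x in rng.normal(size=5000):
    kde.update(x)
print(kde.f[30], 1 / np.sqrt(2 * np.pi))      # estimate vs true value ~0.3989
```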
Partly interval-censored data with a cure fraction are commonly encountered in epidemiological and biomedical studies, where exact failure times are observed for some subjects, while for others the failure time is only known to fall within an interval. For cure survival data, two-component mixture cure models, which directly model the probability of being uncured and the conditional survival function of susceptible subjects, have attracted considerable attention. However, conventional cure models typically assume linear covariate effects in both components, which may limit their flexibility and applicability when relationships are potentially nonlinear. In this paper, we propose a flexible semiparametric mixture cure model that incorporates parametric and nonparametric covariate structures for both the cure probability and the event-time distribution of susceptible subjects. We utilize spline-based techniques to approximate the unspecified functions and implement a four-stage data augmentation approach to address the complexities inherent in the model and data structure. A computationally convenient Bayesian approach is developed to obtain posterior estimates of the model parameters. The finite-sample performance of the proposed method is evaluated through simulation studies. The practical utility of the approach is demonstrated by an analysis of child mortality data.
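For reference, the two-component mixture cure structure referred to above decomposes the population survival function as follows; the paper then allows parametric plus spline-approximated nonparametric covariate effects within both components.

```latex
% Generic two-component mixture cure decomposition of the population survival
% function (covariate structures in both components are as described above):
\[
  S_{\mathrm{pop}}(t \mid x, z) = \bigl(1 - \pi(x)\bigr) + \pi(x)\, S_{u}(t \mid z),
\]
% where pi(x) is the probability of being uncured (susceptible) and S_u(t | z)
% is the conditional survival function of susceptible subjects.
```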
Independent censoring is a key assumption usually made when analyzing time-to-event data. However, this assumption is difficult to assess and can be problematic, particularly in studies with disproportionate loss to follow-up due to adverse events. This paper addresses the challenges associated with dependent censoring by introducing a likelihood-based approach for analyzing bivariate survival data under dependent censoring. A flexible Joe-Hu copula is used to model the interdependence among the four times (two event times and two censoring times). The marginal distribution of each event or censoring time is modeled via the Cox proportional hazards model. Our estimator is consistent and has desirable asymptotic properties under regularity conditions. We present results from extensive simulation studies and further illustrate our approach using prostate cancer data.
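Schematically, a copula formulation of this kind ties together the marginal survival functions of the two event times and the two censoring times. The display below is a generic sketch with Cox-type marginals, not the specific Joe-Hu construction of the paper.

```latex
% Generic copula formulation of the joint survival function of the four times,
% with Cox-type marginals; C_theta is a four-dimensional copula (the paper uses
% a flexible Joe-Hu construction), and the notation is illustrative.
\[
  S(t_{1}, t_{2}, c_{1}, c_{2} \mid Z)
  = C_{\theta}\!\bigl(S_{T_{1}}(t_{1}\mid Z),\, S_{T_{2}}(t_{2}\mid Z),\,
                      S_{C_{1}}(c_{1}\mid Z),\, S_{C_{2}}(c_{2}\mid Z)\bigr),
  \qquad
  S_{T_{k}}(t \mid Z) = \exp\!\bigl\{-\Lambda_{0k}(t)\, e^{\beta_{k}^{\top} Z}\bigr\}.
\]
```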
Small sample sizes and censoring inherently limit the statistical efficiency of high-dimensional data analysis. While integrating data from multiple sources can enhance estimation efficiency, concerns remain regarding data privacy breaches and between-site heterogeneity. In this paper, we propose a privacy-preserving approach to integrate high-dimensional right-censored data with source-level heterogeneity. The proposed method is based on a local computation strategy: each site obtains an integrative estimate from its own full dataset together with summary statistics from the other sites. For each site, this strategy not only satisfies the data privacy constraints but also makes full use of its local data. Moreover, we introduce a refined procedure for practical use that avoids shrinking covariate effects that are unique to individual sites. Theoretical results for the proposed estimators, including consistency, asymptotic normality, and efficiency gains, are established. Simulation experiments demonstrate its superiority over integrative methods relying solely on summary statistics and over the local estimates. An application to multi-source clinical data on ovarian cancer further verifies its practical effectiveness.
Overdispersion is a phenomenon that is quite common in real-life count data sets, and this excess variability often results from an excessive number of zeros. To address this issue, zero-inflated distributions provide a flexible modeling approach capable of capturing high levels of dispersion. In this paper we introduce a new count distribution, the zero-inflated transmuted geometric distribution. We explore its key statistical properties, reliability aspects, and actuarial traits. Additionally, we employ different estimation strategies and conduct a simulation study to assess the performance of the estimators. We demonstrate the practical utility of the proposed model through the analysis of three empirical data sets. Lastly, we carry out likelihood ratio tests to justify the use of the proposed zero-inflated distribution.
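For orientation, one standard way to build such a distribution (the paper's exact parameterization may differ) is to transmute the geometric CDF via the quadratic rank transmutation map and then inflate the probability of zero:

```latex
% Illustrative construction (the paper's parameterization may differ): transmute
% the geometric CDF F via the quadratic rank transmutation map, then inflate zero.
\[
  G(x) = (1+\lambda)F(x) - \lambda F(x)^{2}, \qquad
  F(x) = 1 - q^{\,x+1}, \quad x = 0, 1, 2, \dots, \quad |\lambda| \le 1,
\]
\[
  P(X = 0) = \pi + (1-\pi)\, G(0), \qquad
  P(X = x) = (1-\pi)\bigl\{G(x) - G(x-1)\bigr\}, \quad x \ge 1,
\]
% where pi in [0, 1] is the zero-inflation probability and q in (0, 1) is the
% geometric parameter.
```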

