Title: A discussion on “A selective review of statistical methods using calibration information from similar studies”
Authors: Lingzhi Zhou, P. Song
Journal: Statistical Theory and Related Fields, 6(1), 196–198
Publication date: 2022-06-10
DOI: 10.1080/24754269.2022.2084930
Abstract
It is our pleasure to have the opportunity to comment on this fine work, in which the authors present a comprehensive review of empirical likelihood (EL) methods for integrative data analysis. The paper focuses on a unified methodological framework, based on EL and estimating equations (EE), that sequentially combines summary information from individual data batches to obtain estimation and inference comparable to those of the EL method utilizing all individual-level data. The latter is sometimes referred to as oracle estimation and inference in the setting of massively distributed data batches. An obvious strength of this review paper is its detailed treatment of the theoretical properties underlying the improved estimation efficiency gained from auxiliary information. The authors consider a typical data-integration situation in which individual-level data from the Kth data batch are combined with certain ‘good’ summary information from the previous K−1 data batches. While appreciating the theoretical strengths of the paper, we notice a few interesting aspects that are worth some discussion. Distributed data structures: In practice, both the individual data batch sizes and the number of data batches may be rather heterogeneous, requiring different theory and algorithms for the analysis. Such heterogeneity in distributed data structures is not well aligned with the methodological framework reviewed in the paper. One important practical scenario is that in which the number of data batches tends to infinity. Such a setting may arise from distributed data collected from millions of mobile device users, or from electronic health record (EHR) data sources distributed across thousands of hospitals. In the presence of massively distributed data batches, a natural question concerns the trade-off between data communication efficiency and analytic approximation accuracy. 
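The integration scheme described above — combining individual-level data from the Kth batch with summary statistics carried over from the previous K−1 batches — can be illustrated with a minimal sketch. This is a hypothetical toy example (estimating a common mean, not the authors' EL procedure): each earlier batch is represented only by its sample mean and size, and the aggregated estimating equation is solved jointly with the raw Kth batch. Because the estimating equation is linear in the parameter here, the combined estimator coincides with the oracle estimator computed from all pooled individual-level data.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical toy setup: the first K-1 batches are available only through
# summary statistics (sample mean, batch size), while the Kth batch arrives
# as individual-level data.
K = 5
batches = [rng.normal(loc=2.0, scale=1.0, size=200) for _ in range(K)]

# Summaries retained from the previous K-1 batches.
summaries = [(b.mean(), len(b)) for b in batches[:-1]]

# Individual-level data from the Kth batch.
x_K = batches[-1]

# Combined estimator: solve the aggregated estimating equation
#   sum_{k<K} n_k (mean_k - theta) + sum_i (x_i - theta) = 0,
# which here reduces to a sample-size-weighted average.
num = sum(n * m for m, n in summaries) + x_K.sum()
den = sum(n for _, n in summaries) + len(x_K)
theta_combined = num / den

# Oracle estimator computed from all individual-level data pooled together.
theta_oracle = np.concatenate(batches).mean()

print(theta_combined, theta_oracle)  # equal here because the EE is linear in theta
```

For nonlinear estimating equations this exact coincidence breaks down, which is precisely where the approximation-accuracy issues discussed next arise.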
Although one-round data communication is popular in this type of integrative data analysis, multiple rounds of communication may also be viable when implemented on high-performance computing clusters. Our experience suggests that sacrificing flexibility in data communication (e.g., being limited to one-round communication in the Hadoop paradigm), while enjoying computational speed, may pay a substantial price in approximation accuracy, leading to estimation bias that can accumulate as the number of data batches increases. This estimation bias is a technical challenge in nonlinear models, because approximations are invoked to linearize both the estimation procedure and the numerical search algorithm. On the other hand, relaxing the restrictions on data communication, such as the operations permitted within the lambda architecture, can help reduce the approximation error and lower the estimation bias; clearly, this requires more computational resources. This important issue was investigated by Zhou et al. (2022), who studied the asymptotic equivalence between the distributed EL estimator and the oracle EL estimator under both one-round and unlimited-round communication when the number of distributed data batches increases perpetually. They found that under one-round communication, if the number of data batches K increases with the sample size n at the slow order O(n^{1/2−δ}) with 0 < δ ≤ 1/2, and all individual batch sizes increase (i.e., n_min = min_k n_k → ∞), their proposed distributed EL estimator is asymptotically equivalent to the oracle EL estimator in the mode of convergence in distribution. Interestingly, they found that if there is no limit on communication, both technical conditions above can be removed; moreover, under much weaker conditions the distributed EL estimator and the oracle EL estimator are asymptotically equivalent in the mode of convergence in probability. 
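The bias accumulation under one-round communication can be seen in a deliberately simple illustration (not the authors' EL estimator): when each batch applies a nonlinear map locally and only the results are averaged at the center, Jensen's gap leaves a per-batch bias of order 1/n_k that averaging over many batches does not remove. The functional exp(μ) below is an assumption chosen purely for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

# Illustrative sketch: for X ~ N(mu, 1), estimate the nonlinear functional
# g(mu) = exp(mu). Under one-round communication each batch ships only its
# local plug-in estimate exp(x_bar_k); the central node simply averages them.
mu = 0.0
n_per_batch = 20   # small batches make the linearization error visible
K = 5000           # many batches, as in the massively distributed setting

data = rng.normal(mu, 1.0, size=(K, n_per_batch))

local = np.exp(data.mean(axis=1))   # one-round: nonlinear map applied locally
one_round = local.mean()            # simple average at the central node
oracle = np.exp(data.mean())        # oracle: nonlinear map of the pooled mean

# By Jensen's inequality, E[exp(x_bar_k)] = exp(mu + 1/(2 n_k)) > exp(mu), so
# the one-round average retains an O(1/n_k) bias no matter how large K grows,
# whereas the oracle estimator is consistent for exp(mu) = 1.
print(f"one-round: {one_round:.4f}  oracle: {oracle:.4f}")
```

Allowing further communication rounds (e.g., iterating on a shared linearization point) is one way such bias can be driven down, at the cost of the additional computational resources noted above.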
The latter is a stronger convergence result than the former. Furthermore, assisted by the ADMM algorithm, even if there exist seriously unbalanced