A discussion on “A selective review of statistical methods using calibration information from similar studies”

Statistical Theory and Related Fields (IF 0.7, JCR Q3, Statistics & Probability) | Pub Date: 2022-06-10 | DOI: 10.1080/24754269.2022.2084930

Lingzhi Zhou, P. Song

Statistical Theory and Related Fields, vol. 6, pp. 196-198


Abstract

It is our pleasure to comment on this fine work, in which the authors present a comprehensive review of empirical likelihood (EL) methods for integrative data analysis. The paper focuses on a unified methodological framework, based on EL and estimating equations (EE), that sequentially combines summary information from individual data batches to obtain estimation and inference comparable to those obtained by the EL method using all individual-level data. The latter is sometimes referred to as oracle estimation and inference in the setting of massively distributed data batches. An obvious strength of this review is its detailed treatment of theoretical properties, in particular the improved estimation efficiency gained through the use of auxiliary information. The authors consider a typical data integration situation in which individual-level data from the $K$th data batch are combined with certain 'good' summary information from the previous $K-1$ data batches. While appreciating the theoretical strengths of the paper, we notice a few interesting aspects that are worth discussing.

Distributed data structures: In practice, both the individual batch sizes and the number of data batches may be rather heterogeneous, requiring different theory and algorithms in the data analysis. Such heterogeneity in distributed data structures is not well aligned with the methodological framework reviewed in the paper. One important practical scenario is that the number of data batches tends to infinity. Such a setting may arise from distributed data collected from millions of mobile device users, or from electronic health records (EHR) data sources distributed across thousands of hospitals. In the presence of massively distributed data batches, a natural question concerns the trade-off between data communication efficiency and analytic approximation accuracy.
Although one-round data communication is popular in this type of integrative data analysis, multiple rounds of data communication may also be viable when the analysis is implemented on high-performance computing clusters. Our experience suggests that sacrificing flexibility in data communication (e.g., restricting to one-round communication as in the Hadoop paradigm), although it gains computational speed, may come at a substantial cost in approximation accuracy, leading to potentially accumulated estimation bias as the number of data batches increases. This estimation bias is a technical challenge in nonlinear models because approximations are invoked to linearize both the estimation procedure and the numerical search algorithm. On the other hand, relaxing the restrictions on data communication, as in operations within the lambda architecture, can help reduce the approximation error and lower the estimation bias. Clearly, the latter requires more computational resources.

This important issue was investigated by Zhou et al. (2022), who studied the asymptotic equivalence between the distributed EL estimator and the oracle EL estimator, under both one-round communication and unlimited rounds of communication, when the number of distributed data batches increases perpetually. They found that under one-round communication, if the number of data batches $K$ increases with the sample size $n$ at the slow order $K = O(n^{1/2-\delta})$ with $0 < \delta \le 1/2$, and all individual batch sizes increase (i.e., $n_{\min} = \min_k n_k \to \infty$), their proposed distributed EL estimator is asymptotically equivalent to the oracle EL estimator in the mode of convergence in distribution. Interestingly, they found that if there is no limit on communication, both technical conditions above can be removed, and moreover, under much weaker conditions the distributed EL estimator and the oracle EL estimator are asymptotically equivalent in the mode of convergence in probability.
The latter is a stronger convergence result than the former. Furthermore, assisted by the ADMM algorithm, even if there exist serious unbalanced
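
The trade-off between one-round and multi-round communication described above can be illustrated with a small simulation. The sketch below is not the distributed EL estimator of Zhou et al. (2022); as a stand-in, it assumes a logistic regression fitted by maximum likelihood, with simple averaging of local estimates for the one-round scheme and aggregated Newton steps for the multi-round scheme. All function names, the true parameter values, and the batch configuration are hypothetical choices made only for illustration.

```python
import numpy as np


def score_and_information(theta, X, y):
    """Score vector and Fisher information of the logistic log-likelihood for one batch."""
    p = 1.0 / (1.0 + np.exp(-X @ theta))
    score = X.T @ (y - p)
    info = (X * (p * (1.0 - p))[:, None]).T @ X
    return score, info


def newton_fit(batches, n_rounds=25):
    """Newton iterations in which each round sums per-batch scores and informations.

    With several batches, every round is one communication round: batches send
    only (score, information) summaries evaluated at the current theta.
    """
    d = batches[0][0].shape[1]
    theta = np.zeros(d)
    for _ in range(n_rounds):
        score = np.zeros(d)
        info = np.zeros((d, d))
        for X, y in batches:
            s, h = score_and_information(theta, X, y)
            score += s
            info += h
        theta = theta + np.linalg.solve(info, score)
    return theta


def simulate(n_total=20000, K=100, seed=0):
    """Compare oracle, one-round (averaged) and multi-round estimators."""
    rng = np.random.default_rng(seed)
    theta_true = np.array([0.5, -1.0, 1.0])  # assumed values, illustration only
    X = np.column_stack([np.ones(n_total), rng.normal(size=(n_total, 2))])
    y = rng.binomial(1, 1.0 / (1.0 + np.exp(-X @ theta_true))).astype(float)
    batches = list(zip(np.array_split(X, K), np.array_split(y, K)))

    oracle = newton_fit([(X, y)])  # uses all individual-level data
    # One round: each batch fits locally and sends its estimate once.
    one_round = np.mean([newton_fit([b]) for b in batches], axis=0)
    # Multiple rounds: summaries are exchanged at every Newton step.
    multi_round = newton_fit(batches)
    return oracle, one_round, multi_round


if __name__ == "__main__":
    oracle, one_round, multi_round = simulate()
    print("distance of one-round estimate from oracle:  ",
          np.linalg.norm(one_round - oracle))
    print("distance of multi-round estimate from oracle:",
          np.linalg.norm(multi_round - oracle))
```

In this toy setting the multi-round estimator solves the same pooled score equation as the oracle, so the two agree up to numerical error, whereas the averaged one-round estimator inherits the finite-sample bias of each local fit, which averaging over more batches does not remove. The extra accuracy is paid for with one exchange of summaries per Newton step.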


