A discussion of ‘A selective review on calibration information from similar studies’
Pub Date: 2022-08-26 | DOI: 10.1080/24754269.2022.2077903
Jiahua Chen
Being a long-time friend of Dr. Qin and having served as a supervisor of Drs. Li and Liu, I am as proud as the authors of the richness and breadth of this paper. It helps me play catch-up and shames me into working hard rather than hardly working. As a discussant, I wished to offer some additional insight on this research topic, but that proved a very difficult task: I must congratulate the authors for covering a vast territory and leaving no room for it. Instead, I raise two minor technical issues which might be of interest to some fellow researchers.
{"title":"A discussion of ‘A selective review on calibration information from similar studies’","authors":"Jiahua Chen","doi":"10.1080/24754269.2022.2077903","DOIUrl":"https://doi.org/10.1080/24754269.2022.2077903","url":null,"abstract":"Being a long-time friend of Dr. Qin and served as a supervisor of Drs. Li and Liu, I am as proud as authors of the richness of the content as well as the broadness of this paper. It helps me to play catch up and shames me to work hard rather than hardly work. As a discussant, I wish to come up with some additional insight on this research topic but this is deemed a very difficult task. I should congratulate the authors for covering a vast territory and leave no room for that. Instead, I raise two not so important technical issues which might be of interest to some fellow researchers.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"201 - 203"},"PeriodicalIF":0.5,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"43180850","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Rejoinder on “A selective review of statistical methods using calibration information from similar studies”
Pub Date: 2022-08-26 | DOI: 10.1080/24754269.2022.2111059
J. Qin, Yukun Liu, Pengfei Li
We thank Professor Jun Shao for organizing this interesting discussion. We also thank the six discussants for many insightful comments and suggestions. Assembling data from different sources has become a very popular topic. In our review paper, we have mainly discussed integration methods for the case where internal and external data share a common distribution, though the external data may not contain information on some of the variables collected in the internal study. Indeed, the common distribution assumption is very strong in practical applications. Owing to technological advances, data collection is getting much easier, for example via iPhones, satellite images, etc. As such data are not obtained by well-designed probability sampling, they inevitably may not represent the general population; as a consequence, systematic bias probably exists. In the survey sampling literature, how to combine survey sampling data with non-probability sampling data has also become very popular (Chen et al., 2020). Without bias correction, most existing methods may produce biased results if the common distribution assumption is violated, so one has to carefully assess potential bias before data integration. Before we respond to the reviewers' common concern about heterogeneity among different studies, we first outline the possible distributional shifts in each data source. In the machine learning literature, the concepts of covariate shift, label shift, and transfer learning have been widely used (Quiñonero-Candela et al., 2009). We briefly describe these concepts in terms of joint or conditional densities. Covariate shift: Let Y and X be, respectively, the outcome and a vector of covariates in statistics terminology, or a label variable and a vector of features in machine learning language. Suppose we have two data sets:
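For concreteness, a minimal sketch of the standard formulation (our notation, not necessarily the authors'): suppose the two data sets are drawn from joint densities $f_1(x, y)$ and $f_2(x, y)$. Covariate shift assumes
\[
f_1(x, y) = f_1(x)\, f(y \mid x), \qquad f_2(x, y) = f_2(x)\, f(y \mid x),
\]
so the marginal distribution of $X$ may differ between the two data sets while the conditional density $f(y \mid x)$ is shared. Label shift reverses the roles: the marginal of $Y$ differs while $f(x \mid y)$ is shared.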
{"title":"Rejoinder on “A selective review of statistical methods using calibration information from similar studies”","authors":"J. Qin, Yukun Liu, Pengfei Li","doi":"10.1080/24754269.2022.2111059","DOIUrl":"https://doi.org/10.1080/24754269.2022.2111059","url":null,"abstract":"We thank Professor Jun Shao for organizing this interesting discussion. We also thank the six discussants formany insightful comments and suggestions. Assembling data from different sources has been becoming a very popular topic nowadays. In our review paper, we have mainly discussed many integration methods when internal data and external data share a common distribution, though the external data may not have information for some underlying variables collected in the internal study. Indeed the common distribution assumption is very strong in practical applications. Due to the technology advance, the collection of data is gettingmuch easier, for example, by using i-phone, satellite image, etc. As those collected data are not obtained by well-designed probability sampling, inevitably, they may not represent the general population. As a consequence, there probably exists a systematic bias. In the survey sampling literature, how to combine survey sampling data with non probability sampling data has also got very popular (Chen et al., 2020). Without bias correction, most existing methods may produce biased results if the common distribution assumption is violated. One has to be careful to assess the impartiality before data integration. Before we respond to the common concern by the reviewers on the heterogeneity among different studies, we first outline the possible distributional shifts or changes in each source data. In themachine learning literature, the concepts of covariate shift, label shift, and transfer learning have been widely used (QuiñoneroCandela et al., 2009). We briefly highlight those concepts in terms of statistical joint density or conditional density. Covariate shift: Let Y and X be, respectively, the outcome and a vector of covariates in Statistic terminology, or a label variable and a vector of features in Machine Learning Languish. Suppose we have two data-sets:","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"204 - 207"},"PeriodicalIF":0.5,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48324852","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Variable selection in finite mixture of median regression models using skew-normal distribution
Pub Date: 2022-08-06 | DOI: 10.1080/24754269.2022.2107974
X. Zeng, Yuanyuan Ju, Liucang Wu
A regression model with skew-normal errors provides a useful extension of traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression models based on a mixture of skew-normal distributions to explore asymmetric data from several subpopulations. With an appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including the consistency of the variable selection method and the oracle property in estimation. A nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for the numerical computations is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.
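For orientation, a generic form of such a model (a sketch in our notation; the paper's exact parameterization may differ) writes the conditional density of the response given covariates $\mathbf{x}$ as a $K$-component mixture of skew-normal components,
\[
f(y \mid \mathbf{x}) = \sum_{k=1}^{K} \pi_k \, \mathrm{SN}\big(y;\ \mu_k(\mathbf{x}),\ \sigma_k^2,\ \lambda_k\big),
\qquad
\mathrm{SN}(y; \mu, \sigma^2, \lambda) = \frac{2}{\sigma}\,\phi\!\left(\frac{y-\mu}{\sigma}\right)\Phi\!\left(\lambda\,\frac{y-\mu}{\sigma}\right),
\]
where $\phi$ and $\Phi$ are the standard normal density and distribution function, the component locations $\mu_k(\mathbf{x})$ are shifted so that the median of component $k$ equals $\mathbf{x}^\top\boldsymbol{\beta}_k$ (median rather than mean regression), and penalties on the $\boldsymbol{\beta}_k$ drive the variable selection.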
{"title":"Variable selection in finite mixture of median regression models using skew-normal distribution","authors":"X. Zeng, Yuanyuan Ju, Liucang Wu","doi":"10.1080/24754269.2022.2107974","DOIUrl":"https://doi.org/10.1080/24754269.2022.2107974","url":null,"abstract":"A regression model with skew-normal errors provides a useful extension for traditional normal regression models when the data involve asymmetric outcomes. Moreover, data that arise from a heterogeneous population can be efficiently analysed by a finite mixture of regression models. These observations motivate us to propose a novel finite mixture of median regression model based on a mixture of the skew-normal distributions to explore asymmetrical data from several subpopulations. With the appropriate choice of the tuning parameters, we establish the theoretical properties of the proposed procedure, including consistency for variable selection method and the oracle property in estimation. A productive nonparametric clustering method is applied to select the number of components, and an efficient EM algorithm for numerical computations is developed. Simulation studies and a real data set are used to illustrate the performance of the proposed methodologies.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"7 1","pages":"30 - 48"},"PeriodicalIF":0.5,"publicationDate":"2022-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47462879","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Posterior propriety of an objective prior for generalized hierarchical normal linear models
Pub Date: 2022-07-30 | DOI: 10.1080/24754269.2021.1978206
Cong Lin, Dongchu Sun, Chengyuan Song
ABSTRACT Bayesian hierarchical models have been widely used in modern statistical applications. To deal with data having complex structures, we propose a generalized hierarchical normal linear (GHNL) model which accommodates arbitrarily many levels, usual design matrices and ‘vanilla’ covariance matrices. Objective hyperpriors can be employed for the GHNL model to express ignorance or match frequentist properties, yet the common objective Bayesian approaches are infeasible or fraught with danger in hierarchical modelling. To tackle this issue, [Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178, 104606. https://doi.org/10.1016/j.jmva.2020.104606] proposed a particular objective prior and investigated its properties comprehensively. Posterior propriety is important in the choice of priors to guarantee the convergence of MCMC samplers. James Berger conjectured that the resulting posterior is proper for a hierarchical normal model with arbitrarily many levels; however, a rigorous proof had not been given. In this paper, we complete this story and provide user-friendly guidance. One main contribution of this paper is a new technique for deriving an elaborate upper bound on the integrated likelihood; another is a unified approach to checking posterior propriety for linear models. An efficient Gibbs sampling method is also introduced and outperforms other sampling approaches considerably.
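As a rough sketch of the model class (our notation; see the paper for the precise covariance structure), a hierarchical normal linear model with $L$ levels stacks conditional normal layers:
\[
\mathbf{y} \mid \boldsymbol{\theta}_1 \sim N\!\big(X_1 \boldsymbol{\theta}_1,\ \Sigma_1\big), \qquad
\boldsymbol{\theta}_i \mid \boldsymbol{\theta}_{i+1} \sim N\!\big(X_{i+1} \boldsymbol{\theta}_{i+1},\ \Sigma_{i+1}\big), \quad i = 1, \dots, L-1,
\]
with a prior on the top-level parameter $\boldsymbol{\theta}_L$ and (possibly improper) objective hyperpriors on the covariance parameters. Posterior propriety then amounts to verifying that the likelihood, integrated against all of these priors, is finite.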
{"title":"Posterior propriety of an objective prior for generalized hierarchical normal linear models","authors":"Cong Lin, Dongchu Sun, Chengyuan Song","doi":"10.1080/24754269.2021.1978206","DOIUrl":"https://doi.org/10.1080/24754269.2021.1978206","url":null,"abstract":"ABSTRACT Bayesian Hierarchical models has been widely used in modern statistical application. To deal with the data having complex structures, we propose a generalized hierarchical normal linear (GHNL) model which accommodates arbitrarily many levels, usual design matrices and ‘vanilla’ covariance matrices. Objective hyperpriors can be employed for the GHNL model to express ignorance or match frequentist properties, yet the common objective Bayesian approaches are infeasible or fraught with danger in hierarchical modelling. To tackle this issue, [Berger, J., Sun, D., & Song, C. (2020b). An objective prior for hyperparameters in normal hierarchical models. Journal of Multivariate Analysis, 178, 104606. https://doi.org/10.1016/j.jmva.2020.104606] proposed a particular objective prior and investigated its properties comprehensively. Posterior propriety is important for the choice of priors to guarantee the convergence of MCMC samplers. James Berger conjectured that the resulting posterior is proper for a hierarchical normal model with arbitrarily many levels, a rigorous proof of which was not given, however. In this paper, we complete this story and provide an user-friendly guidance. One main contribution of this paper is to propose a new technique for deriving an elaborate upper bound on the integrated likelihood, but also one unified approach to checking the posterior propriety for linear models. An efficient Gibbs sampling method is also introduced and outperforms other sampling approaches considerably.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"17 1","pages":"309 - 326"},"PeriodicalIF":0.5,"publicationDate":"2022-07-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41289512","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Moderate deviation principle for stochastic reaction-diffusion systems with multiplicative noise and non-Lipschitz reaction
Pub Date: 2022-06-27 | DOI: 10.1080/24754269.2021.1963183
Juan Yang
ABSTRACT In this article, we obtain a central limit theorem and prove a moderate deviation principle for stochastic reaction-diffusion systems with multiplicative noise and a non-Lipschitz reaction term.
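For context, such systems are typically written in the generic form (our sketch; the paper's precise assumptions and scalings may differ)
\[
\partial_t u^{\varepsilon}(t, x) = \Delta u^{\varepsilon}(t, x) + f\big(u^{\varepsilon}(t, x)\big) + \sqrt{\varepsilon}\,\sigma\big(u^{\varepsilon}(t, x)\big)\,\dot{W}(t, x),
\]
where $f$ is the (possibly non-Lipschitz) reaction term, $\sigma$ the multiplicative noise coefficient and $W$ a space-time noise. A moderate deviation principle concerns the deviations $(u^{\varepsilon} - u^{0})/(\sqrt{\varepsilon}\, h(\varepsilon))$ for scalings with $h(\varepsilon) \to \infty$ and $\sqrt{\varepsilon}\, h(\varepsilon) \to 0$, a regime between the central limit theorem and large deviations.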
{"title":"Moderate deviation principle for stochastic reaction-diffusion systems with multiplicative noise and non-Lipschitz reaction","authors":"Juan Yang","doi":"10.1080/24754269.2021.1963183","DOIUrl":"https://doi.org/10.1080/24754269.2021.1963183","url":null,"abstract":"ABSTRACT In this article, we obtain a central limit theorem and prove a moderate deviation principle for stochastic reaction-diffusion systems with multiplicative noise and non-Lipschitz reaction term.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"299 - 308"},"PeriodicalIF":0.5,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"44846401","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A discussion on “A selective review of statistical methods using calibration information from similar studies”
Pub Date: 2022-06-10 | DOI: 10.1080/24754269.2022.2084930
Lingzhi Zhou, P. Song
It is our pleasure to have the opportunity to comment on this fine work, in which the authors present a comprehensive review of empirical likelihood (EL) methods for integrative data analysis. The paper focuses on a unified methodological framework based on EL and estimating equations (EE) that sequentially combines summary information from individual data batches to obtain estimation and inference comparable to those obtained by the EL method utilizing all individual-level data. The latter is sometimes referred to as oracle estimation and inference in the setting of massively distributed data batches. An obvious strength of this review paper is its detailed account of the theoretical properties behind the efficiency gains obtained from auxiliary information. The authors consider a typical data integration situation in which individual-level data from the Kth data batch are combined with certain ‘good’ summary information from the previous K−1 data batches. While appreciating the theoretical strengths of the paper, we notice a few interesting aspects that are worth some discussion.

Distributed data structures: In practice, both the individual data batch sizes and the number of data batches may be rather heterogeneous, requiring different theory and algorithms in the data analysis. Such heterogeneity in distributed data structures is not well aligned with the methodological framework reviewed in the paper. One important practical scenario is that the number of data batches tends to infinity. Such a setting may arise from distributed data collected from millions of mobile device users, or from electronic health record (EHR) data sources distributed across thousands of hospitals. In the presence of massively distributed data batches, a natural question pertains to the trade-off between data communication efficiency and analytic approximation accuracy. Although one-round data communication is popular in this type of integrative data analysis, multiple rounds of data communication may also be viable when implemented on high-performance computing clusters. Our experience suggests that sacrificing flexibility in data communication (e.g., being limited to one-round communication in the Hadoop paradigm), although computationally fast, may pay a substantial price in approximation accuracy, leading to potentially accumulated estimation bias as the number of data batches increases. This estimation bias is a technical challenge in nonlinear models, due to the approximations invoked to linearize both the estimation procedure and the numerical search algorithm. On the other hand, relaxing the restrictions on data communication, such as the operations within the lambda architecture, can help reduce the approximation error and lower the estimation bias; clearly, this requires more computational resources. This important issue was investigated by Zhou et al. (2022), who studied the relevant asymptotics.
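As a concrete illustration of the one-round combination discussed above, here is a minimal sketch (hypothetical code, not from the paper under discussion) of the fixed-effect, inverse-variance rule for pooling per-batch estimates with their covariance matrices:

```python
import numpy as np

def combine_batches(estimates, covariances):
    """One-round, fixed-effect (inverse-variance) combination of
    per-batch estimates theta_hat_k with covariance matrices V_k."""
    precisions = [np.linalg.inv(V) for V in covariances]
    total_precision = sum(precisions)
    combined_cov = np.linalg.inv(total_precision)
    combined_est = combined_cov @ sum(P @ t for P, t in zip(precisions, estimates))
    return combined_est, combined_cov

# Example: three batches estimating a two-dimensional parameter.
rng = np.random.default_rng(0)
theta = np.array([1.0, -0.5])
ests = [theta + rng.normal(scale=0.1, size=2) for _ in range(3)]
covs = [0.01 * np.eye(2) for _ in range(3)]
print(combine_batches(ests, covs))
```

For linear models this combination is exact; for nonlinear models each per-batch estimate comes from a linearized local solve, which is precisely where the accumulated approximation bias discussed above can enter.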
{"title":"A discussion on “A selective review of statistical methods using calibration information from similar studies”","authors":"Lingzhi Zhou, P. Song","doi":"10.1080/24754269.2022.2084930","DOIUrl":"https://doi.org/10.1080/24754269.2022.2084930","url":null,"abstract":"It is our pleasure to have an opportunity of making comments on this fine work in that the authors present a comprehensive review on empirical likelihood (EL) methods for integrative data analyses. This paper focuses on a unified methodological framework based on EL and estimating equations (EE) to sequentially combine summary information from individual data batches to obtain desirable estimation and inference comparable to those obtained by the EL method utilizing all individual-level data. The latter is sometimes referred to as an oracle estimation and inference in the setting of massively distributed data batches. An obvious strength of this review paper concerns the detailed theoretical properties in connection to the improved estimation efficiency through the utility of auxiliary information. In this paper, the authors consider a typical data integration situation where individual-level data from the Kth data batch is combined with certain ‘good’ summary information from the previous K−1 data batches. While appreciating the theoretical strengths in this paper, we notice a few interesting aspects that are worth some discussions. Distributed data structures: In practice, both individual data batch size and the number of data batches may appear rather heterogeneous, requiring different theory and algorithms in the data analysis. Such heterogeneity in distributed data structures is not well aligned with the methodological framework reviewed in the paper. One important practical scenario is that the number of data batches tends to infinity. Such setting may arise from distributed data collected from millions of mobile device users, or from electronic health records (EHR) data sources distributed across thousands of hospitals. In the presence of massively distributed data batches, a natural question pertains to a trade-off between data communication efficiency and analytic approximation accuracy. Although oneround data communication is popular in this type of integrative data analysis, multiple rounds of data communication may be also viable in the implementation via high-performance computing clusters. Our experience suggests that sacrifice in the flexibility of data communication (e.g., limited to one-round communication in the Hadoop paradigm), although enjoys computational speed, may pay a substantial price on the loss of approximation accuracy, leading to potentially accumulated estimation bias when the number of data batches increases. This issue of estimation bias is a technical challenge in nonlinear models due to the invocation of approximations to linearize both estimation procedure and numerical search algorithm. On the other hand, relaxing the restrictions on data communication, such as the operations within the lambda architecture, can help reduce the approximation error and lower estimation bias. Clearly, the latter requires more computational resources. This important issue was investigated by Zhou et al. 
(2022) that studied asympt","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"196 - 198"},"PeriodicalIF":0.5,"publicationDate":"2022-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"42466102","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A discussion on “A selective review of statistical methods using calibration information from similar studies” by Qin, Liu and Li
Pub Date: 2022-06-10 | DOI: 10.1080/24754269.2022.2084929
Peisong Han
We congratulate Qin, Liu and Li (QLL) on a thoughtful and much-needed review of many interesting methods for combining information from similar studies. We appreciate being given the opportunity to contribute a discussion. QLL cover a variety of different settings and methods. Building on that, we provide a brief review of some additional relevant literature, with a focus on methods that deal with population heterogeneity, since different studies most likely sample different populations, and whether information can be combined depends, among many other things, on how similar those populations are. We will follow the setting of QLL, although most of the methods apply more broadly.
{"title":"A discussion on “A selective review of statistical methods using calibration information from similar studies” by Qin, Liu and Li","authors":"Peisong Han","doi":"10.1080/24754269.2022.2084929","DOIUrl":"https://doi.org/10.1080/24754269.2022.2084929","url":null,"abstract":"We Qin, Liu and Li (QLL) on a thoughtful and much needed review of many interesting methods for combining information from similar studies. We appreciate being given the opportunity to make a discussion. QLL cover a variety of different settings and methods. Based on that, we will provide a brief review on some additional relevant literature with a focus on methods that deal with population heterogeneity, since it is most likely that different studies sample different and whether information be combined depends on how similar those among many other To the we will follow the setting in of QLL, most of methods more broadly applied.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"193 - 195"},"PeriodicalIF":0.5,"publicationDate":"2022-06-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"48494981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discussion of “A selective review of statistical methods using calibration information from similar studies” and some remarks on data integration
Pub Date: 2022-05-19 | DOI: 10.1080/24754269.2022.2075083
J. Lawless
Qin, Liu and Li (henceforth QLL) review methods for combining information using empirical likelihood and related approaches; many of these ideas originated in the earlier work of Jing Qin. I thank the authors for their review, and for the opportunity to contribute to its discussion. I have little to say about the technical aspects, which are well established, but will comment briefly on broader aspects of data integration and their implications for methods like those in the article. I will focus on settings where there is a response variable Y and covariates X, Z, and assume the target of inference is either the distribution f(y | x, z) of Y given X, Z or the ‘marginal’ distribution f_m(y | x) of Y given X. In health research, Y might represent (time to) the occurrence of some specific event, and X, Z covariates, exposures or interventions. The distribution f(y | x, z) is important for individual-level decisions; in settings where X represents interventions, f_m(y | x) is relevant in randomized trials and comparative effectiveness research. The authors consider two main topics in data integration: (i) the use of external auxiliary data to augment the analysis of a specific ‘internal’ study, and (ii) the combination of data from separate studies with a view to inference for common parameters. They focus on settings where ...
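The two targets mentioned above are linked by the usual identity
\[
f_m(y \mid x) = \int f(y \mid x, z)\, f(z \mid x)\, dz,
\]
which makes explicit that the marginal target depends on the covariate distribution $f(z \mid x)$, so combining studies for $f_m$ requires attention to differences in that distribution across studies.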
{"title":"Discussion of “A selective review of statistical methods using calibration information from similar studies” and some remarks on data integration","authors":"J. Lawless","doi":"10.1080/24754269.2022.2075083","DOIUrl":"https://doi.org/10.1080/24754269.2022.2075083","url":null,"abstract":"Qin, Liu and Li (henceforth QLL) review methods for combining information using empirical likelihood and related approaches; many of these ideas originated in the earlier work of Jing Qin. I thank the authors for their review, and for the opportunity to contribute to its discussion. I have little to say about technical aspects, which are well established but will comment briefly on broader aspects of data integration, and implications for methods like those in the article. I will focus on settings where there is a response variable Y and covariates X , Z and assume the target of inference is either the distribution f ( y | x , z ) of Y given X , Z or the ‘marginal’ distribution f m ( y | x ) of Y given X . In health research Y might represent (time to) the occurrence of some specific event, and X , Z covariates, exposures or interventions. The distribution f ( y | x , z ) is important for individual-level decisions; in settings where X represents interventions f m ( y | x ) is relevant in randomized trials and comparative effectiveness research. The authors consider two main topics in data integration: (i) the use of external auxiliary data to augment the analysis of a specific ‘internal’ study, and (ii) the combination of data from separate studies with a view to for common parameters or They focus on where,","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"191 - 192"},"PeriodicalIF":0.5,"publicationDate":"2022-05-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"47416322","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Discussion of ‘A selective review of statistical methods using calibration information from similar studies’
Pub Date: 2022-05-15 | DOI: 10.1080/24754269.2022.2075082
J. Ning
Combining information from similar studies has attracted substantial attention and is becoming increasingly important for assembling quality evidence in comparative effectiveness research. To my knowledge, this is the first paper to systematically review classical and up-to-date approaches, such as meta-analysis, empirical likelihood (EL), renewable estimation and incremental inference, for incorporating information from multiple sources. This review paper succinctly presents both basic and advanced issues and will greatly benefit researchers interested in this field. Because of the wide array of related methods, the paper consists of cohesive but relatively independent sections. Although it is a review paper, the focus and contents are quite different from those of the original papers. For example, an optimal combination of two estimators from two independent studies is derived by two methods from different perspectives: a linear combination with the smallest asymptotic variance, and maximum likelihood. Another example is how to select a more efficient way to synthesize auxiliary information from other studies. In Section 5 of the review paper, two different sets of constraints, one of which involves the parameter of interest while the other does not, are presented and compared in terms of efficiency improvement. Both statistical intuition and theoretical justification are provided, which help readers find a better way to combine aggregate information for improved efficiency in practice. Such insightful discussions are not easily found elsewhere. The paper also nicely derives the conclusion that, similar to parametric-likelihood-based meta-analysis, the calibration methods (e.g., EL and the generalized method of moments (GMM)) based on aggregate information suffer no efficiency loss compared to the corresponding methods using all individual data. Such deep insight into these methods greatly promotes their use for information calibration, since obtaining individual-level data is always challenging. As stated in the title, this review paper mainly focuses on statistical methods using calibration information from similar studies. One crucial assumption of these methods is homogeneity between the cohort with individual data (e.g., the target cohort) and the similar studies (e.g., external sources). When the calibration information from the external sources is not comparable with that of the target cohort, such calibration methods may result in severe estimation bias and misleading conclusions (Chen et al., 2021; Huang et al., 2016). One way to address this issue is to test comparability by comparing calibration information between the target cohort and the external sources before combining such information.
Using the setup in Section 4 of the review paper as an example, assume that the auxiliary information from the external sources is the mean of Y by subgroups (e.g., subgroups determined by covariates such as age and sex).
Statistical Theory and Related Fields, 6(1), 199–200.
Bayesian penalized model for classification and selection of functional predictors using longitudinal MRI data from ADNI
Pub Date: 2022-05-09 | DOI: 10.1080/24754269.2022.2064611
Asish Banik, T. Maiti, Andrew R. Bender
ABSTRACT The main goal of this paper is to employ longitudinal trajectories of a large number of sub-regional brain volumetric MRI measures as statistical predictors for Alzheimer's disease (AD) classification. We use logistic regression in a Bayesian framework that includes many functional predictors. Direct sampling of the regression coefficients from the Bayesian logistic model is difficult because of its complicated likelihood function. In high-dimensional scenarios, predictor selection is paramount and is achieved by introducing spike-and-slab priors, non-local priors, or horseshoe priors. We seek to avoid the complicated Metropolis-Hastings approach and to develop an easily implementable Gibbs sampler. In addition, the Bayesian estimation provides proper estimates of the model parameters, which are also useful for inference. Another advantage of working with logistic regression is that it yields the log-odds of AD relative to normal control based on the selected longitudinal predictors, rather than simply classifying patients based on cross-sectional estimates. Ultimately, however, we combine approaches and use a probability threshold to classify individual patients. We employ 49 functional predictors consisting of volumetric estimates of brain sub-regions, chosen for their established clinical significance. Moreover, the use of spike-and-slab priors ensures that many redundant predictors are dropped from the model.
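As a schematic of the kind of model described (our notation; the paper's exact priors may differ, and each functional predictor is reduced here to a scalar summary $x_{ij}$), a Bayesian logistic model with spike-and-slab selection over the 49 predictors can be written
\[
\operatorname{logit} P(y_i = 1 \mid \mathbf{x}_i) = \alpha + \sum_{j=1}^{49} \gamma_j \beta_j x_{ij},
\qquad
\beta_j \sim N(0, \tau^2), \qquad \gamma_j \sim \mathrm{Bernoulli}(p),
\]
where $y_i = 1$ indicates AD, $\gamma_j = 0$ drops predictor $j$ from the model, and the fitted linear predictor is exactly the log-odds of AD relative to normal control.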
{"title":"Bayesian penalized model for classification and selection of functional predictors using longitudinal MRI data from ADNI","authors":"Asish Banik, T. Maiti, Andrew R. Bender","doi":"10.1080/24754269.2022.2064611","DOIUrl":"https://doi.org/10.1080/24754269.2022.2064611","url":null,"abstract":"ABSTRACT The main goal of this paper is to employ longitudinal trajectories in a significant number of sub-regional brain volumetric MRI data as statistical predictors for Alzheimer's disease (AD) classification. We use logistic regression in a Bayesian framework that includes many functional predictors. The direct sampling of regression coefficients from the Bayesian logistic model is difficult due to its complicated likelihood function. In high-dimensional scenarios, the selection of predictors is paramount with the introduction of either spike-and-slab priors, non-local priors, or Horseshoe priors. We seek to avoid the complicated Metropolis-Hastings approach and to develop an easily implementable Gibbs sampler. In addition, the Bayesian estimation provides proper estimates of the model parameters, which are also useful for building inference. Another advantage of working with logistic regression is that it calculates the log of odds of relative risk for AD compared to normal control based on the selected longitudinal predictors, rather than simply classifying patients based on cross-sectional estimates. Ultimately, however, we combine approaches and use a probability threshold to classify individual patients. We employ 49 functional predictors consisting of volumetric estimates of brain sub-regions, chosen for their established clinical significance. Moreover, the use of spike-and-slab priors ensures that many redundant predictors are dropped from the model.","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"327 - 343"},"PeriodicalIF":0.5,"publicationDate":"2022-05-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"41643341","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}