{"title":"关于“使用类似研究的校准信息对统计方法进行选择性审查”的复辩状","authors":"J. Qin, Yukun Liu, Pengfei Li","doi":"10.1080/24754269.2022.2111059","DOIUrl":null,"url":null,"abstract":"We thank Professor Jun Shao for organizing this interesting discussion. We also thank the six discussants formany insightful comments and suggestions. Assembling data from different sources has been becoming a very popular topic nowadays. In our review paper, we have mainly discussed many integration methods when internal data and external data share a common distribution, though the external data may not have information for some underlying variables collected in the internal study. Indeed the common distribution assumption is very strong in practical applications. Due to the technology advance, the collection of data is gettingmuch easier, for example, by using i-phone, satellite image, etc. As those collected data are not obtained by well-designed probability sampling, inevitably, they may not represent the general population. As a consequence, there probably exists a systematic bias. In the survey sampling literature, how to combine survey sampling data with non probability sampling data has also got very popular (Chen et al., 2020). Without bias correction, most existing methods may produce biased results if the common distribution assumption is violated. One has to be careful to assess the impartiality before data integration. Before we respond to the common concern by the reviewers on the heterogeneity among different studies, we first outline the possible distributional shifts or changes in each source data. In themachine learning literature, the concepts of covariate shift, label shift, and transfer learning have been widely used (QuiñoneroCandela et al., 2009). We briefly highlight those concepts in terms of statistical joint density or conditional density. Covariate shift: Let Y and X be, respectively, the outcome and a vector of covariates in Statistic terminology, or a label variable and a vector of features in Machine Learning Languish. Suppose we have two data-sets:","PeriodicalId":22070,"journal":{"name":"Statistical Theory and Related Fields","volume":"6 1","pages":"204 - 207"},"PeriodicalIF":0.7000,"publicationDate":"2022-08-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Rejoinder on “A selective review of statistical methods using calibration information from similar studies”\",\"authors\":\"J. Qin, Yukun Liu, Pengfei Li\",\"doi\":\"10.1080/24754269.2022.2111059\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We thank Professor Jun Shao for organizing this interesting discussion. We also thank the six discussants formany insightful comments and suggestions. Assembling data from different sources has been becoming a very popular topic nowadays. In our review paper, we have mainly discussed many integration methods when internal data and external data share a common distribution, though the external data may not have information for some underlying variables collected in the internal study. Indeed the common distribution assumption is very strong in practical applications. Due to the technology advance, the collection of data is gettingmuch easier, for example, by using i-phone, satellite image, etc. As those collected data are not obtained by well-designed probability sampling, inevitably, they may not represent the general population. As a consequence, there probably exists a systematic bias. 
In the survey sampling literature, how to combine survey sampling data with non probability sampling data has also got very popular (Chen et al., 2020). Without bias correction, most existing methods may produce biased results if the common distribution assumption is violated. One has to be careful to assess the impartiality before data integration. Before we respond to the common concern by the reviewers on the heterogeneity among different studies, we first outline the possible distributional shifts or changes in each source data. In themachine learning literature, the concepts of covariate shift, label shift, and transfer learning have been widely used (QuiñoneroCandela et al., 2009). We briefly highlight those concepts in terms of statistical joint density or conditional density. Covariate shift: Let Y and X be, respectively, the outcome and a vector of covariates in Statistic terminology, or a label variable and a vector of features in Machine Learning Languish. Suppose we have two data-sets:\",\"PeriodicalId\":22070,\"journal\":{\"name\":\"Statistical Theory and Related Fields\",\"volume\":\"6 1\",\"pages\":\"204 - 207\"},\"PeriodicalIF\":0.7000,\"publicationDate\":\"2022-08-26\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Statistical Theory and Related Fields\",\"FirstCategoryId\":\"96\",\"ListUrlMain\":\"https://doi.org/10.1080/24754269.2022.2111059\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q3\",\"JCRName\":\"STATISTICS & PROBABILITY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Theory and Related Fields","FirstCategoryId":"96","ListUrlMain":"https://doi.org/10.1080/24754269.2022.2111059","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q3","JCRName":"STATISTICS & PROBABILITY","Score":null,"Total":0}
Abstract
We thank Professor Jun Shao for organizing this interesting discussion. We also thank the six discussants for their many insightful comments and suggestions. Assembling data from different sources has become a very popular topic nowadays. In our review paper, we mainly discussed integration methods for the case where the internal and external data share a common distribution, even though the external data may lack information on some underlying variables collected in the internal study. The common distribution assumption is indeed very strong in practical applications. Owing to technological advances, data collection has become much easier, for example via smartphones, satellite images, and so on. Because such data are not obtained by well-designed probability sampling, they inevitably may not represent the general population; as a consequence, a systematic bias probably exists. In the survey sampling literature, how to combine probability sampling data with non-probability sampling data has also become very popular (Chen et al., 2020). Without bias correction, most existing methods may produce biased results when the common distribution assumption is violated, so one has to carefully assess potential bias before data integration. Before we respond to the reviewers' common concern about heterogeneity among different studies, we first outline the possible distributional shifts or changes in each data source. In the machine learning literature, the concepts of covariate shift, label shift, and transfer learning have been widely used (Quiñonero-Candela et al., 2009). We briefly describe these concepts in terms of joint or conditional densities. Covariate shift: let Y and X be, respectively, the outcome and a vector of covariates in statistical terminology, or a label variable and a vector of features in machine learning terminology. Suppose we have two datasets:
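The abstract is cut off at this point. In a standard formulation (notation ours, following the convention of Quiñonero-Candela et al., 2009, not necessarily the paper's own continuation), the two datasets are a source sample and a target sample drawn from densities

    p_s(x, y) = p_s(x) p(y | x)   and   p_t(x, y) = p_t(x) p(y | x),   with p_s(x) ≠ p_t(x),

so under covariate shift the covariate marginals differ while the conditional density of Y given X is shared. Label shift reverses the roles: p_s(y) ≠ p_t(y) while the conditional density p(x | y) is common to both samples.

As a purely hypothetical illustration of bias correction under covariate shift (a density-ratio weighting sketch, not the method of the paper or of Chen et al., 2020), one can estimate w(x) = p_t(x) / p_s(x) with a source-versus-target classifier and refit on the reweighted source sample:

    # Sketch: density-ratio weighting under covariate shift (illustrative only).
    # Assumes the shared-conditional setup above; all names here are hypothetical.
    import numpy as np
    from sklearn.linear_model import LogisticRegression, LinearRegression

    rng = np.random.default_rng(0)

    # Source and target covariates have different marginals, but the
    # conditional E[Y | X] = 2 X is the same in both populations.
    x_s = rng.normal(0.0, 1.0, size=(500, 1))   # internal (source) study
    x_t = rng.normal(1.0, 1.0, size=(500, 1))   # external (target) population
    y_s = 2.0 * x_s[:, 0] + rng.normal(0.0, 0.5, size=500)

    # Classifier separating source (label 0) from target (label 1) covariates.
    z = np.vstack([x_s, x_t])
    lab = np.concatenate([np.zeros(500), np.ones(500)])
    clf = LogisticRegression().fit(z, lab)

    # The classifier's odds of "target" given x are proportional to
    # the density ratio p_t(x) / p_s(x).
    p = clf.predict_proba(x_s)[:, 1]
    w = p / (1.0 - p)

    # Weighted least squares on the source data targets the external population.
    fit = LinearRegression().fit(x_s, y_s, sample_weight=w)
    print("estimated slope (true value 2.0):", fit.coef_[0])

Because the weights transfer the source covariate distribution to the target one, the weighted fit remains consistent for the target population as long as the conditional density p(y | x) is truly shared; when that assumption also fails, reweighting alone cannot remove the bias.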