Pub Date: 2022-06-01. Epub Date: 2022-05-25. DOI: 10.1093/jssam/smac016
Mandi Yu, Yulei He, Trivellore E Raghunathan
Data synthesis is an effective statistical approach for reducing data disclosure risk. Generating fully synthetic data might minimize such risk, but its modeling and application can be difficult for data from large, complex surveys. This article extended two-stage imputation to simultaneously impute item missing values and generate fully synthetic data. A new combining rule for making inferences using data generated in this manner was developed. Two semiparametric missing data imputation models were adapted to generate fully synthetic data for skewed continuous variables and sparse binary variables, respectively. The proposed approach was evaluated using simulated data and real longitudinal data from the Health and Retirement Study. Using real data, the proposed approach was also compared with two existing synthesis approaches: (1) parametric regression models as implemented in IVEware; and (2) nonparametric Classification and Regression Trees as implemented in the synthpop package for R. The results show that high data utility is maintained for a wide variety of descriptive and model-based statistics using the proposed strategy. The proposed strategy also performs better than existing methods for sophisticated analyses such as factor analysis.
A Semiparametric Multiple Imputation Approach to Fully Synthetic Data for Complex Surveys. Journal of Survey Statistics and Methodology, pp. 618-641. Published 2022-06-01. DOI: 10.1093/jssam/smac016. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11044899/pdf/
Brady T West, Ai Rene Ong, Frederick G Conrad, Michael F Schober, Kallan M Larsen, Andrew L Hupp
Live video (LV) communication tools (e.g., Zoom) have the potential to provide survey researchers with many of the benefits of in-person interviewing, while also greatly reducing data collection costs, given that interviewers do not need to travel and make in-person visits to sampled households. The COVID-19 pandemic has exposed the vulnerability of in-person data collection to public health crises, forcing survey researchers to explore remote data collection modes, such as LV interviewing, that seem likely to yield high-quality data without in-person interaction. Given the potential benefits of these technologies, the operational and methodological aspects of video interviewing have started to receive research attention from survey methodologists. Although it is remote, video interviewing still involves respondent-interviewer interaction that introduces the possibility of interviewer effects. No research to date has evaluated this potential threat to the quality of the data collected in video interviews. This research note presents an evaluation of interviewer effects in a recent experimental study of alternative approaches to video interviewing, including both LV interviewing and the use of prerecorded videos of the same interviewers asking questions embedded in a web survey ("prerecorded video" interviewing). We find little evidence of significant interviewer effects when using these two approaches, which is a promising result. We also find that when interviewer effects were present, they tended to be slightly larger in the LV approach, as would be expected given that it is an interactive approach. We conclude with a discussion of the implications of these findings for future research using video interviewing.
Interviewer Effects in Live Video and Prerecorded Video Interviewing. Journal of Survey Statistics and Methodology, 10(2), pp. 317-336. Published 2022-04-01. DOI: 10.1093/jssam/smab040. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC8690284/pdf/smab040.pdf
Respondent-driven sampling (RDS) is a popular method of conducting surveys in hard-to-reach populations, where strong assumptions are required in order to make valid statistical inferences. In this paper, we investigate the assumption that network degrees are measured accurately by the RDS survey and find that significant measurement error is likely present in typical studies. We prove that most RDS estimators remain consistent under an imperfect measurement model with little to no added bias, though the variance of the estimators does increase.
On the Robustness of Respondent-Driven Sampling Estimators to Measurement Error. Ian E Fellows. Journal of Survey Statistics and Methodology, 10(2), pp. 377-396. Published 2022-04-01. DOI: 10.1093/jssam/smab056. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9014508/pdf/smab056.pdf
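The role of reported network degree in the entry above can be made concrete with a small simulation. This is a minimal sketch, not the paper's analysis: it uses the Volz-Heckathorn (RDS-II) inverse-degree weighted mean as one example of an RDS estimator. The toy population, the degree-proportional sampling with replacement (an idealized stand-in for a real recruitment chain), and the lognormal reporting-error model in the hypothetical helper `perturb_degrees` are all assumptions made for illustration.

```python
import random
import statistics


def rds_ii_mean(outcomes, degrees):
    """Volz-Heckathorn (RDS-II) estimate: inverse-degree weighted mean."""
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, outcomes)) / sum(weights)


def perturb_degrees(degrees, rel_sd, rng):
    """Hypothetical reporting error: multiplicative lognormal noise, floored at 1."""
    return [max(1.0, d * rng.lognormvariate(0.0, rel_sd)) for d in degrees]


rng = random.Random(0)

# Toy population (assumption): outcome prevalence rises with degree.
pop_degree = [rng.randint(1, 30) for _ in range(20000)]
pop_outcome = [1 if rng.random() < 0.1 + 0.02 * d else 0 for d in pop_degree]

# Idealized RDS stand-in (assumption): draw units with probability
# proportional to degree, with replacement.
idx = rng.choices(range(len(pop_degree)), weights=pop_degree, k=500)
sample_degree = [pop_degree[i] for i in idx]
sample_outcome = [pop_outcome[i] for i in idx]

truth = statistics.mean(pop_outcome)
naive = statistics.mean(sample_outcome)  # ignores unequal inclusion
clean_estimate = rds_ii_mean(sample_outcome, sample_degree)

# Re-estimate many times with noisy degree reports on the same sample.
noisy_estimates = [
    rds_ii_mean(sample_outcome, perturb_degrees(sample_degree, 0.5, rng))
    for _ in range(200)
]

print("population mean:        ", round(truth, 3))
print("unweighted sample mean: ", round(naive, 3))
print("RDS-II, exact degrees:  ", round(clean_estimate, 3))
print("RDS-II, noisy degrees:  ", round(statistics.mean(noisy_estimates), 3),
      "+/-", round(statistics.stdev(noisy_estimates), 3))
```

Comparing the last two printed lines gives a rough sense of how much extra spread degree misreporting adds under this particular error model; other error models would behave differently.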
Fabienne Kraemer, Henning Silber, Bella Struminskaya, M. Bošnjak, J. Kossmann, Bernd Weiss
Learning effects due to repeated interviewing, which are referred to as panel conditioning, are a major threat to response quality in later waves of a panel study. To date, research has not provided a clear picture regarding the circumstances, mechanisms, and dimensions of potential panel conditioning effects. In particular, the effects of conditioning frequency, that is, different levels of experience within a panel, on response quality are underexplored. Against this background, we investigated the effects of panel conditioning by using data from the GESIS Panel, a German mixed-mode probability-based panel study. Using two refreshment samples, we compared three panel cohorts with differing levels of experience with respect to several response quality indicators related to the mechanisms of reflection, satisficing, and social desirability. Overall, we find evidence for both negative (i.e., disadvantageous for response quality) and positive (i.e., advantageous for response quality) panel conditioning. Highly experienced respondents were more likely to satisfice by selecting mid-point responses or by speeding through the questionnaire. They also had a higher probability of refusing to answer sensitive questions than less experienced panel members. However, more experienced respondents were also more likely to optimize the response process by needing less time compared to panelists with lower experience levels (when controlling for speeding). In contrast, we did not find significant differences with respect to the number of “don’t know” responses, non-differentiation, the selection of first response categories, and the number of non-triggered filter questions. Of the observed differences, speeding showed the highest magnitude, with an average increase of 5.9 percentage points for highly experienced panel members compared to less experienced panelists.
Panel Conditioning in a German Probability-Based Longitudinal Study: A Comparison of Respondents with Different Levels of Survey Experience. Journal of Survey Statistics and Methodology. Published 2022-02-22. DOI: 10.31235/osf.io/vd5xp
Pub Date: 2022-01-20. eCollection Date: 2023-04-01. DOI: 10.1093/jssam/smab049
Yutao Liu, Andrew Gelman, Qixuan Chen
We consider inference from nonrandom samples in data-rich settings where high-dimensional auxiliary information is available both in the sample and the target population, with survey inference being a special case. We propose a regularized prediction approach that predicts the outcomes in the population using a large number of auxiliary variables such that the ignorability assumption is reasonable and the Bayesian framework is straightforward for quantification of uncertainty. Beyond the auxiliary variables, we extend the approach by estimating the propensity score for a unit to be included in the sample and including it as a predictor in the machine learning models. We find in simulation studies that the regularized predictions using soft Bayesian additive regression trees yield valid inference for the population means and coverage rates close to the nominal levels. We demonstrate the proposed methods in two real-data applications, one in a survey and one in an epidemiologic study.
Inference from Nonrandom Samples Using Bayesian Machine Learning. Journal of Survey Statistics and Methodology, 11(2), pp. 433-455. Published 2022-01-20. DOI: 10.1093/jssam/smab049. PDF: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC10080218/pdf/smab049.pdf
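The prediction idea in the entry above (fit an outcome model on the nonrandom sample, predict the outcome for every population unit from auxiliary information, then average the predictions) can be sketched in a few lines. This is only an illustration under stated assumptions: a one-covariate ridge regression is swapped in for the soft Bayesian additive regression trees named in the abstract, no posterior uncertainty is computed, and the simulated population, selection mechanism, and the function names `ridge_fit` and `predicted_population_mean` are hypothetical.

```python
import math
import random


def ridge_fit(x, y, lam):
    """Ridge regression with one centered covariate; returns (intercept, slope)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    s_xy = sum((a - x_bar) * (b - y_bar) for a, b in zip(x, y))
    s_xx = sum((a - x_bar) ** 2 for a in x)
    slope = s_xy / (s_xx + lam)  # the penalty lam shrinks the slope toward 0
    return y_bar - slope * x_bar, slope


def predicted_population_mean(x_sample, y_sample, x_population, lam=1.0):
    """Fit on the sample, predict for every population unit, average."""
    intercept, slope = ridge_fit(x_sample, y_sample, lam)
    return sum(intercept + slope * v for v in x_population) / len(x_population)


rng = random.Random(1)

# Simulated population (assumption): the auxiliary variable x is known for
# everyone, the outcome y is observed only for sampled units.
x_pop = [rng.gauss(0.0, 1.0) for _ in range(20000)]
y_pop = [2.0 + 1.5 * v + rng.gauss(0.0, 1.0) for v in x_pop]

# Nonrandom selection that depends on x only, so it is ignorable given x.
selected = [rng.random() < 1.0 / (1.0 + math.exp(-(v - 1.0))) for v in x_pop]
x_s = [v for v, s in zip(x_pop, selected) if s]
y_s = [v for v, s in zip(y_pop, selected) if s]

truth = sum(y_pop) / len(y_pop)
naive = sum(y_s) / len(y_s)
pred = predicted_population_mean(x_s, y_s, x_pop)

print("population mean:      ", round(truth, 3))
print("naive sample mean:    ", round(naive, 3))
print("prediction estimate:  ", round(pred, 3))
```

Because selection here depends only on the auxiliary variable, conditioning on it in the outcome model is enough to correct the naive sample mean; with selection on unobserved factors this sketch would not recover the population mean.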