Survey researchers have carefully modified their data collection operations for various reasons, including the rising costs of data collection and the ongoing Coronavirus disease (COVID-19) pandemic, both of which have made in-person interviewing difficult. For large national surveys that require household (HH) screening to determine survey eligibility, cost-efficient screening methods that do not include in-person visits need additional evaluation and testing. A new study, known as the American Family Health Study (AFHS), recently initiated data collection with a national probability sample, using a sequential mixed-mode mail/web protocol for push-to-web US HH screening (targeting persons aged 18-49 years). To better understand optimal approaches for this type of national screening effort, we embedded two randomized experiments in the AFHS data collection. The first tested the use of bilingual respondent materials where mailed invitations to the screener were sent in both English and Spanish to 50 percent of addresses with a high predicted likelihood of having a Spanish speaker and 10 percent of all other addresses. We found that the bilingual approach did not increase the response rate of high-likelihood Spanish-speaking addresses, but consistent with prior work, it increased the proportion of eligible Hispanic respondents identified among completed screeners, especially among addresses predicted to have a high likelihood of having Spanish speakers. The second tested a form of nonresponse follow-up, where a subsample of active sampled HHs that had not yet responded to the screening invitations was sent a priority mailing with a $5 incentive, adding to the $2 incentive provided for all sampled HHs in the initial screening invitation. We found this approach to be quite valuable for increasing the screening survey response rate.
Data synthesis is an effective statistical approach for reducing data disclosure risk. Generating fully synthetic data might minimize such risk, but its modeling and application can be difficult for data from large, complex surveys. This article extended two-stage imputation to simultaneously impute item missing values and generate fully synthetic data. A new combining rule for making inferences using data generated in this manner was developed. Two semiparametric missing data imputation models were adapted to generate fully synthetic data for skewed continuous variables and sparse binary variables, respectively. The proposed approach was evaluated using simulated data and real longitudinal data from the Health and Retirement Study. Using real data, the proposed approach was also compared with two existing synthesis approaches: (1) parametric regression models as implemented in IVEware; and (2) nonparametric Classification and Regression Trees as implemented in the synthpop package for R. The results show that high data utility is maintained for a wide variety of descriptive and model-based statistics using the proposed strategy. The proposed strategy also performs better than existing methods for sophisticated analyses such as factor analysis.
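For orientation, the sketch below shows the kind of calculation a combining rule performs. The abstract does not state the new rule it develops, so this is not that rule: it is the standard combining rule for fully synthetic data without a missing-data stage, in which the point estimate is the average over the m synthetic data sets and the variance estimate is (1 + 1/m) times the between-data-set variance minus the average within-data-set variance. The function name and the input numbers are hypothetical.

```python
from statistics import mean, variance


def combine_fully_synthetic(estimates, variances):
    """Combine results from m fully synthetic data sets.

    Standard fully synthetic rule (an assumption here, NOT the new
    two-stage rule developed in the article):
        q_bar = mean of the m point estimates
        b     = between-data-set variance of the point estimates
        u_bar = mean of the m within-data-set variance estimates
        T     = (1 + 1/m) * b - u_bar
    T can be negative in small samples; callers must handle that case.
    """
    m = len(estimates)
    if m < 2 or len(variances) != m:
        raise ValueError("need at least two data sets and matching lengths")
    q_bar = mean(estimates)
    b = variance(estimates)   # sample variance, divisor m - 1
    u_bar = mean(variances)
    total_var = (1 + 1 / m) * b - u_bar
    return q_bar, total_var


# Hypothetical estimates and variances from three synthetic data sets.
q_bar, total_var = combine_fully_synthetic([10.0, 12.0, 14.0], [1.0, 1.0, 1.0])
```

Because the within-data-set term is subtracted, the variance estimate can be negative when m is small, which is one reason analyses of this kind usually rely on more than a handful of synthetic data sets.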
Live video (LV) communication tools (e.g., Zoom) have the potential to provide survey researchers with many of the benefits of in-person interviewing, while also greatly reducing data collection costs, given that interviewers do not need to travel and make in-person visits to sampled households. The COVID-19 pandemic has exposed the vulnerability of in-person data collection to public health crises, forcing survey researchers to explore remote data collection modes, such as LV interviewing, that seem likely to yield high-quality data without in-person interaction. Given the potential benefits of these technologies, the operational and methodological aspects of video interviewing have started to receive research attention from survey methodologists. Although it is remote, video interviewing still involves respondent-interviewer interaction that introduces the possibility of interviewer effects. No research to date has evaluated this potential threat to the quality of the data collected in video interviews. This research note presents an evaluation of interviewer effects in a recent experimental study of alternative approaches to video interviewing, including both LV interviewing and the use of prerecorded videos of the same interviewers asking questions embedded in a web survey ("prerecorded video" interviewing). We find little evidence of significant interviewer effects when using these two approaches, which is a promising result. We also find that when interviewer effects were present, they tended to be slightly larger in the LV approach, as would be expected given that it is the interactive approach. We conclude with a discussion of the implications of these findings for future research using video interviewing.
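Interviewer effects are commonly summarized as an intraclass correlation (ICC): the share of response variance attributable to interviewers. The abstract does not say which estimator the study used, so the following is only a minimal sketch of the one-way ANOVA estimator for a balanced design, with a hypothetical function name and made-up responses.

```python
def anova_icc(groups):
    """One-way ANOVA estimate of the intraclass correlation.

    `groups` holds one list of responses per interviewer. This sketch
    assumes a balanced design (same number of respondents for every
    interviewer), which is an illustrative simplification.
    """
    n = len(groups)
    k = len(groups[0])
    if n < 2 or k < 2 or any(len(g) != k for g in groups):
        raise ValueError("need >= 2 interviewers with equal workloads >= 2")
    grand_mean = sum(sum(g) for g in groups) / (n * k)
    group_means = [sum(g) / k for g in groups]
    # Mean square between interviewers and mean square within interviewers.
    msb = k * sum((m - grand_mean) ** 2 for m in group_means) / (n - 1)
    msw = sum(
        (y - m) ** 2 for g, m in zip(groups, group_means) for y in g
    ) / (n * (k - 1))
    denom = msb + (k - 1) * msw
    if denom == 0:
        raise ValueError("no variation in responses")
    return (msb - msw) / denom


# Hypothetical responses for three interviewers, four respondents each.
icc = anova_icc([[3, 4, 3, 4], [4, 4, 5, 4], [3, 3, 4, 3]])
```

An estimate near zero indicates that responses vary little by interviewer. The ANOVA estimator can fall below zero when between-interviewer variation is smaller than expected by chance.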
Respondent-driven sampling (RDS) is a popular method of conducting surveys in hard-to-reach populations, but it requires strong assumptions in order to make valid statistical inferences. In this paper, we investigate the assumption that network degrees are measured accurately by the RDS survey and find that significant measurement error is likely present in typical studies. We prove that most RDS estimators remain consistent under an imperfect measurement model with little to no added bias, though the variance of the estimators does increase.
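To see where reported degrees enter an RDS estimator, consider the Volz-Heckathorn (RDS-II) estimator, which weights each respondent by the inverse of their reported network degree. The abstract does not name the estimators it analyzes or specify its measurement model, so the sketch below makes two assumptions: RDS-II serves as a representative estimator, and degree misreporting follows a hypothetical multiplicative error. All data values are made up.

```python
import random


def rds_ii_mean(values, degrees):
    """Volz-Heckathorn (RDS-II) estimate: inverse-degree weighted mean."""
    if len(values) != len(degrees) or not values:
        raise ValueError("values and degrees must be non-empty and aligned")
    weights = [1.0 / d for d in degrees]
    return sum(w * y for w, y in zip(weights, values)) / sum(weights)


def perturb_degrees(degrees, rel_error, rng):
    """Hypothetical reporting error: each degree is scaled by a random
    factor in [1 - rel_error, 1 + rel_error] and rounded, floor of 1."""
    return [
        max(1, round(d * rng.uniform(1 - rel_error, 1 + rel_error)))
        for d in degrees
    ]


# Made-up binary outcomes and true network degrees for six respondents.
values = [1, 0, 1, 0, 0, 1]
degrees = [2, 10, 4, 8, 5, 3]

estimate_true = rds_ii_mean(values, degrees)
rng = random.Random(0)  # fixed seed so the run is reproducible
estimate_noisy = rds_ii_mean(values, perturb_degrees(degrees, 0.3, rng))
```

Repeating the perturbation many times and comparing the spread of `estimate_noisy` around `estimate_true` gives an informal look at how degree misreporting affects the estimate.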
Respondent-driven sampling (RDS) is an approach commonly used to recruit nonprobability samples of rare and hard-to-find populations. The purpose of this study was to explore the utility of phone and web-based RDS methodology to sample sexual minority women (SMW) for participation in a telephone survey. Key features included 1) utilizing a national probability survey sample to select seeds; 2) web-based recruitment with emailed coupons; and 3) virtual processes for orienting, screening, and scheduling potential participants for computer-assisted telephone interviews. Rather than resulting in a large diverse sample of SMW, only a small group of randomly selected women completed the survey and agreed to recruit their peers, and very few women recruited even one participant. Only seeds from the most recent of two waves of the probability study generated new SMW recruits. Three RDS attempts to recruit SMW over several years and findings from brief qualitative interviews revealed four key challenges to successful phone and web-based RDS with this population. First, population-based sampling precludes sampling based on participant characteristics that are often used in RDS. Second, methods that distance prospective participants from the research team may impede development of relationships, investment in the study, and motivation to participate. Third, recruitment for telephone surveys may be impeded by multiple burdens on seeds and recruits (e.g., survey length, understanding the study and RDS process). Finally, many seeds from a population-based sample may be needed, which is not generally feasible when working with a limited pool of potential seeds. This method may yield short recruitment chains, which would not meet key RDS assumptions for approximation of a probability sample. In conclusion, potential challenges to using RDS in studies with SMW, particularly those using virtual approaches, should be considered.
We consider inference from nonrandom samples in data-rich settings where high-dimensional auxiliary information is available in both the sample and the target population, with survey inference being a special case. We propose a regularized prediction approach that predicts the outcomes in the population using a large number of auxiliary variables, such that the ignorability assumption is reasonable and the Bayesian framework is straightforward for quantification of uncertainty. We also extend the approach by estimating the propensity score for a unit to be included in the sample and including it, alongside the auxiliary variables, as a predictor in the machine learning models. We find in simulation studies that the regularized predictions using soft Bayesian additive regression trees yield valid inference for the population means and coverage rates close to the nominal levels. We demonstrate the application of the proposed methods using two real data applications, one in a survey and one in an epidemiologic study.