Background: The disease burden of sexually transmitted infections such as chlamydia, gonorrhea, and syphilis is often compared across age, sex, and race and ethnicity categories. Missing data may prevent researchers from accurately characterizing health disparities between populations. This article describes the methods used to impute race and Hispanic ethnicity in a large national surveillance data set.
Methods: All US cases of chlamydia, gonorrhea, and syphilis (excluding congenital syphilis) reported through the National Notifiable Diseases Surveillance System in 2019 were included in the analyses. We used fully conditional specification to impute missing race and Hispanic ethnicity data. After imputation, reported case rates were calculated, by disease, for each race and Hispanic ethnicity category using Vintage 2019 Population and Housing Unit Estimates from the US Census. We then used case counts from subsets that contained only complete race and Hispanic ethnicity information to investigate whether the confidence intervals from the multiply imputed data included the observed number of cases in each race and Hispanic ethnicity category.
Results: Among the 2,553,038 cases reported in 2019, race and Hispanic ethnicity were multiply imputed for 9% of syphilis cases, 22% of gonorrhea cases, and 33% of chlamydia cases. In the subset analyses, every nonzero rate of reported cases fell within the confidence intervals calculated from the multiply imputed data.
Conclusions: Multiple imputation offers an advantage over complete-case analysis: its confidence intervals account for the uncertainty of the predictions, and a realistic variance estimate allows for valid hypothesis testing.
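
Illustration: The abstract does not state how estimates from the imputed data sets were combined. A common choice for multiply imputed data is Rubin's rules, and the sketch below assumes that approach purely for illustration. The function names, the normal-approximation interval (Rubin's rules usually use a t reference distribution), and all input values are hypothetical and are not taken from the study.

```python
import math
from statistics import mean, variance


def rate_per_100k(cases, population):
    """Reported case rate per 100,000 population."""
    return cases / population * 100_000


def pool_rubin(estimates, within_variances):
    """Combine m per-imputation estimates with Rubin's rules.

    Returns the pooled estimate and its total variance
    T = U_bar + (1 + 1/m) * B, where U_bar is the mean
    within-imputation variance and B is the between-imputation
    variance of the estimates.
    """
    m = len(estimates)
    if m < 2 or m != len(within_variances):
        raise ValueError("need at least two imputations with matching variances")
    q_bar = mean(estimates)           # pooled point estimate
    u_bar = mean(within_variances)    # average within-imputation variance
    b = variance(estimates)           # between-imputation variance (m - 1 denominator)
    total_var = u_bar + (1 + 1 / m) * b
    return q_bar, total_var


def normal_ci(estimate, total_var, z=1.96):
    """Approximate 95% interval; a simplification of the usual t-based interval."""
    half_width = z * math.sqrt(total_var)
    return estimate - half_width, estimate + half_width


# Hypothetical case counts for one category across three imputed data sets.
counts = [100.0, 110.0, 90.0]
within = [4.0, 4.0, 4.0]  # hypothetical within-imputation variances
pooled, total_var = pool_rubin(counts, within)
lower, upper = normal_ci(pooled, total_var)

# Subset-style check: does the interval contain a (hypothetical) observed count?
observed = 104.0
contains_observed = lower <= observed <= upper
```

The between-imputation term is what widens the interval relative to a complete-case or single-imputation analysis, which is the source of the more realistic variance estimate described in the Conclusions.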