Spatio-temporal quasi-experimental methods for rare disease outcomes: the impact of reformulated gasoline on childhood haematologic cancer.
Pub Date: 2024-11-18 | eCollection Date: 2025-10-01 | DOI: 10.1093/jrsssa/qnae109
Sofia L Vega, Rachel C Nethery
Although some pollutants emitted in vehicle exhaust, such as benzene, are known to cause leukaemia in adults with high exposure levels, less is known about the relationship between traffic-related air pollution (TRAP) and childhood haematologic cancer. In the 1990s, the US EPA enacted the reformulated gasoline program in select areas of the US, which drastically reduced ambient TRAP in affected areas. This created an ideal quasi-experiment for studying the effects of TRAP on childhood haematologic cancers. However, existing methods for quasi-experimental analyses can perform poorly when outcomes are rare and unstable, as with childhood cancer incidence. We develop Bayesian spatio-temporal matrix completion methods to conduct causal inference in quasi-experimental settings with rare outcomes. Selective information sharing across space and time enables stable estimation, and the Bayesian approach facilitates uncertainty quantification. We evaluate the methods through simulations and apply them to estimate the causal effects of TRAP on childhood leukaemia and lymphoma.
{"title":"Spatio-temporal quasi-experimental methods for rare disease outcomes: the impact of reformulated gasoline on childhood haematologic cancer.","authors":"Sofia L Vega, Rachel C Nethery","doi":"10.1093/jrsssa/qnae109","DOIUrl":"10.1093/jrsssa/qnae109","url":null,"abstract":"<p><p>Although some pollutants emitted in vehicle exhaust, such as benzene, are known to cause leukaemia in adults with high exposure levels, less is known about the relationship between traffic-related air pollution (TRAP) and childhood haematologic cancer. In the 1990s, the US EPA enacted the reformulated gasoline program in select areas of the U.S., which drastically reduced ambient TRAP in affected areas. This created an ideal quasi-experiment to study the effects of TRAP on childhood haematologic cancers. However, existing methods for quasi-experimental analyses can perform poorly when outcomes are rare and unstable, as with childhood cancer incidence. We develop Bayesian spatio-temporal matrix completion methods to conduct causal inference in quasi-experimental settings with rare outcomes. Selective information sharing across space and time enables stable estimation, and the Bayesian approach facilitates uncertainty quantification. We evaluate the methods through simulations and apply them to estimate the causal effects of TRAP on childhood leukaemia and lymphoma.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 4","pages":"1184-1202"},"PeriodicalIF":1.6,"publicationDate":"2024-11-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12503115/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145253449","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Studying Chinese immigrants' spatial distribution in the Raleigh-Durham area by linking survey and commercial data using romanized names.
Pub Date: 2024-10-23 | eCollection Date: 2025-01-01 | DOI: 10.1093/jrsssa/qnae107
Eric A Bai, Botao Ju, Madeleine Beckner, Jerome P Reiter, M Giovanna Merli, Ted Mouw
Many population surveys do not provide information on respondents' residential addresses, instead offering coarse geographies like zip code or higher aggregations. However, fine-resolution geography can be beneficial for characterizing neighbourhoods, especially for relatively rare populations such as immigrants. One way to obtain such information is to link survey records to records in auxiliary databases that include residential addresses by matching on variables common to both files. We present an approach based on probabilistic record linkage that matches survey participants in the Chinese Immigrants in Raleigh-Durham Study to records from InfoUSA, a commercial provider of residential records. The two files use different Chinese name romanization practices, which we address through a novel and generalizable strategy for constructing records' pairwise comparison vectors for romanized names. Using a fully Bayesian record linkage model, we characterize the geospatial distribution of Chinese immigrants in the Raleigh-Durham area of North Carolina.
{"title":"Studying Chinese immigrants' spatial distribution in the Raleigh-Durham area by linking survey and commercial data using romanized names.","authors":"Eric A Bai, Botao Ju, Madeleine Beckner, Jerome P Reiter, M Giovanna Merli, Ted Mouw","doi":"10.1093/jrsssa/qnae107","DOIUrl":"10.1093/jrsssa/qnae107","url":null,"abstract":"<p><p>Many population surveys do not provide information on respondents' residential addresses, instead offering coarse geographies like zip code or higher aggregations. However, fine resolution geography can be beneficial for characterizing neighbourhoods, especially for relatively rare populations such as immigrants. One way to obtain such information is to link survey records to records in auxiliary databases that include residential addresses by matching on variables common to both files. We present an approach based on probabilistic record linkage that enables matching survey participants in the Chinese Immigrants in Raleigh-Durham Study to records from InfoUSA, an information provider of residential records. The two files use different Chinese name romanization practices, which we address through a novel and generalizable strategy for constructing records' pairwise comparison vectors for romanized names. Using a fully Bayesian record linkage model, we characterize the geospatial distribution of Chinese immigrants in the Raleigh-Durham area of North Carolina.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 1","pages":"84-97"},"PeriodicalIF":1.6,"publicationDate":"2024-10-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11728054/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142985303","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A comparison of some existing and novel methods for integrating historical models to improve estimation of coefficients in logistic regression.
Pub Date: 2024-09-24 | eCollection Date: 2025-01-01 | DOI: 10.1093/jrsssa/qnae093
Philip S Boonstra, Pedro Orozco Del Pino
Model integration refers to the process of incorporating a fitted historical model into the estimation of a current study to increase statistical efficiency. Integration can be challenging when the current model includes new covariates, leading to potential model misspecification. We present and evaluate seven existing and novel model integration techniques, which employ both likelihood constraints and Bayesian informative priors. Using a simulation study of logistic regression, we quantify how efficiency (assessed by bias and variance) changes with the sample sizes of both the historical and current studies and in response to violations of transportability assumptions. We also apply these methods to a case study in which the goal is to use novel predictors to update a risk prediction model for in-hospital mortality among pediatric extracorporeal membrane oxygenation patients. Our simulation study and case study suggest that (i) when the historical sample size is small, accounting for its statistical uncertainty is more efficient; (ii) all methods lose efficiency when there are differences between the historical and current data-generating mechanisms; and (iii) additional shrinkage towards zero can improve efficiency in higher-dimensional settings, but at the cost of bias in estimation.
{"title":"A comparison of some existing and novel methods for integrating historical models to improve estimation of coefficients in logistic regression.","authors":"Philip S Boonstra, Pedro Orozco Del Pino","doi":"10.1093/jrsssa/qnae093","DOIUrl":"10.1093/jrsssa/qnae093","url":null,"abstract":"<p><p>Model integration refers to the process of incorporating a fitted historical model into the estimation of a current study to increase statistical efficiency. Integration can be challenging when the current model includes new covariates, leading to potential model misspecification. We present and evaluate seven existing and novel model integration techniques, which employ both likelihood constraints and Bayesian informative priors. Using a simulation study of logistic regression, we quantify how efficiency-assessed by bias and variance-changes with the sample sizes of both historical and current studies and in response to violations to transportability assumptions. We also apply these methods to a case study in which the goal is to use novel predictors to update a risk prediction model for in-hospital mortality among pediatric extracorporeal membrane oxygenation patients. Our simulation study and case study suggest that (i) when historical sample size is small, accounting for this statistical uncertainty is more efficient; (ii) all methods lose efficiency when there exist differences between the historical and current data-generating mechanisms; (iii) additional shrinkage to zero can improve efficiency in higher-dimensional settings but at the cost of bias in estimation.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 1","pages":"46-67"},"PeriodicalIF":1.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11728056/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142985253","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Doubly robust machine learning-based estimation methods for instrumental variables with an application to surgical care for cholecystitis.
Pub Date: 2024-09-24 | DOI: 10.1093/jrsssa/qnae089
Kenta Takatsu, Alexander W Levis, Edward Kennedy, Rachel Kelz, Luke Keele
Comparative effectiveness research frequently employs the instrumental variable design, since randomized trials can be infeasible for many reasons. In this study, we investigate treatments for emergency cholecystitis (inflammation of the gallbladder). A standard treatment for cholecystitis is surgical removal of the gallbladder, while alternative non-surgical treatments include managed care and pharmaceutical options. As randomized trials are judged to violate the principle of equipoise, we consider an instrument for operative care: the surgeon's tendency to operate. Standard instrumental variable estimation methods, however, often rely on parametric models that are prone to bias from model misspecification. We therefore outline instrumental variable methods based on the doubly robust machine learning framework. These methods allow a variety of machine learning techniques to be employed while delivering consistent estimates and permitting valid inference on various estimands. We use these methods to estimate the primary target estimand in an instrumental variable design. Additionally, we extend these methods to develop new estimators for heterogeneous causal effects, for profiling principal strata, and for sensitivity analyses of a key instrumental variable assumption. We conduct a simulation study to demonstrate scenarios where more flexible estimation methods outperform standard methods. Our findings indicate that operative care is generally more effective for cholecystitis patients, although the benefits of surgery can be less pronounced for key patient subgroups.
{"title":"Doubly robust machine learning-based estimation methods for instrumental variables with an application to surgical care for cholecystitis.","authors":"Kenta Takatsu, Alexander W Levis, Edward Kennedy, Rachel Kelz, Luke Keele","doi":"10.1093/jrsssa/qnae089","DOIUrl":"10.1093/jrsssa/qnae089","url":null,"abstract":"<p><p>Comparative effectiveness research frequently employs the instrumental variable design since randomized trials can be infeasible for many reasons. In this study, we investigate treatments for emergency <i>cholecystitis</i>-inflammation of the gallbladder. A standard treatment for cholecystitis is surgical removal of the gallbladder, while alternative non-surgical treatments include managed care and pharmaceutical options. As randomized trials are judged to violate the principle of equipoise, we consider an instrument for operative care: the surgeon's tendency to operate. Standard instrumental variable estimation methods, however, often rely on parametric models that are prone to bias from model misspecification. Thus, we outline instrumental variable methods based on the doubly robust machine learning framework. These methods enable us to employ various machine learning techniques, delivering consistent estimates, and permitting valid inference on various estimands. We use these methods to estimate the primary target estimand in an instrumental variable design. Additionally, we expand these methods to develop new estimators for heterogeneous causal effects, profiling principal strata, and sensitivity analyses for a key instrumental variable assumption. We conduct a simulation study to demonstrate scenarios where more flexible estimation methods outperform standard methods. Our findings indicate that operative care is generally more effective for cholecystitis patients, although the benefits of surgery can be less pronounced for key patient subgroups.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":" ","pages":""},"PeriodicalIF":1.6,"publicationDate":"2024-09-24","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12223449/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144692227","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Optimal risk-assessment scheduling for primary prevention of cardiovascular disease.
Pub Date: 2024-09-17 | eCollection Date: 2025-07-01 | DOI: 10.1093/jrsssa/qnae086
Francesca Gasperoni, Christopher H Jackson, Angela M Wood, Michael J Sweeting, Paul J Newcombe, David Stevens, Jessica K Barrett
In this work, we introduce a personalized and age-specific net benefit function, composed of benefits and costs, to recommend optimal timing of risk assessments for cardiovascular disease (CVD) prevention. We extend the 2-stage landmarking model to estimate patient-specific CVD risk profiles, adjusting for time-varying covariates. We apply our model to data from the Clinical Practice Research Datalink, comprising primary care electronic health records from the UK. We find that people at lower risk could be recommended an optimal risk-assessment interval of 5 years or more. Time-varying risk factors are required to discriminate between more frequent schedules for high-risk people.
{"title":"Optimal risk-assessment scheduling for primary prevention of cardiovascular disease.","authors":"Francesca Gasperoni, Christopher H Jackson, Angela M Wood, Michael J Sweeting, Paul J Newcombe, David Stevens, Jessica K Barrett","doi":"10.1093/jrsssa/qnae086","DOIUrl":"10.1093/jrsssa/qnae086","url":null,"abstract":"<p><p>In this work, we introduce a personalized and age-specific net benefit function, composed of benefits and costs, to recommend optimal timing of risk assessments for cardiovascular disease (CVD) prevention. We extend the 2-stage landmarking model to estimate patient-specific CVD risk profiles, adjusting for time-varying covariates. We apply our model to data from the Clinical Practice Research Datalink, comprising primary care electronic health records from the UK. We find that people at lower risk could be recommended an optimal risk-assessment interval of 5 years or more. Time-varying risk factors are required to discriminate between more frequent schedules for high-risk people.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 3","pages":"920-934"},"PeriodicalIF":1.5,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12256122/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144638527","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Integrating testing volume into bandit algorithms for infectious disease surveillance.
Pub Date: 2024-09-16 | eCollection Date: 2025-10-01 | DOI: 10.1093/jrsssa/qnae090
Joshua L Warren, Ottavia Prunas, A David Paltiel, Thomas Thornhill, Gregg S Gonsalves
Mobile testing services provide opportunities for active surveillance of infectious diseases for hard-to-reach and/or high-risk individuals who do not know their disease status. Identifying as many infected individuals as possible is important for mitigating disease transmission. Recently, multi-armed bandit sampling approaches have been adapted and applied in this setting to maximize the cumulative number of positive tests collected over time. However, these algorithms have not considered the possibility of variability in the number of tests administered across testing sites. What impact this variability has on the ability of these approaches to maximize yield is currently unknown. Therefore, we investigate this question by extending existing sampling frameworks to directly account for variability in testing volume while also maintaining the computational tractability of the previous methods. Through a simulation study based on human immunodeficiency virus infection characteristics in the Republic of the Congo (Congo-Brazzaville) as well as an application to COVID-19 testing data in Connecticut, we find improved long- and short-term performances of the new methods compared to several existing approaches. Based on these findings and the ease of computation, we recommend use of the newly developed methods for active surveillance of infectious diseases when variability in testing volume may be present.
{"title":"Integrating testing volume into bandit algorithms for infectious disease surveillance.","authors":"Joshua L Warren, Ottavia Prunas, A David Paltiel, Thomas Thornhill, Gregg S Gonsalves","doi":"10.1093/jrsssa/qnae090","DOIUrl":"10.1093/jrsssa/qnae090","url":null,"abstract":"<p><p>Mobile testing services provide opportunities for active surveillance of infectious diseases for hard-to-reach and/or high-risk individuals who do not know their disease status. Identifying as many infected individuals as possible is important for mitigating disease transmission. Recently, multi-armed bandit sampling approaches have been adapted and applied in this setting to maximize the cumulative number of positive tests collected over time. However, these algorithms have not considered the possibility of variability in the number of tests administered across testing sites. What impact this variability has on the ability of these approaches to maximize yield is currently unknown. Therefore, we investigate this question by extending existing sampling frameworks to directly account for variability in testing volume while also maintaining the computational tractability of the previous methods. Through a simulation study based on human immunodeficiency virus infection characteristics in the Republic of the Congo (Congo-Brazzaville) as well as an application to COVID-19 testing data in Connecticut, we find improved long- and short-term performances of the new methods compared to several existing approaches. Based on these findings and the ease of computation, we recommend use of the newly developed methods for active surveillance of infectious diseases when variability in testing volume may be present.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 4","pages":"1029-1043"},"PeriodicalIF":1.6,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12503114/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"145253487","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Synthesis estimators for transportability with positivity violations by a continuous covariate.
Pub Date: 2024-09-02 | eCollection Date: 2025-01-01 | DOI: 10.1093/jrsssa/qnae084
Paul N Zivich, Jessie K Edwards, Bonnie E Shook-Sa, Eric T Lofgren, Justin Lessler, Stephen R Cole
Studies intended to estimate the effect of a treatment, like randomized trials, may not be sampled from the desired target population. To correct for this discrepancy, estimates can be transported to the target population. Methods for transporting between populations are often premised on a positivity assumption, such that all relevant covariate patterns in one population are also present in the other. However, eligibility criteria, particularly in the case of trials, can result in violations of positivity when transporting to external populations. To address nonpositivity, a synthesis of statistical and mathematical models can be considered. This approach integrates multiple data sources (e.g. trials, observational, pharmacokinetic studies) to estimate treatment effects, leveraging mathematical models to handle positivity violations. This approach was previously demonstrated for positivity violations by a single binary covariate. Here, we extend the synthesis approach for positivity violations with a continuous covariate. For estimation, two novel augmented inverse probability weighting estimators are proposed. Both estimators are contrasted with other common approaches for addressing nonpositivity. Empirical performance is compared via Monte Carlo simulation. Finally, the competing approaches are illustrated with an example in the context of two-drug vs. one-drug antiretroviral therapy on CD4 T cell counts among women with HIV.
{"title":"Synthesis estimators for transportability with positivity violations by a continuous covariate.","authors":"Paul N Zivich, Jessie K Edwards, Bonnie E Shook-Sa, Eric T Lofgren, Justin Lessler, Stephen R Cole","doi":"10.1093/jrsssa/qnae084","DOIUrl":"10.1093/jrsssa/qnae084","url":null,"abstract":"<p><p>Studies intended to estimate the effect of a treatment, like randomized trials, may not be sampled from the desired target population. To correct for this discrepancy, estimates can be transported to the target population. Methods for transporting between populations are often premised on a positivity assumption, such that all relevant covariate patterns in one population are also present in the other. However, eligibility criteria, particularly in the case of trials, can result in violations of positivity when transporting to external populations. To address nonpositivity, a synthesis of statistical and mathematical models can be considered. This approach integrates multiple data sources (e.g. trials, observational, pharmacokinetic studies) to estimate treatment effects, leveraging mathematical models to handle positivity violations. This approach was previously demonstrated for positivity violations by a single binary covariate. Here, we extend the synthesis approach for positivity violations with a continuous covariate. For estimation, two novel augmented inverse probability weighting estimators are proposed. Both estimators are contrasted with other common approaches for addressing nonpositivity. Empirical performance is compared via Monte Carlo simulation. Finally, the competing approaches are illustrated with an example in the context of two-drug vs. one-drug antiretroviral therapy on CD4 T cell counts among women with HIV.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 1","pages":"158-180"},"PeriodicalIF":1.6,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11728055/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142985305","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Bayesian spatial-temporal varying coefficients model for estimating excess deaths associated with respiratory infections.
Pub Date: 2024-08-19 | eCollection Date: 2025-07-01 | DOI: 10.1093/jrsssa/qnae079
Yuzi Zhang, Howard H Chang, Angela D Iuliano, Carrie Reed
Disease surveillance data are used for monitoring and understanding disease burden, which provides valuable information for allocating health programme resources. Statistical methods play an important role in estimating disease burden since disease surveillance systems are prone to undercounting. This paper is motivated by the challenge of estimating mortality associated with respiratory infections (e.g. influenza and COVID-19) that is not ascertained from death certificates. We propose a Bayesian spatial-temporal model incorporating measures of infection activity to estimate excess deaths. In particular, the inclusion of time-varying coefficients allows us to better characterize associations between infection activity and mortality count time series. Software to implement this method is available in the R package NBRegAD. Applying our modelling framework to weekly state-wide COVID-19 data in the US from 8 March 2020 to 3 July 2022, we identified temporal and spatial differences in excess deaths between different age groups. We estimated the total number of COVID-19 deaths in the US to be 1,168,481 (95% CI: 1,148,953 to 1,187,187), compared with the 1,022,147 obtained using only death certificate information. The analysis also suggests that the most severe undercounting was in the 18-49 years age group, with an estimated underascertainment rate of 0.21 (95% CI: 0.16, 0.25).
{"title":"A Bayesian spatial-temporal varying coefficients model for estimating excess deaths associated with respiratory infections.","authors":"Yuzi Zhang, Howard H Chang, Angela D Iuliano, Carrie Reed","doi":"10.1093/jrsssa/qnae079","DOIUrl":"10.1093/jrsssa/qnae079","url":null,"abstract":"<p><p>Disease surveillance data are used for monitoring and understanding disease burden, which provides valuable information in allocating health programme resources. Statistical methods play an important role in estimating disease burden since disease surveillance systems are prone to undercounting. This paper is motivated by the challenge of estimating mortality associated with respiratory infections (e.g. influenza and COVID-19) that are not ascertained from death certificates. We propose a Bayesian spatial-temporal model incorporating measures of infection activity to estimate excess deaths. Particularly, the inclusion of time-varying coefficients allows us to better characterize associations between infection activity and mortality counts time series. Software to implement this method is available in the R package NBRegAD. Applying our modelling framework to weekly state-wide COVID-19 data in the US from 8 March 2020 to 3 July 2022, we identified temporal and spatial differences in excess deaths between different age groups. We estimated the total number of COVID-19 deaths in the US to be 1,168,481 (95% CI: 1,148,953 1,187,187) compared to the 1,022,147 from using only death certificate information. The analysis also suggests that the most severe undercounting was in the 18-49 years age group with an estimated underascertainment rate of 0.21 (95% CI: 0.16, 0.25).</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 3","pages":"843-858"},"PeriodicalIF":1.5,"publicationDate":"2024-08-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC12256124/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"144638526","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Data-integration with pseudoweights and survey-calibration: application to developing US-representative lung cancer risk models for use in screening.
Pub Date: 2024-07-12 | eCollection Date: 2025-01-01 | DOI: 10.1093/jrsssa/qnae059
Lingxiao Wang, Yan Li, Barry I Graubard, Hormuzd A Katki
Accurate cancer risk estimation is crucial to clinical decision-making, such as identifying high-risk people for screening. However, most existing cancer risk models incorporate data from epidemiologic studies, which usually cannot represent the target population. While population-based health surveys are ideal for making inference to the target population, they typically do not collect time-to-cancer-incidence data. Instead, time to cancer-specific mortality is often readily available on surveys via linkage to vital statistics. We develop calibrated pseudoweighting methods that integrate individual-level data from a cohort and a survey with summary statistics of cancer incidence from national cancer registries. By leveraging individual-level cancer mortality data in the survey, the proposed methods impute time to cancer incidence for survey sample individuals and use survey calibration, with auxiliary variables built from Cox regression influence functions, to improve the robustness and efficiency of the inverse-propensity pseudoweighting method in estimating pure risks. We develop a lung cancer incidence pure risk model from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial using our proposed methods by integrating data from the National Health Interview Survey and cancer registries.
{"title":"Data-integration with pseudoweights and survey-calibration: application to developing US-representative lung cancer risk models for use in screening.","authors":"Lingxiao Wang, Yan Li, Barry I Graubard, Hormuzd A Katki","doi":"10.1093/jrsssa/qnae059","DOIUrl":"10.1093/jrsssa/qnae059","url":null,"abstract":"<p><p>Accurate cancer risk estimation is crucial to clinical decision-making, such as identifying high-risk people for screening. However, most existing cancer risk models incorporate data from epidemiologic studies, which usually cannot represent the target population. While population-based health surveys are ideal for making inference to the target population, they typically do not collect time-to-cancer incidence data. Instead, time-to-cancer specific mortality is often readily available on surveys via linkage to vital statistics. We develop calibrated pseudoweighting methods that integrate individual-level data from a cohort and a survey, and summary statistics of cancer incidence from national cancer registries. By leveraging individual-level cancer mortality data in the survey, the proposed methods impute time-to-cancer incidence for survey sample individuals and use survey calibration with auxiliary variables of influence functions generated from Cox regression to improve robustness and efficiency of the inverse-propensity pseudoweighting method in estimating pure risks. We develop a lung cancer incidence pure risk model from the Prostate, Lung, Colorectal, and Ovarian Cancer Screening Trial using our proposed methods by integrating data from the National Health Interview Survey and cancer registries.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"188 1","pages":"119-139"},"PeriodicalIF":1.5,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11728053/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142985289","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A framework for understanding selection bias in real-world healthcare data.
Pub Date: 2024-05-02 | eCollection Date: 2024-08-01 | DOI: 10.1093/jrsssa/qnae039
Ritoban Kundu, Xu Shi, Jean Morrison, Jessica Barrett, Bhramar Mukherjee
Using administrative patient-care data, such as Electronic Health Records (EHR) and medical/pharmaceutical claims, for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Of these multiple sources of bias, in this paper we focus on understanding selection bias. We present an analytic framework using directed acyclic graphs to guide applied researchers in dissecting how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias, with accompanying variance formulae. Through a simulation study, we demonstrate when these approaches can rescue an analysis of real-world data in practice. We compare the methods in a data example whose goal is to estimate the well-known association between cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R code to implement these weighted methods with associated inference.
{"title":"A framework for understanding selection bias in real-world healthcare data.","authors":"Ritoban Kundu, Xu Shi, Jean Morrison, Jessica Barrett, Bhramar Mukherjee","doi":"10.1093/jrsssa/qnae039","DOIUrl":"10.1093/jrsssa/qnae039","url":null,"abstract":"<p><p>Using administrative patient-care data such as Electronic Health Records (EHR) and medical/pharmaceutical claims for population-based scientific research has become increasingly common. With vast sample sizes leading to very small standard errors, researchers need to pay more attention to potential biases in the estimates of association parameters of interest, specifically to biases that do not diminish with increasing sample size. Of these multiple sources of biases, in this paper, we focus on understanding selection bias. We present an analytic framework using directed acyclic graphs for guiding applied researchers to dissect how different sources of selection bias may affect estimates of the association between a binary outcome and an exposure (continuous or categorical) of interest. We consider four easy-to-implement weighting approaches to reduce selection bias with accompanying variance formulae. We demonstrate through a simulation study when they can rescue us in practice with analysis of real-world data. We compare these methods using a data example where our goal is to estimate the well-known association of cancer and biological sex, using EHR from a longitudinal biorepository at the University of Michigan Healthcare system. We provide annotated R codes to implement these weighted methods with associated inference.</p>","PeriodicalId":49983,"journal":{"name":"Journal of the Royal Statistical Society Series A-Statistics in Society","volume":"187 3","pages":"606-635"},"PeriodicalIF":1.5,"publicationDate":"2024-05-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.ncbi.nlm.nih.gov/pmc/articles/PMC11393555/pdf/","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142299713","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"数学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}