Matthew Spotnitz, John Giannini, Yechiam Ostchega, Stephanie L Goff, Lakshmi Priya Anandan, Emily Clark, Tamara R Litwin, Lew Berman
Assessing the Data Quality Dimensions of Partial and Complete Mastectomy Cohorts in the All of Us Research Program: Cross-Sectional Study
JMIR Cancer, volume 11, e59298. Published 2025-03-11. doi: 10.2196/59298 (https://doi.org/10.2196/59298)
Abstract
Background: Breast cancer is prevalent among females in the United States. Nonmetastatic disease is treated by partial or complete mastectomy procedures. However, the rates of those procedures vary across practices. Generating real-world evidence on breast cancer surgery could lead to improved and more consistent practices. We investigated the quality of data from the All of Us Research Program, a precision medicine initiative that collected real-world electronic health care data from different sites in the United States, both retrospectively and prospectively relative to participant enrollment.
Objective: The paper aims to determine whether All of Us data are fit for use in generating real-world evidence on mastectomy procedures.
Methods: Our mastectomy phenotype consisted of adult female participants who had CPT4 (Current Procedural Terminology 4), ICD-9 (International Classification of Diseases, Ninth Revision) procedure, or SNOMED (Systematized Nomenclature of Medicine) codes for a partial or complete mastectomy procedure that mapped to Observational Medical Outcomes Partnership Common Data Model concepts. We evaluated the phenotype with a data quality dimensions (DQD) framework that consisted of 5 elements: conformance, completeness, concordance, plausibility, and temporality. Also, we applied a previously developed DQD checklist to evaluate concept selection, internal verification, and external validation for each dimension. We compared the DQD of our cohort to a control group of adult women who did not have a mastectomy procedure. Our subgroup analysis compared partial to complete mastectomy procedure phenotypes.
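To make one of the five dimensions concrete, the following is a minimal sketch of a concordance check: counting participants whose mastectomy procedure is recorded in two vocabularies at once. The records, the vocabulary labels, and the `overlap_counts` helper are hypothetical and only illustrate the idea; they are not the study's data or code.

```python
from collections import Counter

# Hypothetical toy records: participant id -> vocabularies in which a
# mastectomy code was recorded. Illustrative only, not All of Us data.
records = {
    "p1": {"CPT4", "SNOMED"},
    "p2": {"CPT4"},
    "p3": {"ICD9Proc", "SNOMED"},
    "p4": {"SNOMED"},
    "p5": {"CPT4", "SNOMED"},
}

def overlap_counts(records, pairs):
    """Count participants whose codes appear in both vocabularies of each pair."""
    counts = Counter()
    for vocabularies in records.values():
        for first, second in pairs:
            if first in vocabularies and second in vocabularies:
                counts[(first, second)] += 1
    return counts

pairs = [("CPT4", "SNOMED"), ("ICD9Proc", "SNOMED")]
counts = overlap_counts(records, pairs)

# Report each overlap as a count and a share of the whole toy cohort.
for pair in pairs:
    share = 100 * counts[pair] / len(records)
    print(pair, counts[pair], f"{share:.1f}%")
```

Reporting the overlap as a share of the full cohort is a design choice made here for simplicity; a different denominator (for example, only participants with codes in either vocabulary) would give different percentages.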
Results: There were 4175 female participants aged 18 years or older in the partial or complete mastectomy cohort, and 168,226 participants in the control cohort. The geospatial distribution of our cohort varied across states. For example, our cohort consisted of 835 (20%) participants from Massachusetts, but multiple other states contributed fewer than 20 participants. We compared the sociodemographic characteristics of the partial (n=2607) and complete (n=1568) mastectomy subgroups. Those groups differed in the distribution of age at procedure (P<.001), education (P=.02), and income (P=.03) levels, as per χ² analysis. A total of 367 (9.9%) participants in our cohort had overlapping CPT4 and SNOMED codes for a mastectomy, and 63 (1.5%) had overlapping ICD-9 procedure and SNOMED codes. The prevalence of breast cancer-related concepts was higher in our cohort compared to the control group (P<.001). In both the partial and complete mastectomy subgroups, the correlations among concepts were consistent with the clinical management of breast cancer. The median time between biopsy and mastectomy was 5.5 (IQR 3.5-11.2) weeks. Although we did not have external benchmark comparisons, we were able to evaluate concept selection and internal verification for all domains.
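A temporality summary such as the median and IQR of the biopsy-to-mastectomy interval can be sketched as follows. The dates are invented for illustration, the `interval_weeks` helper is hypothetical, and dropping pairs in which the mastectomy precedes the biopsy is an assumption of this sketch rather than a rule stated in the abstract.

```python
from datetime import date
from statistics import median, quantiles

# Hypothetical (biopsy_date, mastectomy_date) pairs; not study data.
date_pairs = [
    (date(2020, 1, 6), date(2020, 2, 3)),
    (date(2020, 3, 2), date(2020, 4, 13)),
    (date(2020, 5, 4), date(2020, 5, 25)),
    (date(2020, 6, 1), date(2020, 8, 24)),
    (date(2020, 9, 7), date(2020, 10, 12)),
    (date(2020, 11, 2), date(2020, 10, 26)),  # implausible order, excluded below
]

def interval_weeks(date_pairs):
    """Return biopsy-to-mastectomy intervals in weeks, skipping negative ones."""
    weeks = []
    for biopsy, mastectomy in date_pairs:
        days = (mastectomy - biopsy).days
        if days >= 0:  # assumption: a mastectomy before its biopsy is implausible
            weeks.append(days / 7)
    return weeks

weeks = interval_weeks(date_pairs)
med = median(weeks)
# quantiles(n=4) returns the three quartile cut points; the first and last
# bound the IQR. The default interpolation method is "exclusive".
q1, _, q3 = quantiles(weeks, n=4)
print(f"median {med:.1f} weeks (IQR {q1:.1f}-{q3:.1f})")
```

Quartile values depend on the interpolation method, so results from other statistical packages may differ slightly from those of `statistics.quantiles`.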
Conclusions: Our data quality framework was implemented successfully on a mastectomy phenotype. Our systematic approach identified data missingness. Moreover, the framework allowed us to differentiate breast-conserving therapy and complete mastectomy subgroups in the All of Us data.