Enhancing the National Cancer Database content using natural language processing and electronic health record data

Christina M. Stuart, Yizhou Fei, Richard D. Schulick, Kathryn L. Colborn, Robert A. Meguid
Surgical Oncology Insight. Published 2024-05-18. DOI: 10.1016/j.soi.2024.100058
Full text: https://www.sciencedirect.com/science/article/pii/S2950247024000677

Abstract


Background

The prevalence of missing data in the National Cancer Database (NCDB) has marked implications for clinical care and research. The objective of this study was to enhance the NCDB by decreasing rates of missingness and adding new variables using automated statistical methodology.

Methods

One health system’s NCDB data from 2011–2021 were linked to electronic health record (EHR) data. Variables with frequent missingness were identified in structured and unstructured EHR data, as were new clinically significant variables not yet included in the NCDB: patient Eastern Cooperative Oncology Group (ECOG) score, specific chemotherapy regimen, American Society of Anesthesiologists Physical Status Classification (ASA class), and discrete surgical procedure. After automated incorporation of structured data from the EHR, a natural language processing tool incorporating rule-based algorithms was designed to further extract variables from unstructured notes. Rates of missingness were compared between the original NCDB and the enhanced dataset, and example multivariable models were run to assess whether model performance changed with reduced missingness and the addition of a new clinically significant variable (chemotherapy regimen).
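
The abstract does not describe the rules used by the extraction tool. As an illustration only, the following minimal sketch shows what a rule-based extractor for ECOG score and ASA class could look like, and how it might fill a value only when the structured EHR field is empty. The regular expressions, the function names (`extract_ecog`, `extract_asa`, `fill_from_note`), and the record layout are all assumptions made for this example, not the authors' implementation.

```python
import re
from typing import Optional

# Hypothetical patterns: real clinical notes vary far more than this.
ECOG_PATTERN = re.compile(
    r"\bECOG(?:\s+(?:performance\s+status|PS|score))?\s*(?:is|of|=|:)?\s*([0-4])\b",
    re.IGNORECASE,
)
ASA_PATTERN = re.compile(
    r"\bASA(?:\s+(?:class|physical\s+status|PS))?\s*(?:is|of|=|:)?\s*(III|II|IV|I|V|[1-5])\b",
    re.IGNORECASE,
)
ROMAN_TO_INT = {"I": 1, "II": 2, "III": 3, "IV": 4, "V": 5}


def extract_ecog(note: str) -> Optional[int]:
    """Return the first ECOG score (0-4) mentioned in the note, if any."""
    match = ECOG_PATTERN.search(note)
    return int(match.group(1)) if match else None


def extract_asa(note: str) -> Optional[int]:
    """Return the first ASA class (1-5) mentioned in the note, if any."""
    match = ASA_PATTERN.search(note)
    if match is None:
        return None
    token = match.group(1).upper()
    # Accept either Roman numerals ("III") or Arabic digits ("3").
    return ROMAN_TO_INT[token] if token in ROMAN_TO_INT else int(token)


def fill_from_note(record: dict, note: str) -> dict:
    """Fill ECOG/ASA from free text only where the structured value is missing."""
    enhanced = dict(record)  # leave the caller's record untouched
    if enhanced.get("ecog") is None:
        enhanced["ecog"] = extract_ecog(note)
    if enhanced.get("asa_class") is None:
        enhanced["asa_class"] = extract_asa(note)
    return enhanced


if __name__ == "__main__":
    note = "Pt seen in clinic. ECOG performance status: 1. ASA class III."
    print(fill_from_note({"ecog": None, "asa_class": None}, note))
```

A sketch this simple takes the first mention it finds and ignores negation, ranges such as "ECOG 0-1", and note dates; a production tool would need rules for each of those cases.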

Results

A total of 6050 patients with NCDB records were linked to their EHR data. Prior to enhancement, rates of missingness for key variables ranged from 2.0% to 5.3%. Following dataset enhancement, missingness was significantly reduced, with relative reductions ranging from 31.9% to 68.0%. Of the new variables added, 1367 (22.6%) of 6050 patients gained an ECOG score, and 1099 (57.8%) of the 1901 who received chemotherapy gained their chemotherapy regimen. Of the 2989 who underwent surgery, 979 (32.8%) gained their procedure name and 621 (20.8%) gained ASA class. Comparison of the multivariable models demonstrated significant differences between the original NCDB and the enhanced dataset. Specifically, when the binary predictor for chemotherapy in the original NCDB data was replaced with discrete regimens, the effect of ethnicity diminished and the effect of radiation became significant.
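
The percentages above follow directly from the reported counts, and the relative reduction in missingness is presumably the drop in the missingness rate divided by the rate before enhancement. The short sketch below recomputes the reported percentages and shows that formula; the function names are illustrative, and the before/after rates passed to `relative_reduction` are hypothetical, since the abstract does not report post-enhancement rates for individual variables.

```python
def percent(count: int, total: int) -> float:
    """Percentage of `total` represented by `count`, rounded to one decimal."""
    return round(100 * count / total, 1)


def relative_reduction(rate_before: float, rate_after: float) -> float:
    """Relative reduction in missingness: (before - after) / before."""
    if rate_before <= 0:
        raise ValueError("rate_before must be positive")
    return (rate_before - rate_after) / rate_before


if __name__ == "__main__":
    # Counts taken from the Results paragraph.
    print(percent(1367, 6050))  # ECOG score gained, all linked patients
    print(percent(1099, 1901))  # regimen gained, chemotherapy recipients
    print(percent(979, 2989))   # procedure name gained, surgical patients
    print(percent(621, 2989))   # ASA class gained, surgical patients

    # Hypothetical rates, for illustration only: 5.0% missing before, 2.5% after.
    print(relative_reduction(5.0, 2.5))
```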

Discussion

We applied statistical methodology to reduce rates of missingness in existing variables and add new variables to enrich the NCDB. While further refinement is needed to decrease missingness in the new variables, this automated methodology can replace or augment manual chart review and improve the ability to use the NCDB to study unanswered questions, leading to clinical advancements in oncology.
