使用语言模型从非结构化肿瘤学笔记中提取和推算东部合作肿瘤学组的表现状态。

IF 3.3 Q2 ONCOLOGY JCO Clinical Cancer Informatics Pub Date : 2024-05-01 DOI:10.1200/CCI.23.00269
Wenxin Xu, Bowen Gu, William E Lotter, Kenneth L Kehl
{"title":"使用语言模型从非结构化肿瘤学笔记中提取和推算东部合作肿瘤学组的表现状态。","authors":"Wenxin Xu, Bowen Gu, William E Lotter, Kenneth L Kehl","doi":"10.1200/CCI.23.00269","DOIUrl":null,"url":null,"abstract":"<p><strong>Purpose: </strong>Eastern Cooperative Oncology Group (ECOG) performance status (PS) is a key clinical variable for cancer treatment and research, but it is usually only recorded in unstructured form in the electronic health record. We investigated whether natural language processing (NLP) models can impute ECOG PS using unstructured note text.</p><p><strong>Materials and methods: </strong>Medical oncology notes were identified from all patients with cancer at our center from 1997 to 2023 and divided at the patient level into training (approximately 80%), tuning/validation (approximately 10%), and test (approximately 10%) sets. Regular expressions were used to extract explicitly documented PS. Extracted PS labels were used to train NLP models to impute ECOG PS (0-1 <i>v</i> 2-4) from the remainder of the notes (with regular expression-extracted PS documentation removed). We assessed associations between imputed PS and overall survival (OS).</p><p><strong>Results: </strong>ECOG PS was extracted using regular expressions from 495,862 notes, corresponding to 79,698 patients. A Transformer-based Longformer model imputed PS with high discrimination (test set area under the receiver operating characteristic curve 0.95, area under the precision-recall curve 0.73). Imputed poor PS was associated with worse OS, including among notes with no explicit documentation of PS detected (OS hazard ratio, 11.9; 95% CI, 11.1 to 12.8).</p><p><strong>Conclusion: </strong>NLP models can be used to impute performance status from unstructured oncologist notes at scale. This may aid the annotation of oncology data sets for clinical outcomes research and cancer care delivery.</p>","PeriodicalId":51626,"journal":{"name":"JCO Clinical Cancer Informatics","volume":null,"pages":null},"PeriodicalIF":3.3000,"publicationDate":"2024-05-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Extraction and Imputation of Eastern Cooperative Oncology Group Performance Status From Unstructured Oncology Notes Using Language Models.\",\"authors\":\"Wenxin Xu, Bowen Gu, William E Lotter, Kenneth L Kehl\",\"doi\":\"10.1200/CCI.23.00269\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<p><strong>Purpose: </strong>Eastern Cooperative Oncology Group (ECOG) performance status (PS) is a key clinical variable for cancer treatment and research, but it is usually only recorded in unstructured form in the electronic health record. We investigated whether natural language processing (NLP) models can impute ECOG PS using unstructured note text.</p><p><strong>Materials and methods: </strong>Medical oncology notes were identified from all patients with cancer at our center from 1997 to 2023 and divided at the patient level into training (approximately 80%), tuning/validation (approximately 10%), and test (approximately 10%) sets. Regular expressions were used to extract explicitly documented PS. Extracted PS labels were used to train NLP models to impute ECOG PS (0-1 <i>v</i> 2-4) from the remainder of the notes (with regular expression-extracted PS documentation removed). We assessed associations between imputed PS and overall survival (OS).</p><p><strong>Results: </strong>ECOG PS was extracted using regular expressions from 495,862 notes, corresponding to 79,698 patients. A Transformer-based Longformer model imputed PS with high discrimination (test set area under the receiver operating characteristic curve 0.95, area under the precision-recall curve 0.73). Imputed poor PS was associated with worse OS, including among notes with no explicit documentation of PS detected (OS hazard ratio, 11.9; 95% CI, 11.1 to 12.8).</p><p><strong>Conclusion: </strong>NLP models can be used to impute performance status from unstructured oncologist notes at scale. This may aid the annotation of oncology data sets for clinical outcomes research and cancer care delivery.</p>\",\"PeriodicalId\":51626,\"journal\":{\"name\":\"JCO Clinical Cancer Informatics\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":3.3000,\"publicationDate\":\"2024-05-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"JCO Clinical Cancer Informatics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1200/CCI.23.00269\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q2\",\"JCRName\":\"ONCOLOGY\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"JCO Clinical Cancer Informatics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1200/CCI.23.00269","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"ONCOLOGY","Score":null,"Total":0}
引用次数: 0

摘要

目的:东部合作肿瘤学组(Eastern Cooperative Oncology Group,ECOG)的表现状态(PS)是癌症治疗和研究的一个关键临床变量,但它通常只以非结构化的形式记录在电子健康记录中。我们研究了自然语言处理(NLP)模型能否利用非结构化笔记文本推算 ECOG PS:从 1997 年到 2023 年,我们从中心的所有癌症患者中识别出了肿瘤内科笔记,并在患者层面上将其分为训练集(约占 80%)、调整/验证集(约占 10%)和测试集(约占 10%)。正则表达式用于提取明确记录的 PS。提取的 PS 标签用于训练 NLP 模型,以便从笔记的其余部分(去除正则表达式提取的 PS 文档)推算 ECOG PS(0-1 v 2-4)。我们评估了推算的 PS 与总生存期(OS)之间的关联:结果:使用正则表达式从 495,862 份笔记中提取了 ECOG PS,这些笔记对应于 79,698 名患者。基于变换器的 Longformer 模型以较高的辨别率估算了 PS(接收者操作特征曲线下的测试集面积为 0.95,精确度-召回曲线下的面积为 0.73)。推算出的不良PS与较差的OS有关,包括在没有明确PS检测记录的病例中(OS危险比,11.9;95% CI,11.1至12.8):结论:NLP 模型可用于从非结构化的肿瘤学家笔记中大规模推断患者的表现状态。结论:NLP 模型可以大规模地从非结构化的肿瘤医生笔记中推断患者的表现状态,这将有助于为临床结果研究和癌症治疗提供肿瘤数据集注释。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Extraction and Imputation of Eastern Cooperative Oncology Group Performance Status From Unstructured Oncology Notes Using Language Models.

Purpose: Eastern Cooperative Oncology Group (ECOG) performance status (PS) is a key clinical variable for cancer treatment and research, but it is usually only recorded in unstructured form in the electronic health record. We investigated whether natural language processing (NLP) models can impute ECOG PS using unstructured note text.

Materials and methods: Medical oncology notes were identified from all patients with cancer at our center from 1997 to 2023 and divided at the patient level into training (approximately 80%), tuning/validation (approximately 10%), and test (approximately 10%) sets. Regular expressions were used to extract explicitly documented PS. Extracted PS labels were used to train NLP models to impute ECOG PS (0-1 v 2-4) from the remainder of the notes (with regular expression-extracted PS documentation removed). We assessed associations between imputed PS and overall survival (OS).

Results: ECOG PS was extracted using regular expressions from 495,862 notes, corresponding to 79,698 patients. A Transformer-based Longformer model imputed PS with high discrimination (test set area under the receiver operating characteristic curve 0.95, area under the precision-recall curve 0.73). Imputed poor PS was associated with worse OS, including among notes with no explicit documentation of PS detected (OS hazard ratio, 11.9; 95% CI, 11.1 to 12.8).

Conclusion: NLP models can be used to impute performance status from unstructured oncologist notes at scale. This may aid the annotation of oncology data sets for clinical outcomes research and cancer care delivery.

求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
CiteScore
6.20
自引率
4.80%
发文量
190
期刊最新文献
Increasing Power in Phase III Oncology Trials With Multivariable Regression: An Empirical Assessment of 535 Primary End Point Analyses. Validation of Non-Small Cell Lung Cancer Clinical Insights Using a Generalized Oncology Natural Language Processing Model. Interinstitutional Approach to Advancing Geospatial Technologies for US Cancer Centers. Classification and Regression Trees to Predict for Survival for Patients With Hepatocellular Carcinoma Treated With Atezolizumab and Bevacizumab. Cureit: An End-to-End Pipeline for Implementing Mixture Cure Models With an Application to Liposarcoma Data.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1