De Novo Natural Language Processing Algorithm Accurately Identifies Myxofibrosarcoma From Pathology Reports

IF 4.2, Zone 2 (Medicine), Q1 ORTHOPEDICS, Clinical Orthopaedics and Related Research®, Pub Date: 2024-10-02, DOI: 10.1097/CORR.0000000000003270

Sarah E Lindsay, Cecelia J Madison, Duncan C Ramsey, Yee-Cheen Doung, Kenneth R Gundle


Citations: 0

Abstract



Background: Available codes in the ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, and this can result in an underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because of the availability of all clinical results and pathology reports. In the setting of soft tissue sarcoma, natural language processing (NLP) has the potential to be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a myriad of research questions.

Questions/purposes: (1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?

Methods: All pathology reports (10.7 million) in the national VA corporate data warehouse were identified from 2003 to 2022. Using the word-search functionality, reports from 403 veterans were found to contain the term "myxofibrosarcoma." The resulting pathology reports were manually reviewed to develop a gold-standard cohort that contained only those veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.
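
The Methods name confounders, negation, and emphasis terms but do not list them. A minimal sketch of how a rule-based classifier of this kind and its evaluation against a manually reviewed gold standard might look is given here. The term lists, function names, and sample reports are hypothetical illustrations, not the authors' published algorithm (which is in the GitHub repository cited in the Conclusion).

```python
import re

# Hypothetical term lists for illustration only; the abstract does not give
# the study's actual confounder, negation, or emphasis terms.
TARGET = re.compile(r"myxofibrosarcoma", re.IGNORECASE)
NEGATION = re.compile(
    r"\b(?:no evidence of|negative for|not consistent with|rule out)\b",
    re.IGNORECASE,
)
CONFOUNDER = re.compile(
    r"\b(?:history of|differential diagnosis includes)\b", re.IGNORECASE
)
EMPHASIS = re.compile(
    r"\b(?:final diagnosis|consistent with|confirmed)\b", re.IGNORECASE
)


def classify_report(text: str) -> bool:
    """Return True if any sentence appears to assert myxofibrosarcoma."""
    for sentence in re.split(r"[.;\n]+", text):
        if not TARGET.search(sentence):
            continue
        if NEGATION.search(sentence):
            continue  # the term is mentioned but negated
        if EMPHASIS.search(sentence):
            return True  # an emphasis term outweighs a confounder
        if CONFOUNDER.search(sentence):
            continue  # the term is mentioned but not as the diagnosis
        return True
    return False


def ratio(numerator: int, denominator: int) -> float:
    """Divide safely, returning NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")


def diagnostic_metrics(predicted, truth):
    """Compare predictions with gold-standard labels."""
    pairs = list(zip(predicted, truth))
    tp = sum(p and t for p, t in pairs)
    tn = sum(not p and not t for p, t in pairs)
    fp = sum(p and not t for p, t in pairs)
    fn = sum(not p and t for p, t in pairs)
    return {
        "sensitivity": ratio(tp, tp + fn),
        "specificity": ratio(tn, tn + fp),
        "ppv": ratio(tp, tp + fp),
        "npv": ratio(tn, tn + fn),
        "accuracy": ratio(tp + tn, len(pairs)),
    }


# Made-up reports paired with made-up gold-standard labels.
reports = [
    ("Final diagnosis: myxofibrosarcoma, high grade.", True),
    ("Negative for myxofibrosarcoma; benign lipoma.", False),
    ("Differential diagnosis includes myxofibrosarcoma. "
     "Final diagnosis: myxoid liposarcoma.", False),
    ("Soft tissue mass, left thigh: myxofibrosarcoma.", True),
    ("Findings raise concern for myxofibrosarcoma.", False),
]

predicted = [classify_report(text) for text, _ in reports]
truth = [label for _, label in reports]
metrics = diagnostic_metrics(predicted, truth)
print(metrics)
```

The last sample report is deliberately one the toy rules get wrong (a false positive), which is the kind of case that iterative refinement of the term lists is meant to catch.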

Results: The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm more accurately (92% [276 of 300]) identified patients with myxofibrosarcoma compared with ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.
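
As a quick arithmetic check, the percentages reported in the Results can be re-derived from the stated counts. The dictionary keys below are labels chosen for this sketch. Reading 300 of 403 as the share of word-search hits confirmed on manual review is an interpretation of the abstract, not a definition taken from it.

```python
# Counts as (numerator, denominator), copied from the Results paragraph.
counts = {
    "missing_sarcoma_icd_code": (81, 300),
    "identified_by_icd_codes": (219, 300),
    "identified_by_nlp": (276, 300),
    "word_search_hits_confirmed": (300, 403),
}

# Round to whole percentages, matching the precision of the abstract.
percent = {name: round(100 * n / d) for name, (n, d) in counts.items()}

for name, value in percent.items():
    n, d = counts[name]
    print(f"{name}: {n} of {d} = {value}%")

# Patients with and without a sarcoma ICD code should add up to the cohort.
cohort_size = 300
coded_plus_uncoded = 219 + 81

# Word-search hits that manual review did not confirm.
unconfirmed_hits = 403 - 300
```

The counts are internally consistent: 219 coded plus 81 uncoded patients make up the 300-patient gold-standard cohort, and 103 of the 403 word-search hits were not confirmed on manual review.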

Conclusion: An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, which is an improvement over ICD-based cohort creation and simple word search. This algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) and is available to facilitate external validation and improvement through testing in other cohorts.

Level of evidence: Level II, diagnostic study.


Source journal

Clinical Orthopaedics and Related Research® (Medicine, Surgery)

CiteScore: 7.00
Self-citation rate: 11.90%
Articles published: 722
Review time: 2.5 months

About the journal: Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge. CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.