Sarah E Lindsay, Cecelia J Madison, Duncan C Ramsey, Yee-Cheen Doung, Kenneth R Gundle
{"title":"De Novo Natural Language Processing Algorithm Accurately Identifies Myxofibrosarcoma From Pathology Reports.","authors":"Sarah E Lindsay, Cecelia J Madison, Duncan C Ramsey, Yee-Cheen Doung, Kenneth R Gundle","doi":"10.1097/CORR.0000000000003270","DOIUrl":null,"url":null,"abstract":"<p><strong>Background: </strong>Available codes in the ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, and this can result in an underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because of the availability of all clinical results and pathology reports. In the setting of soft tissue sarcoma, natural language processing (NLP) has the potential to be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a myriad of research questions.</p><p><strong>Questions/purposes: </strong>(1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?</p><p><strong>Methods: </strong>All pathology reports (10.7 million) in the national VA corporate data warehouse were identified from 2003 to 2022. Using the word-search functionality, reports from 403 veterans were found to contain the term \"myxofibrosarcoma.\" The resulting pathology reports were manually reviewed to develop a gold-standard cohort that contained only those veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.</p><p><strong>Results: </strong>The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm more accurately (92% [276 of 300]) identified patients with myxofibrosarcoma compared with ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.</p><p><strong>Conclusion: </strong>An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, which is an improvement over ICD-based cohort creation and simple word search. This algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) and is available to facilitate external validation and improvement through testing in other cohorts.</p><p><strong>Level of evidence: </strong>Level II, diagnostic study.</p>","PeriodicalId":10404,"journal":{"name":"Clinical Orthopaedics and Related Research®","volume":" ","pages":""},"PeriodicalIF":4.2000,"publicationDate":"2024-10-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Clinical Orthopaedics and Related Research®","FirstCategoryId":"3","ListUrlMain":"https://doi.org/10.1097/CORR.0000000000003270","RegionNum":2,"RegionCategory":"医学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ORTHOPEDICS","Score":null,"Total":0}
引用次数: 0
Abstract
Background: Available codes in the ICD-10 do not accurately reflect soft tissue sarcoma diagnoses, and this can result in an underrepresentation of soft tissue sarcoma in databases. The National VA Database provides a unique opportunity for soft tissue sarcoma investigation because of the availability of all clinical results and pathology reports. In the setting of soft tissue sarcoma, natural language processing (NLP) has the potential to be applied to clinical documents such as pathology reports to identify soft tissue sarcoma independent of ICD codes, allowing sarcoma researchers to build more comprehensive databases capable of answering a myriad of research questions.
Questions/purposes: (1) What proportion of patients with myxofibrosarcoma within the National VA Database would be missed by searching only by soft tissue sarcoma ICD codes? (2) Is a de novo NLP algorithm capable of analyzing pathology reports to accurately identify patients with myxofibrosarcoma?
Methods: All pathology reports (10.7 million) in the national VA corporate data warehouse were identified from 2003 to 2022. Using the word-search functionality, reports from 403 veterans were found to contain the term "myxofibrosarcoma." The resulting pathology reports were manually reviewed to develop a gold-standard cohort that contained only those veterans with pathologist-confirmed myxofibrosarcoma diagnoses. The cohort had a mean ± SD age of 70 ± 12 years, and 96% (287 of 300) were men. Diagnosis codes were abstracted, and differences in appropriate ICD coding were compared. An NLP algorithm was iteratively refined and tested using confounders, negation, and emphasis terms for myxofibrosarcoma. Sensitivity, specificity, positive predictive value (PPV), negative predictive value (NPV), and accuracy were calculated for the NLP-generated cohorts through comparison with the manually reviewed gold-standard cohorts.
Results: The records of 27% (81 of 300) of myxofibrosarcoma patients within the VA database were missing a sarcoma ICD code. A de novo NLP algorithm more accurately (92% [276 of 300]) identified patients with myxofibrosarcoma compared with ICD codes (73% [219 of 300]) or basic word searches (74% [300 of 403]) (p < 0.001). Three final algorithm models were generated with accuracies ranging from 92% to 100%.
Conclusion: An NLP algorithm can identify patients with myxofibrosarcoma from pathology reports with high accuracy, which is an improvement over ICD-based cohort creation and simple word search. This algorithm is freely available on GitHub (https://github.com/sarcoma-shark/myxofibrosarcoma-shark) and is available to facilitate external validation and improvement through testing in other cohorts.
期刊介绍:
Clinical Orthopaedics and Related Research® is a leading peer-reviewed journal devoted to the dissemination of new and important orthopaedic knowledge.
CORR® brings readers the latest clinical and basic research, along with columns, commentaries, and interviews with authors.