Development and validation of a rheumatoid arthritis case definition: a machine learning approach using data from primary care electronic medical records.

BMC Medical Informatics and Decision Making (IF 3.3; Zone 3, Medicine; JCR Q2, Medical Informatics). Pub Date: 2024-11-27. DOI: 10.1186/s12911-024-02776-w
Anh N Q Pham, Claire E H Barber, Neil Drummond, Lisa Jasper, Doug Klein, Cliff Lindeman, Jessica Widdifield, Tyler Williamson, C Allyson Jones
BMC Medical Informatics and Decision Making 24(1):360. Publication type: Journal Article.

Abstract

Background: Rheumatoid Arthritis (RA) is a chronic inflammatory disease that is primarily diagnosed and managed by rheumatologists; however, it is often primary care providers who first encounter RA-related symptoms. This study developed and validated a case definition for RA using national surveillance data in primary care settings.

Methods: This cross-sectional validation study used structured electronic medical record (EMR) data from the Canadian Primary Care Sentinel Surveillance Network (CPCSSN). Using a reference set generated from EMR reviews by five experts, three machine learning steps were applied: a 'bag-of-words' approach to feature generation; feature reduction using a feature importance measure coupled with recursive feature elimination and clustering; and classification using tree-based methods (Decision Tree, Random Forest, and Extreme Gradient Boosting [XGBoost]). The three tree-based algorithms were compared to identify the procedure that generated the optimal evaluation metrics. Nested cross-validation was used so that models could be tuned, evaluated, and compared simultaneously.
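
The feature-generation and nested cross-validation steps can be sketched as follows. This is a minimal, standard-library-only illustration and not the authors' pipeline: the function names (`bag_of_words`, `k_folds`, `nested_cv`), the toy charts, and the labels are all hypothetical, and a simple count threshold on one token stands in for the tree-based classifiers used in the study.

```python
import re
from collections import Counter

def bag_of_words(text):
    """Count lowercase alphabetic tokens in one chart's text."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

def k_folds(n, k):
    """Split indices 0..n-1 into k interleaved folds (no shuffling)."""
    return [list(range(i, n, k)) for i in range(k)]

def accuracy(counts, labels, idx, threshold):
    """Share of charts in idx where (count >= threshold) matches the label."""
    hits = sum((counts[i] >= threshold) == labels[i] for i in idx)
    return hits / len(idx)

def nested_cv(counts, labels, thresholds, outer_k=3, inner_k=2):
    """Outer folds estimate performance; inner folds choose the threshold."""
    outer_scores = []
    for test_idx in k_folds(len(labels), outer_k):
        held_out = set(test_idx)
        train_idx = [i for i in range(len(labels)) if i not in held_out]

        # Inner loop: score each candidate threshold on training charts only.
        # A threshold rule has nothing to fit, so each inner fold just scores it.
        def inner_score(t):
            folds = k_folds(len(train_idx), inner_k)
            return sum(
                accuracy(counts, labels, [train_idx[j] for j in fold], t)
                for fold in folds
            ) / inner_k

        best_t = max(thresholds, key=inner_score)
        # The held-out outer fold is touched only once, after tuning.
        outer_scores.append(accuracy(counts, labels, test_idx, best_t))
    return sum(outer_scores) / len(outer_scores)

# Hypothetical toy charts and expert labels (True = RA case).
charts = [
    "rheumatoid arthritis follow-up; rheumatoid factor positive",
    "osteoarthritis of knee",
    "rheumatoid arthritis, seropositive rheumatoid",
    "hypertension review",
    "query rheumatoid arthritis",
    "rheumatoid arthritis; rheumatoid nodules",
]
labels = [True, False, True, False, False, True]

counts = [bag_of_words(c)["rheumatoid"] for c in charts]
mean_acc = nested_cv(counts, labels, thresholds=[1, 2, 3])
print(counts, round(mean_acc, 3))
```

The point of the nesting is that the charts used to score a tuned model never take part in choosing its hyperparameter, which is what allows tuning and evaluation to happen in the same procedure.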

Results: Of 1.3 million patients from seven Canadian provinces, 5,600 people aged 19+ were randomly selected. The XGBoost classification method produced the optimal algorithm for identifying RA cases. Based on the feature importance scores in the XGBoost output, a human-readable case definition was created: a patient is identified as an RA case when the text "rheumatoid" occurs at least twice in any billing, encounter diagnosis, or health condition table of the patient chart. The final case definition had a sensitivity of 81.6% (95% CI, 75.6-86.4), a specificity of 98.0% (95% CI, 97.4-98.5), a positive predictive value of 76.3% (95% CI, 70.1-81.5), and a negative predictive value of 98.6% (95% CI, 98.0-98.6).
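
The case definition and the four validity metrics can be illustrated with a short sketch. Several details here are assumptions rather than facts from the study: mentions are pooled across the three tables, matching is a case-insensitive substring count, and the names `is_ra_case` and `validity_metrics`, the record layout, and the toy patients are hypothetical. Confidence intervals are omitted because the abstract does not state how they were computed.

```python
def is_ra_case(billing, encounter_dx, health_conditions, min_mentions=2):
    """Flag a chart when 'rheumatoid' appears at least min_mentions times.

    Assumption: occurrences are pooled across the three tables.
    """
    entries = list(billing) + list(encounter_dx) + list(health_conditions)
    mentions = sum(entry.lower().count("rheumatoid") for entry in entries)
    return mentions >= min_mentions

def _ratio(numerator, denominator):
    """Return numerator/denominator, or NaN when the denominator is zero."""
    return numerator / denominator if denominator else float("nan")

def validity_metrics(predicted, reference):
    """Compute sensitivity, specificity, PPV, and NPV from paired booleans."""
    pairs = list(zip(predicted, reference))
    tp = sum(p and r for p, r in pairs)
    fp = sum(p and not r for p, r in pairs)
    fn = sum(not p and r for p, r in pairs)
    tn = sum(not p and not r for p, r in pairs)
    return {
        "sensitivity": _ratio(tp, tp + fn),  # true cases that were flagged
        "specificity": _ratio(tn, tn + fp),  # non-cases that were not flagged
        "ppv": _ratio(tp, tp + fp),          # flagged charts that are true cases
        "npv": _ratio(tn, tn + fn),          # unflagged charts that are non-cases
    }

# Hypothetical toy charts; "ref" is the expert-review label.
patients = [
    {"billing": ["Rheumatoid arthritis"],
     "encounter_dx": ["rheumatoid arthritis, unspecified"],
     "health_conditions": [], "ref": True},
    {"billing": ["Osteoarthritis"], "encounter_dx": [],
     "health_conditions": ["rheumatoid arthritis"], "ref": True},
    {"billing": ["Hypertension"], "encounter_dx": ["Diabetes"],
     "health_conditions": [], "ref": False},
    {"billing": ["rule out rheumatoid arthritis", "rheumatoid factor test"],
     "encounter_dx": [], "health_conditions": [], "ref": False},
    {"billing": [], "encounter_dx": ["Asthma"],
     "health_conditions": [], "ref": False},
]

predicted = [
    is_ra_case(p["billing"], p["encounter_dx"], p["health_conditions"])
    for p in patients
]
metrics = validity_metrics(predicted, [p["ref"] for p in patients])
print(predicted, metrics)
```

In the toy data, the second patient shows how a true case with a single mention is missed (lower sensitivity), and the fourth shows how "rule out" wording can trigger a false positive (lower positive predictive value).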

Conclusion: A case definition for RA using primary care EMR data was developed based on the XGBoost algorithm. With high validity metrics, this case definition is expected to be a reliable tool for future epidemiological research and surveillance investigating the management of RA in the CPCSSN dataset.

Source journal: BMC Medical Informatics and Decision Making
CiteScore: 7.20; self-citation rate: 5.70%; articles published: 297; review time: 1 month
About the journal: BMC Medical Informatics and Decision Making is an open access journal publishing original peer-reviewed research articles in relation to the design, development, implementation, use, and evaluation of health information technologies and decision-making for human health.