Data Mining Approach to Identify Disease Cohorts from Primary Care Electronic Medical Records: A Case of Diabetes Mellitus

Open Bioinformatics Journal (Q3, Computer Science) | Pub Date: 2017-12-12 | DOI: 10.2174/1875036201710010016

Ebenezer S. Owusu Adjah, O. Montvida, Julius Agbeve, S. Paul

Volume 10(1), pages 16-27

Citations: 22

Abstract

Background: Identifying patients with a disease from primary-care-based electronic medical records (EMRs) poses methodological challenges that may affect epidemiologic inferences. Objective: To compare deterministic, clinically guided selection algorithms with probabilistic machine learning (ML) methodologies for their ability to identify patients with type 2 diabetes mellitus (T2DM) from large, population-based EMRs in a nationally representative primary care database. Methods: Four cohorts of patients with T2DM were defined by a deterministic approach based on disease codes. The database was mined for a set of best predictors of T2DM, and the performance of six ML algorithms was compared based on cross-validated true positive rate, true negative rate, and area under the receiver operating characteristic curve. Results: In the database of 11,018,025 research-suitable individuals, 379,657 (3.4%) were coded as having T2DM. The logistic regression classifier was selected as the best ML algorithm and resulted in a cohort of 383,330 patients with potential T2DM. Eighty-three percent of this cohort had a T2DM code, and 16% of the patients with a T2DM code were not included in this ML cohort. Of those in the ML cohort without a disease code, 52% had at least one measure of elevated glucose level and 22% had received at least one prescription for antidiabetic medication. Conclusion: Deterministic cohort selection based on disease coding potentially introduces a significant misclassification problem. ML techniques allow testing for potential disease predictors and, given meaningful data input, are able to identify diseased cohorts in a holistic way.
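The three evaluation metrics named in the Methods can be illustrated with a minimal sketch. This is not the authors' code: the labels, scores, the 0.5 decision threshold, and the function names `tpr_tnr` and `auc` are all hypothetical, chosen only to show how true positive rate, true negative rate, and area under the ROC curve are computed. In a cross-validated setting, these quantities would typically be computed on each held-out fold and then averaged.

```python
from typing import Sequence, Tuple


def tpr_tnr(y_true: Sequence[int], y_pred: Sequence[int]) -> Tuple[float, float]:
    """Return (true positive rate, true negative rate) for binary labels."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    return tp / (tp + fn), tn / (tn + fp)


def auc(y_true: Sequence[int], scores: Sequence[float]) -> float:
    """Area under the ROC curve via the rank (Mann-Whitney) formulation:
    the fraction of positive/negative pairs in which the positive case
    receives the higher score, with ties counted as half."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: 1 = patient has T2DM, 0 = does not.
y_true = [1, 1, 1, 0, 0, 0, 0, 1]
# Hypothetical predicted probabilities from some classifier.
scores = [0.9, 0.8, 0.4, 0.3, 0.6, 0.1, 0.2, 0.7]
# Assumed decision threshold of 0.5 (not taken from the paper).
y_pred = [1 if s >= 0.5 else 0 for s in scores]

sensitivity, specificity = tpr_tnr(y_true, y_pred)
roc_auc = auc(y_true, scores)
print(f"TPR={sensitivity:.2f} TNR={specificity:.2f} AUC={roc_auc:.4f}")
```

The pairwise AUC calculation is quadratic in the number of records, so it suits small illustrations only; on a database of millions of patients a sort-based implementation would be the practical choice.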

Source journal

Open Bioinformatics Journal Computer Science-Computer Science (miscellaneous)

CiteScore

2.40

Self-citation rate

0.00%

About the journal: The Open Bioinformatics Journal is an Open Access online journal that publishes research articles, reviews/mini-reviews, letters, clinical trial studies, and guest-edited single-topic issues in all areas of bioinformatics and computational biology. The coverage includes biomedicine, focusing on large data acquisition, analysis and curation; computational and statistical methods for the modeling and analysis of biological data; and descriptions of new algorithms and databases. The Open Bioinformatics Journal, a peer-reviewed journal, is an important and reliable source of current information on developments in the field. The emphasis is on publishing quality articles rapidly and making them freely available worldwide.