Machine learning based study for the classification of Type 2 diabetes mellitus subtypes.

Impact factor: 6.1 | Zone 3 (Biology) | JCR Q1, Mathematical & Computational Biology | Biodata Mining, vol. 16, no. 1, art. 24 | Pub Date: 2023-08-22 | DOI: 10.1186/s13040-023-00340-2

Nelson E Ordoñez-Guillen, Jose Luis Gonzalez-Compean, Ivan Lopez-Arevalo, Miguel Contreras-Murillo, Edwin Aldana-Bobadilla


Abstract

Purpose: Data-driven diabetes research has become increasingly interested in the heterogeneity of the disease, aiming to support the development of more specific prognoses and treatments within so-called precision medicine. Recently, one such study identified five diabetes subgroups with varying risks of complications and treatment responses. Here, we develop and assess different models for classifying Type 2 Diabetes Mellitus (T2DM) subtypes through machine learning approaches, with the aim of providing a performance comparison and new insights on the matter.

Methods: We developed a three-stage methodology. In the first stage, we preprocessed the public databases NHANES (USA) and ENSANUT (Mexico) to construct a dataset of N = 10,077 adult diabetes patient records. We used N = 2,768 records for training and validation of models and held out the remaining N = 7,309 for testing. In the second stage, we identified groups of observations, each representing a T2DM subtype. We tested different clustering techniques and strategies and validated them with internal and external clustering indices, obtaining two annotated datasets, Dset A and Dset B. In the third stage, we developed classification models, evaluating four algorithms, seven input-data schemes, and two validation settings on each annotated dataset. We also tested the resulting models using a majority-vote approach to classify unseen patient records in the hold-out dataset.
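The majority-vote step can be illustrated with a minimal sketch. The abstract does not say which models vote, how ties are broken, or what the subtype labels are, so the function names, the `subtype_*` labels, and the first-occurrence tie-break below are all assumptions made for illustration, not the authors' implementation.

```python
from collections import Counter
from typing import Hashable, List, Sequence


def majority_vote(votes: Sequence[Hashable]) -> Hashable:
    """Return the most frequent label among the votes cast for one record.

    Ties go to the label that occurs first in `votes`; this tie-break is
    an assumption, since the abstract does not state one.
    """
    if not votes:
        raise ValueError("at least one vote is required")
    counts = Counter(votes)
    top = max(counts.values())
    # Scan in the original order so that ties resolve deterministically.
    return next(label for label in votes if counts[label] == top)


def vote_per_record(per_model: Sequence[Sequence[Hashable]]) -> List[Hashable]:
    """Combine predictions given as one sequence per model, aligned by record."""
    if not per_model:
        raise ValueError("at least one model is required")
    if len({len(preds) for preds in per_model}) != 1:
        raise ValueError("every model must predict the same number of records")
    # zip(*per_model) regroups the predictions record by record.
    return [majority_vote(votes) for votes in zip(*per_model)]


# Hypothetical predictions from three models for four hold-out records.
model_1 = ["subtype_1", "subtype_2", "subtype_3", "subtype_1"]
model_2 = ["subtype_1", "subtype_3", "subtype_3", "subtype_2"]
model_3 = ["subtype_2", "subtype_2", "subtype_3", "subtype_3"]

consensus = vote_per_record([model_1, model_2, model_3])
print(consensus)
```

The last record receives three different votes, so its consensus label depends entirely on the assumed tie-break; a real pipeline would need an explicit rule for that case (for example, deferring to the most accurate model).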

Results: In the bootstrap validation, run independently for Dset A and Dset B, mean accuracies across all seven data schemes were [Formula: see text] ([Formula: see text]) and [Formula: see text] ([Formula: see text]), respectively. The best accuracies were [Formula: see text] and [Formula: see text]. Results from the two validation settings were consistent. For the hold-out dataset, the class proportions were consistent with most of those reported in the literature.
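As a rough illustration of how a bootstrap accuracy estimate of this kind can be computed, the sketch below uses an out-of-bag scheme: each round trains on a resample drawn with replacement and scores the records left out of it. The abstract does not specify the bootstrap variant, the number of rounds, or the classifiers, so the out-of-bag design, the toy one-dimensional nearest-centroid classifier, and the synthetic data are all assumptions.

```python
import random
import statistics
from typing import Callable, Dict, List, Sequence, Tuple

Predictor = Callable[[float], str]


def fit_nearest_centroid(xs: Sequence[float], ys: Sequence[str]) -> Predictor:
    """Toy 1-D classifier (illustrative only): pick the closest class mean."""
    groups: Dict[str, List[float]] = {}
    for x, label in zip(xs, ys):
        groups.setdefault(label, []).append(x)
    centroids = {label: statistics.mean(vals) for label, vals in groups.items()}

    def predict(x: float) -> str:
        return min(centroids, key=lambda label: abs(x - centroids[label]))

    return predict


def bootstrap_accuracy(
    xs: Sequence[float],
    ys: Sequence[str],
    fit: Callable[[Sequence[float], Sequence[str]], Predictor],
    n_rounds: int = 100,
    seed: int = 0,
) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) of out-of-bag accuracy."""
    rng = random.Random(seed)
    n = len(xs)
    scores: List[float] = []
    for _ in range(n_rounds):
        # Draw n indices with replacement: the in-bag training sample.
        bag = [rng.randrange(n) for _ in range(n)]
        in_bag = set(bag)
        # Records never drawn form the out-of-bag evaluation set.
        oob = [i for i in range(n) if i not in in_bag]
        if not oob:
            continue  # nothing to evaluate on in this round
        predict = fit([xs[i] for i in bag], [ys[i] for i in bag])
        hits = sum(predict(xs[i]) == ys[i] for i in oob)
        scores.append(hits / len(oob))
    if len(scores) < 2:
        raise ValueError("too few usable rounds to estimate a spread")
    return statistics.mean(scores), statistics.stdev(scores)


# Synthetic, well-separated two-class data (not taken from the study).
xs = [1.0 + 0.01 * i for i in range(20)] + [5.0 + 0.01 * i for i in range(20)]
ys = ["a"] * 20 + ["b"] * 20

mean_acc, sd_acc = bootstrap_accuracy(xs, ys, fit_nearest_centroid)
print(f"accuracy: {mean_acc:.3f} (SD {sd_acc:.3f})")
```

Passing `fit` as a callable keeps the resampling loop independent of the model, so the same loop could in principle be reused for each algorithm and input-data scheme being compared.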

Conclusion: Developing machine learning systems for the classification of diabetes subtypes is an important task for supporting physicians in fast and timely decision-making. We expect to deploy this methodology in a data analysis platform to conduct studies that identify T2DM subtypes in hospital patient records.



Source journal

Biodata Mining (Mathematical & Computational Biology)

CiteScore: 7.90

Self-citation rate: 0.00%

Articles published: not given

Review time: 23 weeks

About the journal: BioData Mining is an open access, open peer-reviewed journal encompassing research on all aspects of data mining applied to high-dimensional biological and biomedical data, focusing on computational aspects of knowledge discovery from large-scale genetic, transcriptomic, genomic, proteomic, and metabolomic data. Topical areas include, but are not limited to:
- Development, evaluation, and application of novel data mining and machine learning algorithms.
- Adaptation, evaluation, and application of traditional data mining and machine learning algorithms.
- Open-source software for the application of data mining and machine learning algorithms.
- Design, development, and integration of databases, software, and web services for the storage, management, retrieval, and analysis of data from large-scale studies.
- Pre-processing, post-processing, modeling, and interpretation of data mining and machine learning results for biological interpretation and knowledge discovery.