Mohammad Mihrab Chowdhury , Ragib Shahariar Ayon , Md Sakhawat Hossain
{"title":"利用类不平衡 BRFSS 数据集研究用于糖尿病诊断的机器学习算法和数据增强技术","authors":"Mohammad Mihrab Chowdhury , Ragib Shahariar Ayon , Md Sakhawat Hossain","doi":"10.1016/j.health.2023.100297","DOIUrl":null,"url":null,"abstract":"<div><p>Diabetes is a prevalent chronic condition that poses significant challenges to early diagnosis and identifying at-risk individuals. Machine learning plays a crucial role in diabetes detection by leveraging its ability to process large volumes of data and identify complex patterns. However, imbalanced data, where the number of diabetic cases is substantially smaller than non-diabetic cases, complicates the identification of individuals with diabetes using machine learning algorithms. This study focuses on predicting whether a person is at risk of diabetes, considering the individual’s health and socio-economic conditions while mitigating the challenges posed by imbalanced data. We employ several data augmentation techniques, such as oversampling (Synthetic Minority Over Sampling for Nominal Data, i.e.SMOTE-N), undersampling (Edited Nearest Neighbor, i.e. ENN), and hybrid sampling techniques (SMOTE-Tomek and SMOTE-ENN) on training data before applying machine learning algorithms to minimize the impact of imbalanced data. Our study sheds light on the significance of carefully utilizing data augmentation techniques without any data leakage to enhance the effectiveness of machine learning algorithms. Moreover, it offers a complete machine learning structure for healthcare practitioners, from data obtaining to machine learning prediction, enabling them to make informed decisions.</p></div>","PeriodicalId":73222,"journal":{"name":"Healthcare analytics (New York, N.Y.)","volume":"5 ","pages":"Article 100297"},"PeriodicalIF":0.0000,"publicationDate":"2023-12-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S2772442523001648/pdfft?md5=cbb15d1b9b72127ef6f0b213ad40bae0&pid=1-s2.0-S2772442523001648-main.pdf","citationCount":"0","resultStr":"{\"title\":\"An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset\",\"authors\":\"Mohammad Mihrab Chowdhury , Ragib Shahariar Ayon , Md Sakhawat Hossain\",\"doi\":\"10.1016/j.health.2023.100297\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"<div><p>Diabetes is a prevalent chronic condition that poses significant challenges to early diagnosis and identifying at-risk individuals. Machine learning plays a crucial role in diabetes detection by leveraging its ability to process large volumes of data and identify complex patterns. However, imbalanced data, where the number of diabetic cases is substantially smaller than non-diabetic cases, complicates the identification of individuals with diabetes using machine learning algorithms. This study focuses on predicting whether a person is at risk of diabetes, considering the individual’s health and socio-economic conditions while mitigating the challenges posed by imbalanced data. We employ several data augmentation techniques, such as oversampling (Synthetic Minority Over Sampling for Nominal Data, i.e.SMOTE-N), undersampling (Edited Nearest Neighbor, i.e. ENN), and hybrid sampling techniques (SMOTE-Tomek and SMOTE-ENN) on training data before applying machine learning algorithms to minimize the impact of imbalanced data. Our study sheds light on the significance of carefully utilizing data augmentation techniques without any data leakage to enhance the effectiveness of machine learning algorithms. Moreover, it offers a complete machine learning structure for healthcare practitioners, from data obtaining to machine learning prediction, enabling them to make informed decisions.</p></div>\",\"PeriodicalId\":73222,\"journal\":{\"name\":\"Healthcare analytics (New York, N.Y.)\",\"volume\":\"5 \",\"pages\":\"Article 100297\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2023-12-30\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"https://www.sciencedirect.com/science/article/pii/S2772442523001648/pdfft?md5=cbb15d1b9b72127ef6f0b213ad40bae0&pid=1-s2.0-S2772442523001648-main.pdf\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Healthcare analytics (New York, N.Y.)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://www.sciencedirect.com/science/article/pii/S2772442523001648\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Healthcare analytics (New York, N.Y.)","FirstCategoryId":"1085","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S2772442523001648","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
An investigation of machine learning algorithms and data augmentation techniques for diabetes diagnosis using class imbalanced BRFSS dataset
Diabetes is a prevalent chronic condition that poses significant challenges to early diagnosis and identifying at-risk individuals. Machine learning plays a crucial role in diabetes detection by leveraging its ability to process large volumes of data and identify complex patterns. However, imbalanced data, where the number of diabetic cases is substantially smaller than non-diabetic cases, complicates the identification of individuals with diabetes using machine learning algorithms. This study focuses on predicting whether a person is at risk of diabetes, considering the individual’s health and socio-economic conditions while mitigating the challenges posed by imbalanced data. We employ several data augmentation techniques, such as oversampling (Synthetic Minority Over Sampling for Nominal Data, i.e.SMOTE-N), undersampling (Edited Nearest Neighbor, i.e. ENN), and hybrid sampling techniques (SMOTE-Tomek and SMOTE-ENN) on training data before applying machine learning algorithms to minimize the impact of imbalanced data. Our study sheds light on the significance of carefully utilizing data augmentation techniques without any data leakage to enhance the effectiveness of machine learning algorithms. Moreover, it offers a complete machine learning structure for healthcare practitioners, from data obtaining to machine learning prediction, enabling them to make informed decisions.