Using Balancing Methods to Improve Glycaemia-Based Data Mining

Diogo Machado, Vítor Costa, Pedro Brandão
Published in: Proceedings of the International Conference on Health Informatics and Medical Application Technology, volume 92 1, pages 188-198
Publication date: 2023-01-01
DOI: 10.5220/0011797100003414 (https://doi.org/10.5220/0011797100003414)
Citations: 0

Abstract

Imbalanced data sets pose a complex problem in data mining. Health-related data sets, where the positive class is tied to the existence of an anomaly, are prone to imbalance. Data related to diabetes management follows this trend: patients try to avoid episodes of hypo/hyperglycaemia, which are precisely the anomalies we want to detect. Balancing methods can provide more examples of the minority class and assist the classifier by cleaning up the decision boundary. Nevertheless, each over-sampling and under-sampling method affects the data set in its own way, which in turn influences the classifier's performance. In this work, the authors studied the impact of the best-known data-balancing methods applied to the Ohio and St. Louis diabetes-related data sets. The best and most robust approach was the combination of Edited Nearest Neighbours (ENN) with the Synthetic Minority Over-sampling Technique (SMOTE). This hybrid method produced significant performance gains on all the performed tests, and ENN in particular had a meaningful impact on all of them. Given the limited volume of glycaemia-based data available for diabetes management, over-sampling methods would be expected to play the greater role in improving the classifier's performance. In our experiments, however, the removal of noisy values by the under-sampling methods produced better results.
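To make the hybrid approach named in the abstract more concrete, the following is a minimal sketch of SMOTE followed by ENN in plain Python. It is not the authors' implementation. The function names (`k_nearest`, `smote`, `enn`), the value `k=3`, the order of the two steps, the choice to let ENN edit both classes, and the toy two-dimensional data are all assumptions made for illustration. The sketch also assumes a binary problem with at least two minority samples.

```python
import math
import random
from collections import Counter


def k_nearest(idx, points, candidates, k):
    """Return the k indices in `candidates` whose points are closest to points[idx]."""
    others = [j for j in candidates if j != idx]
    others.sort(key=lambda j: math.dist(points[idx], points[j]))
    return others[:k]


def smote(X, y, minority, k=3, seed=0):
    """Over-sample `minority` with synthetic points until both classes are equal in size."""
    rng = random.Random(seed)
    min_idx = [i for i, label in enumerate(y) if label == minority]
    n_new = len(y) - 2 * len(min_idx)  # majority count minus minority count (binary case)
    X_out, y_out = list(X), list(y)
    for _ in range(max(n_new, 0)):
        i = rng.choice(min_idx)
        j = rng.choice(k_nearest(i, X, min_idx, k))
        gap = rng.random()
        # The synthetic point lies on the segment between a minority sample
        # and one of its k nearest minority neighbours.
        X_out.append(tuple(a + gap * (b - a) for a, b in zip(X[i], X[j])))
        y_out.append(minority)
    return X_out, y_out


def enn(X, y, k=3):
    """Drop every sample whose label disagrees with the majority vote of its k nearest neighbours."""
    all_idx = range(len(X))
    keep = []
    for i in all_idx:
        votes = Counter(y[j] for j in k_nearest(i, X, all_idx, k))
        if votes.most_common(1)[0][0] == y[i]:
            keep.append(i)
    return [X[i] for i in keep], [y[i] for i in keep]


# Hypothetical toy data: 40 majority points (label 0) and 8 minority points (label 1).
rng = random.Random(42)
X = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(40)]
X += [(rng.gauss(3, 1), rng.gauss(3, 1)) for _ in range(8)]
y = [0] * 40 + [1] * 8

X_sm, y_sm = smote(X, y, minority=1)   # over-sample the minority class
X_res, y_res = enn(X_sm, y_sm)         # then remove samples that look like noise

print("original:  ", Counter(y))
print("after SMOTE:", Counter(y_sm))
print("after ENN:  ", Counter(y_res))
```

SMOTE only adds minority samples, while ENN only removes samples that sit among neighbours of the other class, so the second step is the one that cleans the decision boundary. In practice, a maintained library such as imbalanced-learn, which offers a combined `SMOTEENN` sampler, would likely be preferable to this hand-written version.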