Addressing imbalanced data in predicting injury severity after traffic crashes: A comparative analysis of machine learning models

Sadjad Bazarnovi, Abolfazl (Kouros) Mohammadian
Procedia Computer Science, Volume 238, Pages 24-31. Published 2024-01-01.
DOI: 10.1016/j.procs.2024.05.192
URL: https://www.sciencedirect.com/science/article/pii/S1877050924012304

Abstract

Road traffic crashes are a significant public health concern, leading to substantial human and financial losses. Accurately predicting injury severity is crucial for optimizing rescue efforts and saving lives. This study uses several Machine Learning (ML) algorithms, such as Random Forest, Logistic Regression, XGBoost, and Support Vector Machine (SVM), to predict crash severity. The dataset comprises crash data from the City of Chicago from 2015 to 2023 and is highly imbalanced, with a ratio of non-severe to severe incidents of 1000 to 1. To address this class imbalance, the study evaluates several data sampling methods, including oversampling, undersampling, and hybrid sampling. Model performance is assessed using AUC-ROC and recall, because accuracy is a limited measure on imbalanced datasets. Results show that conventional data sampling methods are ineffective when the data is highly imbalanced. Consequently, a novel approach was adopted in which observations are randomly removed before the data sampling methods are applied, leading to a significant improvement in model performance. SVM-SMOTE and ClusterCentroid emerge as the most effective resampling methods. Among all ML models, SVM demonstrates the best overall performance. The findings of this study aim to help emergency responders quickly evaluate the severity of an incident upon receiving a report.
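
The two ideas in the abstract that lend themselves to a small illustration are the random removal of observations before resampling and the use of recall and AUC-ROC in place of accuracy. The following is a minimal sketch in plain Python, not the authors' implementation. It assumes the removal targets the non-severe (majority) class, which the abstract does not state explicitly; the function names, the `target_ratio` parameter, and the toy labels are hypothetical.

```python
import random


def downsample_majority(labels, target_ratio, seed=0):
    """Randomly drop majority rows (label 0) so that at most `target_ratio`
    majority rows remain per minority row (label 1).

    Assumption: removal is applied to the majority class only.
    Returns the sorted indices of the rows that are kept.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    keep_n = min(len(majority), target_ratio * len(minority))
    return sorted(minority + rng.sample(majority, keep_n))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(1 for t, p in zip(y_true, y_pred) if t == p) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true positives (severe crashes) that were predicted positive."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def auc_roc(y_true, scores):
    """AUC-ROC as the probability that a random positive outscores a random
    negative, counting ties as one half. Quadratic time; fine for a toy example."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy data with the 1000:1 imbalance reported in the abstract (hypothetical labels).
labels = [1] * 5 + [0] * 5000
always_non_severe = [0] * len(labels)

print("accuracy:", accuracy(labels, always_non_severe))
print("recall:  ", recall(labels, always_non_severe))

# Reduce the imbalance to 10:1 before any oversampling or undersampling step.
kept = downsample_majority(labels, target_ratio=10, seed=42)
print("rows kept:", len(kept))
```

On 1000:1 data, a model that always predicts "non-severe" reaches roughly 99.9% accuracy while its recall on severe crashes is zero, which is why the study reports recall and AUC-ROC. After the removal step, a resampler such as SVM-SMOTE or ClusterCentroid would be applied to the reduced data; that step is omitted here because it depends on an external library.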
