Addressing imbalanced data in predicting injury severity after traffic crashes: A comparative analysis of machine learning models

Procedia Computer Science, vol. 238, pp. 24-31. Pub Date: 2024-01-01. Epub Date: 2024-07-08. DOI: 10.1016/j.procs.2024.05.192

Sadjad Bazarnovi, Abolfazl (Kouros) Mohammadian


Citations: 0

Abstract



Road traffic crashes are a significant public health concern, causing substantial human and financial losses. Accurately predicting injury severity is crucial for optimizing rescue efforts and saving lives. This study uses several Machine Learning (ML) algorithms, including Random Forest, Logistic Regression, XGBoost, and Support Vector Machine (SVM), to predict crash severity. The dataset comprises crash data from the City of Chicago from 2015 to 2023 and is highly imbalanced, with a non-severe to severe ratio of 1000 to 1. To address this class imbalance, the study evaluates several data sampling methods, including oversampling, undersampling, and hybrid sampling. Model performance is assessed using AUC-ROC and recall, because accuracy alone is misleading on imbalanced datasets. The results show that conventional data sampling methods are ineffective when the data is highly imbalanced. Consequently, a novel approach was adopted: randomly removing observations before applying the data sampling methods, which led to a significant improvement in model performance. SVM-SMOTE and ClusterCentroid emerge as the most effective resampling methods, and among all ML models, SVM demonstrates the best overall performance. The findings aim to help emergency responders quickly evaluate the severity of an incident upon receiving a report.
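To make the abstract's two technical points concrete, the following is a minimal sketch, not the authors' implementation. It shows why accuracy is uninformative at a 1000-to-1 ratio while recall and AUC-ROC are not, and it illustrates one possible reading of "random removal of observations" before resampling. The abstract does not say which class is thinned or by how much, so dropping majority-class rows and the `keep_ratio` parameter are assumptions; `thin_majority` and the other function names are hypothetical, and the labels are synthetic.

```python
import random


def thin_majority(labels, keep_ratio, seed=0):
    """Randomly drop majority-class rows (label 0) before any resampling.

    Keeps every minority row (label 1) and at most `keep_ratio` majority
    rows per minority row. `keep_ratio` is an illustrative assumption.
    Returns the sorted indices of the rows that are kept.
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    target = min(len(majority), keep_ratio * len(minority))
    return sorted(minority + rng.sample(majority, target))


def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of true severe cases (label 1) that were predicted severe."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if tp + fn else 0.0


def auc_roc(y_true, scores):
    """AUC as the probability that a random positive outscores a random
    negative, counting ties as one half. Quadratic time; fine for a demo."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels at the 1000:1 ratio mentioned in the abstract.
y = [0] * 10000 + [1] * 10

# A useless model that always predicts "non-severe".
always_non_severe = [0] * len(y)
acc = accuracy(y, always_non_severe)      # very high despite being useless
rec = recall(y, always_non_severe)        # no severe crash is detected
auc = auc_roc(y, [0.0] * len(y))          # constant scores carry no ranking

# Thin the majority class to 20:1 before handing the data to a resampler.
kept = thin_majority(y, keep_ratio=20)
y_thin = [y[i] for i in kept]
```

After thinning, a resampler would be fitted on the reduced training set only, never on the test set. In the imbalanced-learn library, the `SVMSMOTE` and `ClusterCentroids` classes are the likely counterparts of the two methods the abstract names, though the abstract itself does not state which implementation was used.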
