Authors: Sadjad Bazarnovi, Abolfazl (Kouros) Mohammadian
Journal: Procedia Computer Science, Volume 238, Pages 24-31
Publication date: 2024-01-01
DOI: 10.1016/j.procs.2024.05.192
URL: https://www.sciencedirect.com/science/article/pii/S1877050924012304
Addressing imbalanced data in predicting injury severity after traffic crashes: A comparative analysis of machine learning models
Road traffic crashes are a significant public health concern, leading to substantial human and financial losses. Accurately predicting injury severity is crucial for optimizing rescue efforts and saving lives. This study utilizes various Machine Learning (ML) algorithms, such as Random Forest, Logistic Regression, XGBoost, and Support Vector Machine (SVM), to predict crash severity. The dataset spans from 2015 to 2023, comprising crash data from the City of Chicago, featuring a highly imbalanced ratio of non-severe to severe incidents (1000 to 1). To address class imbalance challenges, the study evaluates various data sampling methods, including oversampling, undersampling, and hybrid sampling. Model performance is assessed using AUC-ROC and recall, because accuracy is a misleading measure on imbalanced datasets. Results reveal the inefficacy of conventional data sampling methods where data is highly imbalanced. Consequently, a novel approach was adopted, involving the random removal of observations before applying data sampling methods, leading to a significant improvement in model performance. SVM-SMOTE and ClusterCentroid emerge as the most effective resampling methods. Notably, among all ML models, SVM demonstrates the best overall performance. The final findings of this study aim to assist emergency responders in quickly evaluating the severity of an incident upon receiving a report.
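The two ideas in the abstract that are easiest to misread are why accuracy is uninformative at a 1000 to 1 ratio and what "random removal of observations before applying data sampling methods" might look like in practice. The following is a minimal sketch on synthetic labels, not the authors' code. It assumes that only non-severe (majority) rows are removed and that removal stops at a chosen majority-to-minority ratio; the abstract states neither. The function names, the `target_ratio` parameter, and the value 50 are hypothetical.

```python
import random
from collections import Counter


def randomly_reduce_majority(labels, target_ratio, seed=0):
    """Return sorted row indices kept after randomly dropping majority rows.

    labels: sequence of 0 (non-severe) and 1 (severe).
    target_ratio: majority rows to keep per minority row (hypothetical knob;
    the abstract does not say how many observations were removed).
    """
    rng = random.Random(seed)
    minority = [i for i, y in enumerate(labels) if y == 1]
    majority = [i for i, y in enumerate(labels) if y == 0]
    n_keep = min(len(majority), int(target_ratio * len(minority)))
    return sorted(minority + rng.sample(majority, n_keep))


def accuracy(y_true, y_pred):
    """Fraction of rows whose predicted label matches the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall(y_true, y_pred):
    """Fraction of truly severe rows that were predicted severe."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    return tp / (tp + fn) if (tp + fn) else 0.0


def roc_auc(y_true, scores):
    """AUC-ROC as the chance a random positive outscores a random negative.

    Ties count as half a win. This pairwise form is quadratic in the data
    size, which is acceptable for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Synthetic labels at the 1000 to 1 ratio quoted in the abstract.
labels = [1] * 10 + [0] * 10_000

# A useless model that always predicts "non-severe".
always_negative = [0] * len(labels)
print("accuracy:", accuracy(labels, always_negative))  # about 0.999
print("recall:", recall(labels, always_negative))      # 0.0
print("auc:", roc_auc(labels, [0.0] * len(labels)))    # 0.5

# Assumed pre-reduction step: keep 50 majority rows per minority row.
kept = randomly_reduce_majority(labels, target_ratio=50)
reduced = [labels[i] for i in kept]
print("class counts after reduction:", Counter(reduced))
```

The always-negative model scores near-perfect accuracy while finding none of the severe crashes, which is why the study reports recall and AUC-ROC instead. After a reduction step like this one, the remaining rows would be passed to a resampler; in the imbalanced-learn library, the two methods the abstract singles out correspond to the `SVMSMOTE` and `ClusterCentroids` classes.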