A deep generative approach for crash frequency model with heterogeneous imbalanced data

IF 12.5 | Zone 1, Engineering Technology | Q1, Public, Environmental & Occupational Health | Analytic Methods in Accident Research | Pub Date: 2022-06-01 | DOI: 10.1016/j.amar.2022.100212
Hongliang Ding , Yuhuan Lu , N.N. Sze , Tiantian Chen , Yanyong Guo , Qinghai Lin
Analytic Methods in Accident Research, Volume 34, Article 100212. Full text: https://www.sciencedirect.com/science/article/pii/S221366572200001X
Citations: 23

Abstract

Crash frequency models are often subject to excessive zero observations because crashes are rare events. To address the problem of imbalanced crash data, a deep generative approach, the augmented variational autoencoder, is proposed to generate synthetic crash data for measuring the association between crashes and possible explanatory factors. The approach is characterized by a factorized generative model and a refined objective function. The generative model can handle heterogeneous data, including real-valued, nominal, and ordinal distributions. The refined objective function can control for the random effect by better recognizing both zero-crash and non-zero-crash cases. Comprehensive traffic and crash data of multiple distribution types in Hong Kong for the period 2014 to 2016 were used. To assess the data generation performance of the proposed augmented variational autoencoder method, a conventional data synthesis technique (synthetic minority oversampling technique-nominal continuous) was also considered. The performance of crash frequency models for total crashes and for fatal and severe injury crashes was assessed. For total crashes, the parameter estimation results (statistical fit, prediction accuracy, and explanatory factors identified) of the crash frequency model based on synthetic data from the augmented variational autoencoder method were closer to those based on the original data than were the results based on synthetic data from the synthetic minority oversampling technique-nominal continuous method. For fatal and severe injury crashes, zero-crash observations were prevalent, with a ratio of zero-crash to non-zero-crash cases of 9 to 1. The crash data were first balanced using the proposed augmented variational autoencoder method. Fatal and severe injury crash frequency models were then estimated as correlated random parameter models on the original data and on the balanced data, respectively. The results indicate that the fatal and severe injury crash frequency model based on balanced data outperforms its counterpart, with lower root mean square error, lower mean absolute error, and more crash explanatory factors identified. More importantly, correlation between the random parameters can be revealed. The findings should be informative to both researchers and practitioners developing crash frequency models, for which excessive zero observations are prevalent when traffic and crash data highly disaggregated by time and space are used.
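The abstract describes imbalance as a ratio of zero-crash to non-zero-crash cases and compares models by root mean square error and mean absolute error. The following is a minimal sketch of those three quantities using their standard definitions. The function names and the toy data are hypothetical and are not taken from the paper; the paper's own data and model outputs are not reproduced here.

```python
import math


def zero_ratio(counts):
    """Ratio of zero-crash to non-zero-crash observations."""
    zeros = sum(1 for c in counts if c == 0)
    nonzeros = len(counts) - zeros
    if nonzeros == 0:
        raise ValueError("no non-zero observations")
    return zeros / nonzeros


def rmse(observed, predicted):
    """Root mean square error between observed and predicted counts."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)


def mae(observed, predicted):
    """Mean absolute error between observed and predicted counts."""
    n = len(observed)
    return sum(abs(o - p) for o, p in zip(observed, predicted)) / n


# Hypothetical toy data (not from the paper): 18 zero-crash
# observations and 2 non-zero ones, i.e. a 9-to-1 imbalance.
counts = [0] * 18 + [1, 2]
# Hypothetical model predictions for the same 20 observations.
predicted = [0.1] * 18 + [0.8, 1.5]

print("zero : non-zero ratio =", zero_ratio(counts))
print("RMSE =", rmse(counts, predicted))
print("MAE  =", mae(counts, predicted))
```

With a 9-to-1 imbalance, a model that predicts values near zero everywhere can still score a small RMSE and MAE, which is one reason the abstract also compares models by the explanatory factors they identify.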

Source journal
CiteScore: 22.10
Self-citation rate: 34.10%
Articles published: 35
Review time: 24 days
Journal description: Analytic Methods in Accident Research publishes articles on the development and application of advanced statistical and econometric methods for studying vehicle crashes and other accidents. The journal aims to demonstrate how these approaches can provide new insights into the factors influencing the occurrence and severity of accidents, and thereby offer guidance for implementing appropriate preventive measures. While the journal focuses primarily on the analytic approach, it also accepts articles covering various aspects of transportation safety (such as road, pedestrian, air, rail, and water safety), construction safety, and other areas where human behavior, machine failures, or system failures lead to property damage or bodily harm.