The Importance of Specific Phrases in Automatically Classifying Mine Accident Narratives Using Natural Language Processing

R. Pothina, R. Ganguli
DOI: 10.3390/knowledge2030021 (https://doi.org/10.3390/knowledge2030021)
Published: 2022-07-29 · Journal Article
Citations: 1

Abstract

The mining industry is diligent about reporting safety incidents. However, these reports are not necessarily analyzed holistically to gain deep insights. Previously, it was demonstrated that mine accident narratives at a partner mine site could be automatically classified using natural language processing (NLP)-based random forest (RF) models developed using narratives from the United States Mine Safety and Health Administration (MSHA) database. Classifying narratives is important from a holistic perspective because it affects safety intervention strategies. This paper continues that work to improve RF classification performance in the “caught in” category. Three approaches are presented. First, two new methods were developed: the similarity score (SS) method and the accident-specific expert choice vocabulary (ASECV) method. The SS method focused on the words or phrases that occurred most frequently, while ASECV, a heuristic approach, focused on a narrow set of phrases. The two methods were tested in a series of experiments (iterations) on MSHA narratives in the “caught in” accident category. The SS method was not very successful because of its high false positive rates. The ASECV method, on the other hand, had low false positive rates. In the third approach (the “stacking” method), a highly successful iteration of the ASECV method was stacked with the previously developed RF model, and the overall predictability of the combined model improved from 71% to 73.28%. Thus, the research showed that some phrases are key to describing particular types of accidents (“caught in” in this case).
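As a rough illustration of the phrase-based idea, the sketch below flags a narrative as “caught in” when it contains any phrase from a small vocabulary, then combines that flag with a base model's prediction. Everything specific here is an assumption for illustration: the phrase list is hypothetical (the abstract does not list the expert-chosen vocabulary), the logical-OR combination is only one possible way to stack two classifiers (the abstract does not say how stacking was done), and the function names are invented.

```python
from typing import Iterable, Sequence

# Hypothetical phrase list for illustration only; the paper's actual
# expert-chosen vocabulary is not given in the abstract.
EXAMPLE_PHRASES = ("caught between", "pinched between", "caught in")


def phrase_flag(narrative: str, phrases: Iterable[str] = EXAMPLE_PHRASES) -> bool:
    """Return True if any phrase occurs in the narrative (case-insensitive)."""
    text = " ".join(narrative.lower().split())  # normalize case and whitespace
    return any(p in text for p in phrases)


def stacked_predict(base_pred: bool, narrative: str,
                    phrases: Iterable[str] = EXAMPLE_PHRASES) -> bool:
    """Combine a base model's prediction with the phrase rule via logical OR.

    The OR rule is an assumption; the abstract does not say how the
    two models were stacked.
    """
    return base_pred or phrase_flag(narrative, phrases)


def false_positive_rate(y_true: Sequence[bool], y_pred: Sequence[bool]) -> float:
    """FP / (FP + TN); returns 0.0 when there are no negative examples."""
    fp = sum(1 for t, p in zip(y_true, y_pred) if p and not t)
    tn = sum(1 for t, p in zip(y_true, y_pred) if not p and not t)
    return fp / (fp + tn) if (fp + tn) else 0.0
```

A narrow phrase list of this kind tends to trade recall for a low false positive rate, which is consistent with the behavior the abstract reports for ASECV; here `base_pred` stands in for the output of a separately trained RF model.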