GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring

IF 0.8 4区 计算机科学 Q4 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Intelligent Data Analysis Pub Date : 2023-07-06 DOI:10.3233/ida-230040
R. Découpes, M. Roche, M. Teisseire
{"title":"GeoNLPlify: A spatial data augmentation enhancing text classification for crisis monitoring","authors":"R. Découpes, M. Roche, M. Teisseire","doi":"10.3233/ida-230040","DOIUrl":null,"url":null,"abstract":"Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explicability methods, we aim to highlight that fine-tuned BERT-like models on crisis corpora give too much importance to spatial information to make their predictions. This overfitting of spatial information limits their ability to generalize especially when the event which occurs in a place has evolved and changed since the training dataset has been built. To reduce this bias, we propose GeoNLPlify,1 a novel data augmentation technique that leverages spatial information to generate new labeled data for text classification related to crises. Our approach aims to address overfitting without necessitating modifications to the underlying model architecture, distinguishing it from other prevalent methods employed to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of the spatial information for data augmentation for crisis-related text classification tasks. In order to evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.","PeriodicalId":50355,"journal":{"name":"Intelligent Data Analysis","volume":" ","pages":""},"PeriodicalIF":0.8000,"publicationDate":"2023-07-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Intelligent Data Analysis","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.3233/ida-230040","RegionNum":4,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
引用次数: 0

Abstract

Crises such as natural disasters and public health emergencies generate vast amounts of text data, making it challenging to classify the information into relevant categories. Acquiring expert-labeled data for such scenarios can be difficult, leading to limited training datasets for text classification by fine-tuning BERT-like models. Unfortunately, traditional data augmentation techniques only slightly improve F1-scores. How can data augmentation be used to obtain better results in this applied domain? In this paper, using neural network explicability methods, we aim to highlight that fine-tuned BERT-like models on crisis corpora give too much importance to spatial information to make their predictions. This overfitting of spatial information limits their ability to generalize especially when the event which occurs in a place has evolved and changed since the training dataset has been built. To reduce this bias, we propose GeoNLPlify,1 a novel data augmentation technique that leverages spatial information to generate new labeled data for text classification related to crises. Our approach aims to address overfitting without necessitating modifications to the underlying model architecture, distinguishing it from other prevalent methods employed to combat overfitting. Our results show that GeoNLPlify significantly improves F1-scores, demonstrating the potential of the spatial information for data augmentation for crisis-related text classification tasks. In order to evaluate the contribution of our method, GeoNLPlify is applied to three public datasets (PADI-web, CrisisNLP and SST2) and compared with classical natural language processing data augmentations.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
GeoNLPlify:一种用于危机监测的增强文本分类的空间数据增强
自然灾害和突发公共卫生事件等危机产生了大量的文本数据,因此将信息分类为相关类别具有挑战性。获取此类场景的专家标记数据可能很困难,通过微调类似BERT的模型,导致文本分类的训练数据集有限。不幸的是,传统的数据扩充技术只能略微提高F1成绩。如何使用数据扩充在这个应用领域中获得更好的结果?在本文中,使用神经网络可解释性方法,我们旨在强调危机语料库上微调的类BERT模型过于重视空间信息,无法进行预测。这种空间信息的过拟合限制了它们的泛化能力,尤其是当自训练数据集建立以来,发生在一个地方的事件发生了演变和变化时。为了减少这种偏差,我们提出了GeoNLPlify,1这是一种新的数据增强技术,利用空间信息生成新的标记数据,用于与危机相关的文本分类。我们的方法旨在解决过拟合问题,而无需对底层模型架构进行修改,将其与其他用于对抗过拟合的流行方法区分开来。我们的结果表明,GeoNLPlify显著提高了F1分数,证明了空间信息在危机相关文本分类任务中用于数据增强的潜力。为了评估我们的方法的贡献,将GeoNLPlify应用于三个公共数据集(PADI-web、CrisisNLP和SST2),并与经典的自然语言处理数据增强进行了比较。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
Intelligent Data Analysis
Intelligent Data Analysis 工程技术-计算机:人工智能
CiteScore
2.20
自引率
5.90%
发文量
85
审稿时长
3.3 months
期刊介绍: Intelligent Data Analysis provides a forum for the examination of issues related to the research and applications of Artificial Intelligence techniques in data analysis across a variety of disciplines. These techniques include (but are not limited to): all areas of data visualization, data pre-processing (fusion, editing, transformation, filtering, sampling), data engineering, database mining techniques, tools and applications, use of domain knowledge in data analysis, big data applications, evolutionary algorithms, machine learning, neural nets, fuzzy logic, statistical pattern recognition, knowledge filtering, and post-processing. In particular, papers are preferred that discuss development of new AI related data analysis architectures, methodologies, and techniques and their applications to various domains.
期刊最新文献
ELCA: Enhanced boundary location for Chinese named entity recognition via contextual association Identifying relevant features of CSE-CIC-IDS2018 dataset for the development of an intrusion detection system Knowledge graph embedding in a uniform space MeFiNet: Modeling multi-semantic convolution-based feature interactions for CTR prediction Enhancing Adaboost performance in the presence of class-label noise: A comparative study on EEG-based classification of schizophrenic patients and benchmark datasets
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:604180095
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1