An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data

IF 3.1 | CAS Tier 3 (Computer Science) | Q2 (Computer Science, Artificial Intelligence) | Computer Speech and Language | Pub Date: 2023-10-26 | DOI: 10.1016/j.csl.2023.101579
Joel Garcia-Arteaga, Jesús Zambrano-Zambrano, Jorge Parraga-Alava, Jorge Rodas-Silva
{"title":"An effective approach for identifying keywords as high-quality filters to get emergency-implicated Twitter Spanish data","authors":"Joel Garcia-Arteaga ,&nbsp;Jesús Zambrano-Zambrano ,&nbsp;Jorge Parraga-Alava ,&nbsp;Jorge Rodas-Silva","doi":"10.1016/j.csl.2023.101579","DOIUrl":null,"url":null,"abstract":"<div><p><span>Twitter has become a powerful knowledge source for data extraction for data mining projects due to the amount of data generated by its users, which allows researchers to find content of almost any topic in real time, but this depends on the quality of the keywords used, otherwise the extracted data will have a high percentage of irrelevant content. In this paper, we introduce a time-aware machine-learning-based approach to identify meaningful keywords to maximize the extraction of relevant emergency-related tweets when the Twitter API is used. We follow the </span><em>CRISP-DM</em> methodology. The first stage relies on <em>problem understanding</em>, where we detected the necessity of using meaningful keywords to filter content and extract data with more quality and reduce the percentage of irrelevant tweets. In the second stage, <em>data collection</em>, we used the official Twitter API to extract and label tweets as “<em>emergencia</em>” and “<em>no emergencia</em>”. After that, we analyzed the collected data (<em>data understanding</em><span>) to determine preprocessing techniques and to prepare the data for the model. Finally, in the </span><em>modeling</em> and <em>testing</em><span><span> stages, we trained a restricted Boltzmann machine and four variations of </span>autoencoders<span>, including an architecture proposed by a genetic algorithm, to use them as keyword identifiers and to determine which of them has the best performance to deploy it to production (</span></span><em>deployment</em> stage). The results show a slightly better performance of the autoencoder proposed by the genetic algorithm (GADAE), achieving a <span><math><msup><mrow><mi>R</mi></mrow><mrow><mn>2</mn></mrow></msup></math></span> <em>score</em> of 0.97, a <span><em>MAE</em></span> of <span><math><mrow><mn>14</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>3</mn></mrow></msup></mrow></math></span>, and a <span><em>MSE</em></span> of <span><math><mrow><mn>4</mn><mo>×</mo><mn>1</mn><msup><mrow><mn>0</mn></mrow><mrow><mo>−</mo><mn>4</mn></mrow></msup></mrow></math></span>. GADAE, the best model, managed to extract 110% more relevant tweets than manual filtering in the context of emergency-implicated tweets in Ecuador.</p></div>","PeriodicalId":50638,"journal":{"name":"Computer Speech and Language","volume":null,"pages":null},"PeriodicalIF":3.1000,"publicationDate":"2023-10-26","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Computer Speech and Language","FirstCategoryId":"94","ListUrlMain":"https://www.sciencedirect.com/science/article/pii/S0885230823000980","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q2","JCRName":"COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE","Score":null,"Total":0}
Citations: 0

Abstract

Twitter has become a powerful knowledge source for data extraction in data mining projects due to the volume of data generated by its users, which allows researchers to find content on almost any topic in real time. However, this depends on the quality of the keywords used; otherwise, the extracted data will contain a high percentage of irrelevant content. In this paper, we introduce a time-aware, machine-learning-based approach to identify meaningful keywords that maximize the extraction of relevant emergency-related tweets when the Twitter API is used. We follow the CRISP-DM methodology. The first stage, problem understanding, is where we identified the need for meaningful keywords to filter content, extract higher-quality data, and reduce the percentage of irrelevant tweets. In the second stage, data collection, we used the official Twitter API to extract tweets and label them as “emergencia” or “no emergencia”. After that, we analyzed the collected data (data understanding) to determine preprocessing techniques and to prepare the data for the model. Finally, in the modeling and testing stages, we trained a restricted Boltzmann machine and four variations of autoencoders, including an architecture proposed by a genetic algorithm, to use them as keyword identifiers and to determine which of them performs best before deploying it to production (deployment stage). The results show a slightly better performance of the autoencoder proposed by the genetic algorithm (GADAE), achieving an R² score of 0.97, an MAE of 14 × 10⁻³, and an MSE of 4 × 10⁻⁴. GADAE, the best model, extracted 110% more relevant tweets than manual filtering in the context of emergency-implicated tweets in Ecuador.
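The abstract describes using autoencoder-based keyword identifiers and evaluating them with R², MAE, and MSE. The sketch below is only a rough illustration of that idea, not the paper's GADAE architecture or feature set: it trains a small dense autoencoder over placeholder keyword feature vectors, computes the same three metrics with scikit-learn, and ranks keywords by reconstruction error. The layer sizes, the synthetic features, and the ranking step are all assumptions.

```python
# Minimal sketch (not the paper's GADAE): a dense autoencoder over keyword
# feature vectors, evaluated with the metrics reported in the abstract.
# Feature construction, layer sizes, and training settings are assumptions.
import numpy as np
import torch
import torch.nn as nn
from sklearn.metrics import r2_score, mean_absolute_error, mean_squared_error

class KeywordAutoencoder(nn.Module):
    def __init__(self, n_features: int, latent_dim: int = 16):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(n_features, 64), nn.ReLU(),
            nn.Linear(64, latent_dim), nn.ReLU(),
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64), nn.ReLU(),
            nn.Linear(64, n_features),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

# Toy stand-in for keyword feature vectors (e.g., TF-IDF or temporal statistics).
rng = np.random.default_rng(0)
X = rng.random((512, 100)).astype(np.float32)
X_t = torch.from_numpy(X)

model = KeywordAutoencoder(n_features=X.shape[1])
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
loss_fn = nn.MSELoss()

for epoch in range(200):
    optimizer.zero_grad()
    loss = loss_fn(model(X_t), X_t)
    loss.backward()
    optimizer.step()

# Reconstruction quality, using the same metrics as the abstract.
with torch.no_grad():
    recon = model(X_t).numpy()
print("R2 :", r2_score(X, recon))
print("MAE:", mean_absolute_error(X, recon))
print("MSE:", mean_squared_error(X, recon))

# Keywords whose feature vectors reconstruct well (low per-row error) could be
# ranked as candidate filters; any threshold would be application-specific.
row_err = ((X - recon) ** 2).mean(axis=1)
top_keywords = np.argsort(row_err)[:10]  # indices of best-reconstructed rows
```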

Source journal: Computer Speech and Language (Engineering & Technology - Computer Science: Artificial Intelligence)
CiteScore: 11.30
Self-citation rate: 4.70%
Articles published: 80
Review time: 22.9 weeks
Journal description: Computer Speech & Language publishes reports of original research related to the recognition, understanding, production, coding and mining of speech and language. The speech and language sciences have a long history, but it is only relatively recently that large-scale implementation of and experimentation with complex models of speech and language processing has become feasible. Such research is often carried out somewhat separately by practitioners of artificial intelligence, computer science, electronic engineering, information retrieval, linguistics, phonetics, or psychology.
Latest articles in this journal:
Editorial Board
Enhancing analysis of diadochokinetic speech using deep neural networks
Copiously Quote Classics: Improving Chinese Poetry Generation with historical allusion knowledge
Significance of chirp MFCC as a feature in speech and audio applications
Artificial disfluency detection, uh no, disfluency generation for the masses