Towards corpora creation from social web in Brazilian Portuguese to support public security analyses and decisions

Library Hi Tech · Information Science & Library Science · IF 3.4 · Zone 3 (Management) · Pub Date: 2022-10-25 · DOI: 10.1108/lht-08-2022-0401
Victor Diogho Heuer de Carvalho, A. Costa
Citations: 4

Abstract

Purpose: This article presents two Brazilian Portuguese corpora collected from different media concerning public security issues in a specific location. The primary motivation is supporting analyses, so security authorities can make appropriate decisions about their actions.

Design/methodology/approach: The corpora were obtained through web scraping from a newspaper's website and tweets from a Brazilian metropolitan region. Natural language processing was applied considering: text cleaning, lemmatization, summarization, part-of-speech and dependencies parsing, named entities recognition, and topic modeling.

Findings: Several results were obtained based on the methodology used, highlighting some: an example of a summarization using an automated process; dependency parsing; the most common topics in each corpus; the forty named entities and the most common slogans were extracted, highlighting those linked to public security.

Research limitations/implications: Some critical tasks were identified for the research perspective, related to the applied methodology: the treatment of noise from obtaining news on their source websites, passing through textual elements quite present in social network posts such as abbreviations, emojis/emoticons, and even writing errors; the treatment of subjectivity, to eliminate noise from irony and sarcasm; the search for authentic news of issues within the target domain. All these tasks aim to improve the process to enable interested authorities to perform accurate analyses.

Practical implications: The corpora dedicated to the public security domain enable several analyses, such as mining public opinion on security actions in a given location; understanding criminals' behaviors reported in the news or even on social networks and drawing their attitudes timeline; detecting movements that may cause damage to public property and people welfare through texts from social networks; extracting the history and repercussions of police actions, crossing news with records on social networks; among many other possibilities.

Originality/value: The work on behalf of the corpora reported in this text represents one of the first initiatives to create textual bases in Portuguese, dedicated to Brazil's specific public security domain.
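
The abstract names text cleaning as the first processing step and lists URLs-style noise, abbreviations, and emojis/emoticons among the difficulties of social network posts. The abstract does not say which tools the authors used, so the following is only a minimal sketch of what such a cleaning step might look like, written with the Python standard library. The function names, the regular expressions, the tiny stopword list, and the sample posts are all assumptions made for illustration, not the authors' implementation.

```python
import re
import unicodedata
from collections import Counter

# Illustrative patterns (assumptions, not taken from the paper).
URL_RE = re.compile(r"https?://\S+|www\.\S+")
MENTION_RE = re.compile(r"@\w+")
TOKEN_RE = re.compile(r"[^\W\d_]+")  # runs of letters, accented ones included

# Tiny hypothetical stopword list; a real pipeline would use a fuller one.
STOPWORDS = {"a", "o", "e", "de", "da", "do", "em", "na", "no",
             "que", "para", "com", "um", "uma", "os", "as"}


def clean_text(text: str) -> str:
    """Remove URLs, @mentions, '#' signs, and symbol characters such as emoji."""
    text = URL_RE.sub(" ", text)
    text = MENTION_RE.sub(" ", text)
    text = text.replace("#", " ")  # keep the hashtag word, drop the sign
    # Unicode categories starting with "S" cover emoji and other symbols.
    text = "".join(ch for ch in text
                   if not unicodedata.category(ch).startswith("S"))
    return re.sub(r"\s+", " ", text).strip().lower()


def tokenize(text: str) -> list[str]:
    """Split cleaned text into letter-only tokens, dropping stopwords."""
    return [t for t in TOKEN_RE.findall(clean_text(text)) if t not in STOPWORDS]


def top_terms(docs: list[str], n: int = 5) -> list[tuple[str, int]]:
    """Count the most frequent tokens across a list of documents."""
    counts = Counter()
    for doc in docs:
        counts.update(tokenize(doc))
    return counts.most_common(n)


if __name__ == "__main__":
    # Made-up sample post, for illustration only.
    post = "Assalto no centro 😱 @usuario https://t.co/abc #segurança"
    print(clean_text(post))
    print(tokenize(post))
```

Lemmatization, part-of-speech tagging, dependency parsing, named entity recognition, and topic modeling would normally rely on a trained Portuguese language model and are outside the scope of this sketch. Irony, sarcasm, and abbreviations, which the abstract flags as open problems, are not handled here either.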
Source journal: Library Hi Tech (Information Science & Library Science)
CiteScore: 8.30
Self-citation rate: 44.10%
Articles published: 97
Journal scope: integrated library systems; networking; strategic planning; policy implementation across entire institutions; security; automation systems; the role of consortia; resource access initiatives; architecture and technology; electronic publishing; library technology in specific countries; user perspectives on technology; how technology can help disabled library users; library-related web sites.