An auxiliary Part‐of‐Speech tagger for blog and microblog cyber‐slang

Silvia Golia, Paola Zola
Published in Statistical Analysis and Data Mining: The ASA Data Science Journal, 2022-09-06. DOI: 10.1002/sam.11596 (https://doi.org/10.1002/sam.11596)

Abstract

The growing reach of Web 2.0 has brought increased use of slang, abbreviations, and emphasized words, which limits the performance of traditional natural language processing models. State‐of‐the‐art Part‐of‐Speech (POS) taggers are often unable to assign a meaningful POS tag to every word in a Web 2.0 text. To address this limitation, we propose an auxiliary POS tagger that assigns a POS tag to a given token based on information derived from the sequence of preceding and following POS tags. Its main advantage is that it does not need any information about the token itself, since it relies only on the sequences of existing POS tags. The tagger is called auxiliary because it requires an initial POS tagging step, which might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed of Twitter messages, and on a corpus of manually labeled Web 2.0 data.
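To make the idea of tagging from neighboring tags concrete, the following is a minimal sketch, not the authors' Bayesian network. It is a simplified stand-in that estimates which tag most often appears between a given previous and next tag, using plain frequency counts, and uses that to fill in tokens the initial tagger could not label. The names `train_context_model`, `fill_unknown_tags`, the `UNK` placeholder, the `<S>` boundary marker, and the toy training sentences are all assumptions introduced for illustration.

```python
from collections import Counter, defaultdict

UNKNOWN = "UNK"   # hypothetical placeholder left by the initial tagging step
BOUNDARY = "<S>"  # hypothetical marker for the start and end of a sentence


def train_context_model(tagged_sentences):
    """Count how often each tag occurs between a (previous, next) tag pair."""
    context_counts = defaultdict(Counter)
    for tags in tagged_sentences:
        padded = [BOUNDARY] + list(tags) + [BOUNDARY]
        for i in range(1, len(padded) - 1):
            context = (padded[i - 1], padded[i + 1])
            context_counts[context][padded[i]] += 1
    return context_counts


def fill_unknown_tags(tags, context_counts):
    """Replace UNKNOWN tags with the most frequent tag seen in the same context.

    Only the surrounding tags are used; the tokens themselves are never read.
    A tag stays UNKNOWN if its context was not observed during training.
    """
    padded = [BOUNDARY] + list(tags) + [BOUNDARY]
    result = list(tags)
    for i in range(1, len(padded) - 1):
        if padded[i] != UNKNOWN:
            continue
        candidates = context_counts.get((padded[i - 1], padded[i + 1]))
        if candidates:
            result[i - 1] = candidates.most_common(1)[0][0]
    return result


# Toy tag sequences standing in for a fully tagged training corpus.
training = [
    ["DET", "NOUN", "VERB", "DET", "NOUN"],
    ["PRON", "VERB", "DET", "ADJ", "NOUN"],
    ["DET", "ADJ", "NOUN", "VERB", "ADV"],
]
model = train_context_model(training)

# The second token was left untagged by the (hypothetical) initial tagger.
filled = fill_unknown_tags(["PRON", "UNK", "DET", "NOUN"], model)
print(filled)  # ['PRON', 'VERB', 'DET', 'NOUN']
```

In this sketch a token whose neighbor is also untagged keeps its `UNK` label, because the context pair was never observed. A probabilistic model such as the Bayesian network described in the abstract could, in principle, handle such cases more gracefully than raw counts.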