{"title":"An auxiliary Part‐of‐Speech tagger for blog and microblog cyber‐slang","authors":"Silvia Golia, Paola Zola","doi":"10.1002/sam.11596","DOIUrl":null,"url":null,"abstract":"The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limit the performance of traditional natural language processing models. The state‐of‐the‐art Part‐of‐Speech (POS) taggers are often unable to assign a meaningful POS tag to all the words in a Web 2.0 text. To solve this limitation, we are proposing an auxiliary POS tagger that assigns the POS tag to a given token based on the information deriving from a sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is its ability to overcome the need of tokens' information since it only relies on the sequences of existing POS tags. This tagger is called auxiliary because it requires an initial POS tagging procedure that might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed by Twitter messages, and on a corpus of manually labeled Web 2.0 data.","PeriodicalId":342679,"journal":{"name":"Statistical Analysis and Data Mining: The ASA Data Science Journal","volume":"27 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Statistical Analysis and Data Mining: The ASA Data Science Journal","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1002/sam.11596","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
The increasing impact of Web 2.0 involves a growing usage of slang, abbreviations, and emphasized words, which limit the performance of traditional natural language processing models. The state‐of‐the‐art Part‐of‐Speech (POS) taggers are often unable to assign a meaningful POS tag to all the words in a Web 2.0 text. To solve this limitation, we are proposing an auxiliary POS tagger that assigns the POS tag to a given token based on the information deriving from a sequence of preceding and following POS tags. The main advantage of the proposed auxiliary POS tagger is its ability to overcome the need of tokens' information since it only relies on the sequences of existing POS tags. This tagger is called auxiliary because it requires an initial POS tagging procedure that might be performed using online dictionaries (e.g., Wikidictionary) or other POS tagging algorithms. The auxiliary POS tagger relies on a Bayesian network that uses information about preceding and following POS tags. It was evaluated on the Brown Corpus, which is a general linguistics corpus, on the modern ARK dataset composed by Twitter messages, and on a corpus of manually labeled Web 2.0 data.
Web 2.0的影响越来越大,俚语、缩写和强调词的使用越来越多,这限制了传统自然语言处理模型的性能。最先进的词性标注器通常无法为Web 2.0文本中的所有单词分配有意义的词性标注。为了解决这个限制,我们提出了一个辅助POS标记器,它根据从前面和后面的POS标记序列派生的信息将POS标记分配给给定的令牌。所提出的辅助POS标记器的主要优点是它能够克服对令牌信息的需求,因为它只依赖于现有POS标记的序列。这个标注器被称为辅助标注器,因为它需要一个初始的词性标注过程,这个过程可以使用在线字典(例如,Wikidictionary)或其他词性标注算法来执行。辅助POS标记器依赖于使用前后POS标记信息的贝叶斯网络。在Brown语料库(一个通用语言学语料库)、由Twitter消息组成的现代ARK数据集以及人工标记的Web 2.0数据语料库上对它进行了评估。