{"title":"从微博中挖掘词汇变体:一种无监督的多语言方法","authors":"Alejandro Mosquera, Paloma Moreda Pozo","doi":"10.3115/v1/W14-1301","DOIUrl":null,"url":null,"abstract":"User-generated content has become a recurrent resource for NLP tools and applications, hence many efforts have been made lately in order to handle the noise present in short social media texts. The use of normalisation techniques has been proven useful for identifying and replacing lexical variants on some of the most informal genres such as microblogs. But annotated data is needed in order to train and evaluate these systems, which usually involves a costly process. Until now, most of these approaches have been focused on English and they were not taking into account demographic variables such as the user location and gender. In this paper we describe the methodology used for automatically mining a corpus of variant and normalisation pairs from English and Spanish tweets.","PeriodicalId":178301,"journal":{"name":"Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)","volume":"4 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-04-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Mining Lexical Variants from Microblogs: An Unsupervised Multilingual Approach\",\"authors\":\"Alejandro Mosquera, Paloma Moreda Pozo\",\"doi\":\"10.3115/v1/W14-1301\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"User-generated content has become a recurrent resource for NLP tools and applications, hence many efforts have been made lately in order to handle the noise present in short social media texts. The use of normalisation techniques has been proven useful for identifying and replacing lexical variants on some of the most informal genres such as microblogs. But annotated data is needed in order to train and evaluate these systems, which usually involves a costly process. Until now, most of these approaches have been focused on English and they were not taking into account demographic variables such as the user location and gender. In this paper we describe the methodology used for automatically mining a corpus of variant and normalisation pairs from English and Spanish tweets.\",\"PeriodicalId\":178301,\"journal\":{\"name\":\"Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)\",\"volume\":\"4 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-04-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.3115/v1/W14-1301\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 5th Workshop on Language Analysis for Social Media (LASM)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.3115/v1/W14-1301","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Mining Lexical Variants from Microblogs: An Unsupervised Multilingual Approach
User-generated content has become a recurrent resource for NLP tools and applications, hence many efforts have been made lately in order to handle the noise present in short social media texts. The use of normalisation techniques has been proven useful for identifying and replacing lexical variants on some of the most informal genres such as microblogs. But annotated data is needed in order to train and evaluate these systems, which usually involves a costly process. Until now, most of these approaches have been focused on English and they were not taking into account demographic variables such as the user location and gender. In this paper we describe the methodology used for automatically mining a corpus of variant and normalisation pairs from English and Spanish tweets.