{"title":"荷兰语推文的可变性。对偏差词标记的比例的估计","authors":"H. V. Halteren, N. Oostdijk","doi":"10.21248/jlcl.29.2014.191","DOIUrl":null,"url":null,"abstract":"In this paper, we attempt to estimate which proportion of the word tokens in Dutch tweets are not covered by standard resources and can therefore be expected to cause problems for standard NLP applications. We fully annotated and analysed a small pilot corpus. We also used the corpus to calibrate automatic estimation procedures for proportions of non-word tokens and of out-of-vocabulary words, after which we applied these procedures to about 2 billion Dutch tweets. We find that the proportion of possibly problematic tokens is so high (e.g. an estimate of 15% of the words being problematic in the full tweet collection, and the annotated sample with death-threat-related tweets showing problematic words in three out of four tweets) that any NLP application designed/created for standard Dutch can be expected to be seriously hampered in its processing. We suggest a few approaches to alleviate the problem, but none of them will solve the problem completely.","PeriodicalId":402489,"journal":{"name":"J. Lang. Technol. Comput. Linguistics","volume":"5 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2014-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"5","resultStr":"{\"title\":\"Variability in Dutch Tweets. An estimate of the proportion of deviant word tokens\",\"authors\":\"H. V. Halteren, N. Oostdijk\",\"doi\":\"10.21248/jlcl.29.2014.191\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we attempt to estimate which proportion of the word tokens in Dutch tweets are not covered by standard resources and can therefore be expected to cause problems for standard NLP applications. We fully annotated and analysed a small pilot corpus. We also used the corpus to calibrate automatic estimation procedures for proportions of non-word tokens and of out-of-vocabulary words, after which we applied these procedures to about 2 billion Dutch tweets. We find that the proportion of possibly problematic tokens is so high (e.g. an estimate of 15% of the words being problematic in the full tweet collection, and the annotated sample with death-threat-related tweets showing problematic words in three out of four tweets) that any NLP application designed/created for standard Dutch can be expected to be seriously hampered in its processing. We suggest a few approaches to alleviate the problem, but none of them will solve the problem completely.\",\"PeriodicalId\":402489,\"journal\":{\"name\":\"J. Lang. Technol. Comput. Linguistics\",\"volume\":\"5 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2014-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"5\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"J. Lang. Technol. Comput. Linguistics\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.21248/jlcl.29.2014.191\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"J. Lang. Technol. Comput. 
Linguistics","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.21248/jlcl.29.2014.191","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Variability in Dutch Tweets. An estimate of the proportion of deviant word tokens
In this paper, we attempt to estimate what proportion of the word tokens in Dutch tweets is not covered by standard resources and can therefore be expected to cause problems for standard NLP applications. We fully annotated and analysed a small pilot corpus. We also used the corpus to calibrate automatic estimation procedures for the proportions of non-word tokens and of out-of-vocabulary words, after which we applied these procedures to about 2 billion Dutch tweets. We find that the proportion of potentially problematic tokens is so high that any NLP application designed for standard Dutch can be expected to be seriously hampered in its processing: we estimate that 15% of the words in the full tweet collection are problematic, and in the annotated sample of death-threat-related tweets, three out of four tweets contain problematic words. We suggest a few approaches to alleviate the problem, but none of them solves it completely.
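The abstract does not spell out the estimation procedure, but the core quantity is simple to illustrate. The following Python sketch computes the proportion of word tokens absent from a standard word list; the lexicon file name, the tokenization, and the case normalization are all simplifying assumptions for illustration, not the calibrated procedure the authors describe.

```python
# Hypothetical sketch: estimate the proportion of out-of-vocabulary (OOV)
# word tokens in a tweet collection against a standard Dutch word list.
# The lexicon path, tokenizer, and lower-casing are illustrative
# assumptions, not the paper's calibrated estimation procedure.
import re

def load_lexicon(path: str) -> set[str]:
    """Read a standard word list, one entry per line, lower-cased."""
    with open(path, encoding="utf-8") as f:
        return {line.strip().lower() for line in f if line.strip()}

# Alphabetic tokens only: non-word tokens (URLs, numbers, emoticons)
# are excluded here, whereas the paper estimates them separately.
WORD_RE = re.compile(r"[^\W\d_]+")

def oov_proportion(tweets: list[str], lexicon: set[str]) -> float:
    """Fraction of alphabetic word tokens not found in the lexicon."""
    total = oov = 0
    for tweet in tweets:
        for token in WORD_RE.findall(tweet.lower()):
            total += 1
            if token not in lexicon:
                oov += 1
    return oov / total if total else 0.0

if __name__ == "__main__":
    lexicon = load_lexicon("dutch_lexicon.txt")  # hypothetical word list
    sample = ["Vandaag lekker weer!", "egt nie normaal dit haha"]
    print(f"Estimated OOV proportion: {oov_proportion(sample, lexicon):.1%}")
```

A raw count like this will overestimate the problem (proper names and productive compounds are OOV but unproblematic) and underestimate it elsewhere (in-vocabulary forms used deviantly), which is presumably why the authors calibrate their automatic estimates against a fully annotated pilot corpus.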