Nagesh Bhattu Sristy, N. S. Krishna, B. S. Krishna, V. Ravi
Language Identification in Mixed Script
Proceedings of the 9th Annual Meeting of the Forum for Information Retrieval Evaluation, 2017-12-08. https://doi.org/10.1145/3158354.3158357
Text exchanged in social media conversations is often noisy, mixing stylistic and misspelt variants of the original words. Standard NLP techniques such as POS tagging and named entity recognition suffer when applied to such data because of the noisy input. Mixed-script text is also prevalent among social media users. This work addresses word-level language identification in mixed-script scenarios, where all the text is written in Roman script but the words used are transliterations of native-language words into English. The core of the problem is identifying the language, from among a set of languages, by looking at small fragments of text. We propose a two-stage approach to word-level language identification. In the first stage, the combination of languages being mixed is identified using character n-grams of the sentence. The second stage uses the mixing-combination class predicted in the first stage to perform the word-level language identification. We further apply Conditional Random Fields (CRF) in the second stage to improve word-level identification performance. The two-stage simplification is essential: otherwise the number of states in the model would be huge and the resulting predictions very noisy. Our methods improve the F-score of word-level language identification by over 10% compared to the baseline.
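The two-stage idea can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes a simple naive Bayes classifier over character n-grams for both stages, tags each word independently instead of using a CRF, and relies on a hypothetical toy data set of Romanized English-Hindi and English-Telugu text. All names (`char_ngrams`, `NgramNaiveBayes`, `tag_sentence`) and the `en-hi` style combination labels are illustrative assumptions.

```python
import math
from collections import Counter, defaultdict


def char_ngrams(text, n_min=1, n_max=3):
    """Character n-grams of `text`, with ^ and $ marking the boundaries."""
    padded = f"^{text.lower()}$"
    return [
        padded[i:i + n]
        for n in range(n_min, n_max + 1)
        for i in range(len(padded) - n + 1)
    ]


class NgramNaiveBayes:
    """Multinomial naive Bayes over character n-grams, add-one smoothing."""

    def __init__(self):
        self.gram_counts = defaultdict(Counter)  # label -> n-gram counts
        self.label_counts = Counter()            # label -> number of examples
        self.vocab = set()

    def fit(self, texts, labels):
        for text, label in zip(texts, labels):
            grams = char_ngrams(text)
            self.gram_counts[label].update(grams)
            self.label_counts[label] += 1
            self.vocab.update(grams)
        return self

    def log_score(self, text, label):
        total = sum(self.gram_counts[label].values())
        prior = self.label_counts[label] / sum(self.label_counts.values())
        score = math.log(prior)
        for gram in char_ngrams(text):
            count = self.gram_counts[label][gram]
            score += math.log((count + 1) / (total + len(self.vocab)))
        return score

    def predict(self, text, allowed=None):
        """Best-scoring label; `allowed` optionally restricts the candidates."""
        candidates = [
            label for label in self.label_counts
            if allowed is None or label in allowed
        ]
        return max(candidates, key=lambda label: self.log_score(text, label))


# Hypothetical toy data, invented for illustration (not from the paper).
SENTENCES = [
    ("movie bahut accha hai", "en-hi"),
    ("kya yeh good nahi hai", "en-hi"),
    ("movie chala bagundi", "en-te"),
    ("meeru ela unnaru friend", "en-te"),
]
WORDS = [
    ("movie", "en"), ("good", "en"), ("friend", "en"), ("very", "en"),
    ("bahut", "hi"), ("accha", "hi"), ("hai", "hi"), ("nahi", "hi"),
    ("chala", "te"), ("bagundi", "te"), ("ela", "te"), ("unnaru", "te"),
]

# Stage 1 model: sentence -> mixing combination (e.g. "en-hi").
sentence_model = NgramNaiveBayes().fit(*zip(*SENTENCES))
# Stage 2 model: word -> language, later restricted by the stage 1 output.
word_model = NgramNaiveBayes().fit(*zip(*WORDS))


def tag_sentence(sentence):
    """Predict the mixing combination, then tag each word within it."""
    combination = sentence_model.predict(sentence)
    allowed = set(combination.split("-"))
    tags = [(word, word_model.predict(word, allowed)) for word in sentence.split()]
    return combination, tags


if __name__ == "__main__":
    print(tag_sentence("movie bahut accha hai"))
```

Restricting the second stage to the languages in the predicted combination keeps the label set per sentence small, which is the simplification the abstract describes. A sequence model such as a CRF would replace the independent per-word `predict` call in `tag_sentence`.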