J. Pailai, R. Kongkachandra, T. Supnithi, P. Boonkwan
2013 10th International Conference on Electrical Engineering/Electronics, Computer, Telecommunications and Information Technology. Published 2013-07-18. DOI: 10.1109/ECTICON.2013.6559527 (https://doi.org/10.1109/ECTICON.2013.6559527)
A comparative study on different techniques for Thai part-of-speech tagging
Natural language processing (NLP) for Thai is difficult to apply in real tasks because Thai sentences have a complex sequential structure. Part-of-speech (POS) tagging can improve the accuracy of syntactic analysis and thereby support improvements in many NLP tasks. We compare two supervised machine learning methods for annotating Thai text with POS tags: the Support Vector Machine (SVM) and Conditional Random Fields (CRFs). Our experiments use the BEST 2012 News and Entertainments corpus. Because the sequential character of Thai is the point of interest, we use it as the basis for the features in our training set: forward 3-gram, backward 3-gram, and 5-gram. The best accuracy in our experiments is 93.638%, obtained by SVM POS tagging trained on the words of the forward 3-gram with ten thousand tokens of training data. With the same training data, the best CRF accuracy is very close to that of the SVM: 93.254%, obtained when the learning form is the word with POS of the 5-gram.
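The abstract names three sequential feature windows but does not define them. The sketch below shows one plausible reading, offered only as an illustration: it assumes the forward 3-gram is the current word plus the next two, the backward 3-gram is the current word plus the previous two, and the 5-gram is the current word with two words on each side. The function names, the `<PAD>` symbol, and the feature keys are hypothetical and are not taken from the paper. The "word with POS" learning form is not covered here.

```python
from typing import Dict, List

PAD = "<PAD>"  # hypothetical padding symbol for positions outside the sentence


def window_features(words: List[str], i: int, left: int, right: int) -> Dict[str, str]:
    """Collect the words at positions i-left .. i+right as a feature dict."""
    feats = {}
    for offset in range(-left, right + 1):
        j = i + offset
        # Positions that fall outside the sentence are filled with PAD.
        feats[f"w[{offset:+d}]"] = words[j] if 0 <= j < len(words) else PAD
    return feats


def forward_3gram(words: List[str], i: int) -> Dict[str, str]:
    """Assumed reading: current word and the two following words."""
    return window_features(words, i, left=0, right=2)


def backward_3gram(words: List[str], i: int) -> Dict[str, str]:
    """Assumed reading: current word and the two preceding words."""
    return window_features(words, i, left=2, right=0)


def five_gram(words: List[str], i: int) -> Dict[str, str]:
    """Assumed reading: current word with two words of context on each side."""
    return window_features(words, i, left=2, right=2)


def token_accuracy(gold: List[str], pred: List[str]) -> float:
    """Token-level tagging accuracy as a percentage."""
    if not gold or len(gold) != len(pred):
        raise ValueError("gold and pred must be non-empty and of equal length")
    correct = sum(g == p for g, p in zip(gold, pred))
    return 100.0 * correct / len(gold)
```

Each feature dict could then be passed to a dictionary vectorizer for an SVM, or used directly as per-token attributes for a CRF toolkit. Which toolkits the authors used is not stated in the abstract.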