A Comparative Study on Various Deep Learning Techniques for Thai NLP Lexical and Syntactic Tasks on Noisy Data
Amarin Jettakul, Chavisa Thamjarat, Kawin Liaowongphuthorn, Can Udomcharoenchaikit, P. Vateekul, P. Boonkwan
2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE), published 2018-07-01
DOI: https://doi.org/10.1109/JCSSE.2018.8457368
Citations: 12
Abstract
Natural Language Processing (NLP) has three fundamental tasks: tokenization, at the lexical level, and Part-of-Speech (POS) tagging and Named-Entity Recognition (NER), at the syntactic level. Deep learning has recently proven successful in many domains, yet no comparative study has suggested the most suitable technique for each of these tasks in Thai NLP. In this paper, we compare the performance of various deep learning-based techniques on the three tasks and study the effect of synthesized out-of-vocabulary (OOV) words. We also provide an OOV handling algorithm based on Levenshtein distance, because most existing works rely on the vocabulary of the trained model and are therefore not suited to the noisy text found in real use cases. Our three experiments were conducted on BEST 2010 I2R, a standard Thai NLP corpus, and evaluated by F1 score with different percentages of synthesized noise. For tokenization, the results show that Synthai, a joint bidirectional LSTM, performs best. For POS tagging, a bidirectional LSTM with CRF performs best. For NER, a variational bidirectional LSTM with CRF outperforms the other methods. Finally, noise reduces the performance of all algorithms on these foundational tasks, and the results show that our OOV handling technique can improve performance on noisy data.
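The abstract does not spell out how the Levenshtein-based OOV handling works, so the following is only a minimal sketch of one plausible reading: an unseen token is replaced by its nearest in-vocabulary word when the edit distance is small enough. The function names `levenshtein` and `map_oov`, the `max_distance` threshold, and the tie-breaking rule are all assumptions made for illustration, not the authors' implementation.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertion, deletion, and substitution."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute or match
            ))
        previous = current
    return previous[-1]


def map_oov(token: str, vocabulary: set[str], max_distance: int = 2) -> str:
    """Hypothetical OOV handler: map an unseen token to its nearest known word.

    Assumption: tokens farther than max_distance from every vocabulary
    entry are returned unchanged, leaving the model to treat them as unknown.
    """
    if token in vocabulary or not vocabulary:
        return token
    # Ties are broken alphabetically so the result is deterministic.
    best = min(vocabulary, key=lambda word: (levenshtein(token, word), word))
    return best if levenshtein(token, best) <= max_distance else token
```

Python strings are compared code point by code point, so a Thai vowel or tone mark counts as one edit of its own. The linear scan over the vocabulary is fine for a sketch; a large vocabulary would need an index such as a BK-tree.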