Amarin Jettakul, Chavisa Thamjarat, Kawin Liaowongphuthorn, Can Udomcharoenchaikit, P. Vateekul, P. Boonkwan
{"title":"不同深度学习技术在泰国语NLP有噪声数据下词法和句法任务的比较研究","authors":"Amarin Jettakul, Chavisa Thamjarat, Kawin Liaowongphuthorn, Can Udomcharoenchaikit, P. Vateekul, P. Boonkwan","doi":"10.1109/JCSSE.2018.8457368","DOIUrl":null,"url":null,"abstract":"In Natural Language Processing (NLP), there are three fundamental tasks of NLP which are Tokenization being a part of a lexical level, Part-of-Speech tagging (POS) and Named-Entity-Recognition (NER) being parts of a syntactic level. Recently, there have been many deep learning researches showing their success in many domains. However, there has been no comparative study for Thai NLP to suggest the most suitable technique for each task yet. In this paper, we aim to provide a performance comparison among various deep learning-based techniques on three NLP tasks, and study the effect on synthesized OOV words and the OOV handling algorithm with Levenshtein distance had been provided due to the fact that most existing works relied on a set of vocabularies in the trained model and not being fit for noisy text in the real use case. Our three experiments were conducted on BEST 2010 I2R, a standard Thai NLP corpus on F1 measurement, with the different percentage of noises having been synthesized. Firstly, for Tokenization, the result shows that Synthai, a jointed bidirectional LSTM, has the best performance. Additionally, for POS, bi-directional LSTM with CRF has obtained the best performance. For NER, variational bi-directional LSTM with CRF has outperformed other methods. Finally, the effect of noises reduces the performance of all algorithms on these foundation tasks and the result shows that our OOV handling technique could improve the performance on noisy data.","PeriodicalId":338973,"journal":{"name":"2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE)","volume":"37 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2018-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"12","resultStr":"{\"title\":\"A Comparative Study on Various Deep Learning Techniques for Thai NLP Lexical and Syntactic Tasks on Noisy Data\",\"authors\":\"Amarin Jettakul, Chavisa Thamjarat, Kawin Liaowongphuthorn, Can Udomcharoenchaikit, P. Vateekul, P. Boonkwan\",\"doi\":\"10.1109/JCSSE.2018.8457368\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In Natural Language Processing (NLP), there are three fundamental tasks of NLP which are Tokenization being a part of a lexical level, Part-of-Speech tagging (POS) and Named-Entity-Recognition (NER) being parts of a syntactic level. Recently, there have been many deep learning researches showing their success in many domains. However, there has been no comparative study for Thai NLP to suggest the most suitable technique for each task yet. In this paper, we aim to provide a performance comparison among various deep learning-based techniques on three NLP tasks, and study the effect on synthesized OOV words and the OOV handling algorithm with Levenshtein distance had been provided due to the fact that most existing works relied on a set of vocabularies in the trained model and not being fit for noisy text in the real use case. Our three experiments were conducted on BEST 2010 I2R, a standard Thai NLP corpus on F1 measurement, with the different percentage of noises having been synthesized. Firstly, for Tokenization, the result shows that Synthai, a jointed bidirectional LSTM, has the best performance. 
Additionally, for POS, bi-directional LSTM with CRF has obtained the best performance. For NER, variational bi-directional LSTM with CRF has outperformed other methods. Finally, the effect of noises reduces the performance of all algorithms on these foundation tasks and the result shows that our OOV handling technique could improve the performance on noisy data.\",\"PeriodicalId\":338973,\"journal\":{\"name\":\"2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"volume\":\"37 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2018-07-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"12\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/JCSSE.2018.8457368\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCSSE.2018.8457368","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
A Comparative Study on Various Deep Learning Techniques for Thai NLP Lexical and Syntactic Tasks on Noisy Data
Natural Language Processing (NLP) has three fundamental tasks: Tokenization, which operates at the lexical level, and Part-of-Speech tagging (POS) and Named-Entity Recognition (NER), which operate at the syntactic level. Recently, deep learning research has shown success in many domains. However, there has been no comparative study for Thai NLP that suggests the most suitable technique for each task. In this paper, we provide a performance comparison among various deep learning-based techniques on these three NLP tasks and study the effect of synthesized out-of-vocabulary (OOV) words; we also provide an OOV handling algorithm based on Levenshtein distance, because most existing works rely on the vocabulary seen during training and are therefore not well suited to noisy text in real use cases. Our three experiments were conducted on BEST 2010 I2R, a standard Thai NLP corpus, evaluated by F1 score, with different percentages of synthesized noise. For Tokenization, the results show that Synthai, a joint bidirectional LSTM, performs best. For POS, a bidirectional LSTM with CRF obtains the best performance. For NER, a variational bidirectional LSTM with CRF outperforms the other methods. Finally, noise reduces the performance of all algorithms on these foundational tasks, and the results show that our OOV handling technique can improve performance on noisy data.
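The abstract describes handling OOV words by mapping them to known vocabulary via Levenshtein distance. The snippet below is a minimal sketch of that general idea, not the authors' implementation; the vocabulary, distance threshold, and function names are illustrative assumptions.

```python
# Sketch (assumption, not the paper's code): replace each out-of-vocabulary
# token with its nearest in-vocabulary word by Levenshtein (edit) distance
# before passing the text to the trained model.

def levenshtein(a: str, b: str) -> int:
    """Compute the Levenshtein (edit) distance between strings a and b."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def handle_oov(token: str, vocabulary: set, max_distance: int = 2) -> str:
    """Map an OOV token to the closest known word, if one is close enough."""
    if token in vocabulary:
        return token
    best_word, best_dist = token, max_distance + 1
    for word in vocabulary:
        d = levenshtein(token, word)
        if d < best_dist:
            best_word, best_dist = word, d
    return best_word if best_dist <= max_distance else token

# Example: a noisy (misspelled) token is mapped back to an in-vocabulary form.
vocab = {"tokenization", "tagging", "recognition"}
print(handle_oov("tokenizatoin", vocab))  # -> "tokenization"
```

In practice, the distance threshold and the search over the vocabulary (here a brute-force scan) would be tuned to the corpus; the sketch only illustrates how edit distance can normalize noisy tokens before inference.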