A Comparative Study on Various Deep Learning Techniques for Thai NLP Lexical and Syntactic Tasks on Noisy Data

Amarin Jettakul, Chavisa Thamjarat, Kawin Liaowongphuthorn, Can Udomcharoenchaikit, P. Vateekul, P. Boonkwan
Published in: 2018 15th International Joint Conference on Computer Science and Software Engineering (JCSSE), 2018-07-01
DOI: 10.1109/JCSSE.2018.8457368 (https://doi.org/10.1109/JCSSE.2018.8457368)
Citations: 12

Abstract

Natural Language Processing (NLP) has three fundamental tasks: tokenization, at the lexical level, and Part-of-Speech (POS) tagging and Named-Entity Recognition (NER), at the syntactic level. Deep learning has recently been shown to succeed in many domains, yet no comparative study has suggested the most suitable technique for each of these tasks in Thai NLP. In this paper, we compare the performance of various deep learning-based techniques on the three tasks and study the effect of synthesized out-of-vocabulary (OOV) words. We also provide an OOV handling algorithm based on Levenshtein distance, because most existing work relies on the vocabulary of the trained model and is therefore ill-suited to noisy text in real use. Our three experiments were conducted on BEST 2010 I2R, a standard Thai NLP corpus, using the F1 measure, with noise synthesized at different percentages. For tokenization, Synthai, a joint bidirectional LSTM, performs best. For POS tagging, a bidirectional LSTM with CRF performs best. For NER, a variational bidirectional LSTM with CRF outperforms the other methods. Finally, noise reduces the performance of all algorithms on these foundational tasks, and the results show that our OOV handling technique can improve performance on noisy data.
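
The abstract does not describe how the Levenshtein-based OOV handling works in detail. As a rough illustration only, one plausible reading is that an OOV word is replaced by the closest in-vocabulary word when the edit distance is small enough. The sketch below makes that assumption; the function names `levenshtein` and `nearest_in_vocab` and the `max_distance` threshold are hypothetical and do not come from the paper.

```python
from typing import Optional, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # delete ch_a
                current[j - 1] + 1,      # insert ch_b
                previous[j - 1] + cost,  # substitute (or match)
            ))
        previous = current
    return previous[-1]


def nearest_in_vocab(word: str, vocab: Sequence[str],
                     max_distance: int = 2) -> Optional[str]:
    """Return the closest vocabulary word within max_distance, else None.

    Assumption for illustration: ties go to the earliest word in vocab.
    """
    if word in vocab:
        return word
    best, best_distance = None, max_distance + 1
    for candidate in vocab:
        distance = levenshtein(word, candidate)
        if distance < best_distance:
            best, best_distance = candidate, distance
    return best
```

For instance, `nearest_in_vocab("ภาษาา", ["ภาษา", "ไทย"])` would map the noisy form back to `"ภาษา"`. Note that this sketch compares Unicode code points, so a Thai vowel or tone mark counts as its own character; whether the paper measures distance this way is not stated.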