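A review of Khmer word segmentation and part-of-speech tagging and an experimental study using bidirectional long short-term memory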

Sreyteav Sry, Amrudee Sukpan Nguyen
Journal: Ho Chi Minh City Open University Journal of Science Engineering and Technology
Published: 2022-04-20
Publication type: Journal Article
DOI: 10.46223/hcmcoujs.tech.en.12.1.2219.2022 (https://doi.org/10.46223/hcmcoujs.tech.en.12.1.2219.2022)
Citations: 0

Abstract

Large contiguous blocks of unsegmented Khmer text cause major problems for natural language processing applications such as machine translation, speech synthesis, and information extraction. Word segmentation and part-of-speech tagging are therefore two important preliminary tasks. Because Khmer does not always use explicit separators between words, the notion of a word is not a natural one. In such languages, tokenization and part-of-speech tagging are inseparable, because the definitions and principles adopted for one task unavoidably affect the other. This study reviews the approaches used for Khmer word segmentation and part-of-speech tagging and describes an experimental study that uses a single long short-term memory network. A dataset from the Asian Language Treebank is used to train and test the model. The preliminary experimental model achieved an accuracy of 95%. However, further testing is needed to evaluate the model and compare it with other models in order to select the most accurate one.
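
To make concrete why the two tasks can be handled by one network, the sketch below shows one common way to cast segmentation and tagging as a single per-character labeling problem: each character receives a label that combines a boundary marker (B for word-initial, I for word-internal) with the part-of-speech tag of its word. This is only an illustration under stated assumptions. The abstract does not specify the labeling scheme used in the study, the tag names (PRO, V, N) are hypothetical, and "character" here means a Unicode code point, whereas a real Khmer system might label orthographic clusters instead.

```python
from typing import List, Tuple

Word = Tuple[str, str]  # (surface form, part-of-speech tag)


def encode(words: List[Word]) -> Tuple[str, List[str]]:
    """Flatten tagged words into unsegmented text plus one joint label per character."""
    chars: List[str] = []
    labels: List[str] = []
    for surface, pos in words:
        for i, ch in enumerate(surface):
            chars.append(ch)
            # "B-" marks the first character of a word, "I-" every later one.
            labels.append(("B-" if i == 0 else "I-") + pos)
    return "".join(chars), labels


def decode(text: str, labels: List[str]) -> List[Word]:
    """Recover (word, tag) pairs from per-character joint labels."""
    if len(text) != len(labels):
        raise ValueError("text and labels must have the same length")
    words: List[Word] = []
    current, current_pos = "", ""
    for ch, label in zip(text, labels):
        prefix, pos = label.split("-", 1)
        if prefix == "B" or not current:
            # A new word starts here; flush the previous one, if any.
            if current:
                words.append((current, current_pos))
            current, current_pos = ch, pos
        else:
            current += ch
    if current:
        words.append((current, current_pos))
    return words


# Hypothetical example sentence ("I go to school") with illustrative tags.
sentence: List[Word] = [("ខ្ញុំ", "PRO"), ("ទៅ", "V"), ("សាលា", "N")]
text, labels = encode(sentence)
restored = decode(text, labels)
```

Under this encoding, a sequence model such as an LSTM only has to predict one label per character; word boundaries and tags both fall out of `decode`, which is why a change in the definition of a word directly changes the tagging targets as well.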