Improving Thai Word Segmentation using HMM: A Case Study of Sentiment Analysis

Thapani Hengsanankun, Atchara Namburi
Published in: 2020 24th International Computer Science and Engineering Conference (ICSEC), 2020-12-03
DOI: 10.1109/ICSEC51790.2020.9375142 (https://doi.org/10.1109/ICSEC51790.2020.9375142)
Citation count: 0

Abstract

Word segmentation is a basic problem in natural language processing for languages without explicit word-boundary delimiters, and for Thai in particular. Ambiguity in the word boundaries within a sentence is one of the significant problems: it can give rise to unknown words and lowers word segmentation accuracy. This paper presents an improved Thai word segmentation method that uses Hidden Markov Models (HMMs) to cope with the unknown-word problem. Five-state left-to-right HMMs are built according to the classes of unknown words, using the parts of speech of the Thai language as the observation symbols of the model. To determine the unknown words in a sentence, a string matching algorithm is first applied to find overlapping words and unknown words. The unknown words that are not identified by the lexical dictionary are then classified into their classes by the HMMs. Next, word-combining rules are applied to determine the proper word boundaries and to merge candidate characters into words. In addition, the sentiment analysis task of polarity detection was selected as a case study to verify the accuracy of the proposed method. Precision, recall, and F-measure are used to evaluate the proposed method. The empirical results show that both the segmented words and the polarity classification results obtained by the proposed method tend to outperform the existing methods.
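The abstract does not give the details of the string matching step or of the scoring, so the following is only a minimal sketch under stated assumptions, not the authors' implementation. It assumes greedy left-to-right longest matching against a lexicon, groups uncovered characters into one unknown span, and scores a segmentation by exact word-span overlap with a gold segmentation. The function names (`longest_match`, `word_spans`, `precision_recall_f`) and the toy lexicon are hypothetical. The HMM classification and the word-combining rules are not shown.

```python
from typing import List, Set, Tuple


def longest_match(text: str, lexicon: Set[str]) -> List[Tuple[str, bool]]:
    """Greedy left-to-right longest matching (an assumed variant).

    Returns (token, is_known) pairs. Characters that no lexicon entry
    covers are grouped into a single unknown span.
    """
    max_len = max((len(w) for w in lexicon), default=1)
    tokens: List[Tuple[str, bool]] = []
    unknown = ""
    i = 0
    while i < len(text):
        match = ""
        # Try the longest candidate first, then shrink.
        for size in range(min(max_len, len(text) - i), 0, -1):
            if text[i:i + size] in lexicon:
                match = text[i:i + size]
                break
        if match:
            if unknown:  # close the pending unknown span
                tokens.append((unknown, False))
                unknown = ""
            tokens.append((match, True))
            i += len(match)
        else:
            unknown += text[i]
            i += 1
    if unknown:
        tokens.append((unknown, False))
    return tokens


def word_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word sequence into a set of (start, end) character spans."""
    spans = set()
    start = 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f(pred: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Span-level precision, recall, and F-measure (F1) for a segmentation."""
    p_spans, g_spans = word_spans(pred), word_spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans) if p_spans else 0.0
    recall = correct / len(g_spans) if g_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


# Toy example: "แมว" is missing from the lexicon, so it is flagged as unknown.
toy_lexicon = {"ฉัน", "รัก"}
print(longest_match("ฉันรักแมว", toy_lexicon))
print(precision_recall_f(["ฉัน", "รักแมว"], ["ฉัน", "รัก", "แมว"]))
```

In this sketch the spans flagged `False` are the ones that would be handed to a later classifier; the abstract assigns that role to the HMMs.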