Improving Thai Word Segmentation using HMM: A Case Study of Sentiment Analysis
Thapani Hengsanankun, Atchara Namburi
2020 24th International Computer Science and Engineering Conference (ICSEC), published 2020-12-03
DOI: 10.1109/ICSEC51790.2020.9375142 (https://doi.org/10.1109/ICSEC51790.2020.9375142)
Citation count: 0
Abstract
Word segmentation is a fundamental problem in natural language processing for languages without explicit word delimiters, such as Thai. Ambiguous word boundaries within a sentence are a significant problem: they can give rise to unknown words and reduce segmentation accuracy. This paper presents an improved Thai word segmentation method that uses Hidden Markov Models (HMMs) to cope with the unknown-word problem. Five-state left-to-right HMMs are built according to the classes of unknown words, using the parts of speech of the Thai language as the observation symbols of the model. To locate unknown words in a sentence, a string matching algorithm is first applied to find overlapping words and unknown words. The unknown words that the lexical dictionary cannot identify are then classified by the HMMs. Next, word-combining rules are applied to determine the proper word boundaries and to merge candidate characters into words. The sentiment analysis task of polarity detection was selected as a case study to verify the accuracy of the proposed method, which is evaluated using precision, recall, and F-measure. The empirical results show that both the segmented words and the polarity classification results obtained by the proposed method tend to outperform those of existing methods.
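The abstract does not specify the string matching algorithm or the word-combining rules, so the following is only a minimal sketch under stated assumptions: it uses greedy longest matching against a toy lexicon (an assumption, not necessarily the authors' algorithm), groups characters that no lexicon entry covers into a single unknown token, and scores a segmentation with word-level precision, recall, and F-measure. The HMM classification step is not shown. The function names `longest_match_segment`, `to_spans`, and `precision_recall_f`, as well as the example lexicon and sentence, are hypothetical.

```python
from typing import List, Set, Tuple


def longest_match_segment(text: str, lexicon: Set[str]) -> List[Tuple[str, bool]]:
    """Greedy left-to-right longest matching (an assumed stand-in for the
    paper's string matching step). Returns (token, is_known) pairs; runs of
    characters not covered by any lexicon entry become one unknown token."""
    max_len = max((len(w) for w in lexicon), default=1)
    tokens: List[Tuple[str, bool]] = []
    unknown = ""
    i = 0
    while i < len(text):
        match = None
        # Try the longest candidate first, then shrink.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in lexicon:
                match = text[i:j]
                break
        if match is None:
            unknown += text[i]  # accumulate an unknown span
            i += 1
        else:
            if unknown:
                tokens.append((unknown, False))
                unknown = ""
            tokens.append((match, True))
            i += len(match)
    if unknown:
        tokens.append((unknown, False))
    return tokens


def to_spans(words: List[str]) -> Set[Tuple[int, int]]:
    """Convert a word list into a set of (start, end) character spans."""
    spans, start = set(), 0
    for w in words:
        spans.add((start, start + len(w)))
        start += len(w)
    return spans


def precision_recall_f(predicted: List[str], gold: List[str]) -> Tuple[float, float, float]:
    """Word-level precision, recall, and F-measure: a predicted word counts
    as correct only if both of its boundaries agree with the gold standard."""
    p_spans, g_spans = to_spans(predicted), to_spans(gold)
    correct = len(p_spans & g_spans)
    precision = correct / len(p_spans) if p_spans else 0.0
    recall = correct / len(g_spans) if g_spans else 0.0
    total = precision + recall
    f_measure = 2 * precision * recall / total if total else 0.0
    return precision, recall, f_measure


if __name__ == "__main__":
    # Toy lexicon and sentence (hypothetical); the last word is not in the lexicon.
    lexicon = {"ฉัน", "ชอบ", "กิน", "ข้าว"}
    tokens = longest_match_segment("ฉันชอบกินพิซซ่า", lexicon)
    print(tokens)
    print(precision_recall_f([t for t, _ in tokens], ["ฉัน", "ชอบ", "กิน", "พิซซ่า"]))
```

In a pipeline like the one the abstract describes, the tokens flagged as unknown would be the ones passed on to the HMM classifiers and the word-combining rules. Span-based scoring is used here so that a word is credited only when both of its boundaries are right.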