Named Entity Recognition Modeling for the Thai Language from a Disjointedly Labeled Corpus

2018 5th International Conference on Advanced Informatics: Concept Theory and Applications (ICAICTA). Pub Date: 2018-08-01. DOI: 10.1109/ICAICTA.2018.8541344

Kitiya Suriyachay, Virach Sornlertlamvanich

Citations: 3

Abstract

In Thai, a named entity can appear with or without a prefix or other indicator word, which can cause confusion between named entities and other types of nouns. However, a named entity is likely to be used adjacent to verbs or prepositions, so the verbs or prepositions adjacent to a noun can serve as a good feature for determining the type of named entity. Studies on named entity recognition (NER) in other languages, such as Indonesian, show that combining word embeddings with part-of-speech (POS) tags can improve the performance of an NER model. In this paper, we investigate Thai named entity recognition using a Bi-LSTM model with word embeddings and POS embeddings to deal with a relatively small and disjointedly labeled corpus. We compare our model with one that omits the POS tags and with a CRF baseline that uses a similar feature set. The experimental results show that our proposed model outperforms the other two on all F1-score measures. In particular, for the location file, the F1-score increases by 14 percent.
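
To make the input representation concrete, the following is a minimal sketch of how a word embedding and a POS embedding might be concatenated per token before being passed to a sequence model such as a Bi-LSTM. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`make_embedding_table`, `embed_sentence`), the embedding sizes, the tag names, and the English toy sentence are all hypothetical, and fixed random vectors stand in for learned embeddings. The Bi-LSTM itself is omitted.

```python
import random


def make_embedding_table(vocab, dim, seed):
    """Map each symbol to a fixed random vector (a stand-in for learned embeddings)."""
    rng = random.Random(seed)
    return {sym: [rng.uniform(-1.0, 1.0) for _ in range(dim)] for sym in vocab}


def embed_sentence(tokens, pos_tags, word_table, pos_table, unk="<unk>"):
    """Return one feature vector per token: word vector followed by POS vector."""
    if len(tokens) != len(pos_tags):
        raise ValueError("tokens and pos_tags must have the same length")
    rows = []
    for word, tag in zip(tokens, pos_tags):
        # Unknown words or tags fall back to the "<unk>" entry.
        word_vec = word_table.get(word, word_table[unk])
        pos_vec = pos_table.get(tag, pos_table[unk])
        rows.append(word_vec + pos_vec)  # list concatenation, not addition
    return rows


# Hypothetical toy setup: sizes and vocabulary are assumptions for illustration.
WORD_DIM, POS_DIM = 4, 2
word_table = make_embedding_table(
    ["<unk>", "he", "went", "to", "Bangkok"], WORD_DIM, seed=0
)
pos_table = make_embedding_table(
    ["<unk>", "PRON", "VERB", "PREP", "NOUN"], POS_DIM, seed=1
)

tokens = ["he", "went", "to", "Bangkok"]
pos_tags = ["PRON", "VERB", "PREP", "NOUN"]
features = embed_sentence(tokens, pos_tags, word_table, pos_table)
```

In this sketch each token yields a vector of length `WORD_DIM + POS_DIM`, so the POS tags of the neighbouring verb and preposition ("went", "to") are available to a bidirectional model when it labels the noun "Bangkok". The comparison model without POS tags would correspond to using only the word-vector part.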
