Extraction of Proper Names from Myanmar Text Using Latent Dirichlet Allocation

2016 Conference on Technologies and Applications of Artificial Intelligence (TAAI). Pub Date: 2016-11-01. DOI: 10.1109/TAAI.2016.7880176

Yuzana Win, Tomonari Masada


Citations: 0

Abstract



This paper proposes a method for extracting proper names from Myanmar text using latent Dirichlet allocation (LDA). The method aims to extract proper names that carry important information about the contents of Myanmar text, and it consists of two steps. In the first step, we extract topic words from Myanmar news articles using LDA. In the second step, we apply post-processing, because the resulting topic words contain some noisy words. The post-processing first eliminates the topic words whose prefixes are Myanmar digits and whose suffixes are noun and verb particles. We then remove duplicate words and discard the topic words that are contained in an existing dictionary. The remaining words are candidates for proper names, namely personal names, geographical names, unique object names, organization names, single event names, and so on. The evaluation is performed from both subjective and quantitative perspectives. From the subjective perspective, we compare the accuracy of the proper names extracted by our method with those extracted by latent semantic indexing (LSI) and a rule-based method. Both LSI and our method improve on the accuracy of the rule-based method; however, our method provides more interesting proper names than LSI. From the quantitative perspective, we use the extracted proper names as additional features in K-means clustering. The experimental results show that the document clusters given by our method are better than those given by LSI and the rule-based method in precision, recall, and F-score.
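The post-processing step described in the abstract can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `filter_topic_words`, the `particles` argument, and the `dictionary` argument are all hypothetical, and the abstract does not say whether the digit-prefix and particle-suffix conditions must hold together or separately, so the sketch assumes either one is enough to discard a word. The only fixed fact used is that Myanmar digits occupy the Unicode range U+1040 to U+1049.

```python
# Myanmar digits occupy U+1040 (zero) through U+1049 (nine).
MYANMAR_DIGITS = frozenset(chr(cp) for cp in range(0x1040, 0x104A))


def filter_topic_words(topic_words, particles, dictionary):
    """Return proper-name candidates from topic words, in first-seen order.

    topic_words: iterable of words taken from the LDA topics.
    particles:   noun/verb particle suffixes (caller-supplied; hypothetical list).
    dictionary:  set of known common words (caller-supplied; hypothetical).
    """
    particles = tuple(particles)
    seen = set()
    candidates = []
    for word in topic_words:
        if not word:
            continue
        # Assumption: a leading Myanmar digit alone marks the word as noise.
        if word[0] in MYANMAR_DIGITS:
            continue
        # Assumption: a trailing noun/verb particle alone marks it as noise.
        if particles and word.endswith(particles):
            continue
        # Remove duplicate words, keeping the first occurrence.
        if word in seen:
            continue
        seen.add(word)
        # Discard words already present in the existing dictionary.
        if word in dictionary:
            continue
        candidates.append(word)
    return candidates
```

In practice the particle list and the dictionary would come from Myanmar language resources; the abstract does not name them, so they are left as parameters here.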

