Extraction of proper names from Myanmar text using latent Dirichlet allocation

Yuzana Win, Tomonari Masada
2016 Conference on Technologies and Applications of Artificial Intelligence (TAAI)
Published: 2016-11-01
DOI: 10.1109/TAAI.2016.7880176 (https://doi.org/10.1109/TAAI.2016.7880176)
Citations: 0

Abstract

This paper proposes a method for extracting proper names from Myanmar text using latent Dirichlet allocation (LDA). The method aims to extract proper names that convey important information about the content of the text. It consists of two steps. In the first step, we extract topic words from Myanmar news articles using LDA. In the second step, we apply post-processing, because the resulting topic words contain some noisy words. The post-processing first eliminates topic words whose prefixes are Myanmar digits and whose suffixes are noun and verb particles. We then remove duplicate words and discard topic words that appear in an existing dictionary. The remaining words are candidate proper names: personal names, geographical names, unique object names, organization names, single event names, and so on. We evaluate the method from both subjective and quantitative perspectives. From the subjective perspective, we compare the accuracy of the proper names extracted by our method with those extracted by latent semantic indexing (LSI) and by a rule-based method. Both LSI and our method improve on the accuracy of the rule-based method; however, our method provides more interesting proper names than LSI. From the quantitative perspective, we use the extracted proper names as additional features in K-means clustering. The experimental results show that the document clusters produced with our method are better than those produced with LSI and the rule-based method in precision, recall, and F-score.
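
The post-processing step described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the function name `filter_topic_words`, the particle list `PARTICLE_SUFFIXES`, and the reading of the digit-prefix and particle-suffix conditions as two independent filters are all assumptions, since the abstract gives neither the particle list nor the exact rule. Only the Myanmar digit range (U+1040 to U+1049) is a fixed fact of Unicode.

```python
import re

# Myanmar digits occupy U+1040..U+1049 in Unicode.
MYANMAR_DIGIT_PREFIX = re.compile(r"^[\u1040-\u1049]")

# Hypothetical, non-exhaustive list of noun and verb particles, chosen for
# illustration only; the paper's actual list is not given in the abstract.
PARTICLE_SUFFIXES = ("သည်", "ကို", "မှာ", "တွင်", "များ", "ခဲ့")


def filter_topic_words(topic_words, dictionary, particle_suffixes=PARTICLE_SUFFIXES):
    """Return candidate proper names from LDA topic words.

    Assumption: a word is dropped if it starts with a Myanmar digit OR
    ends with a listed particle (the abstract leaves this ambiguous).
    """
    suffixes = tuple(particle_suffixes)
    seen = set()
    candidates = []
    for word in topic_words:
        if MYANMAR_DIGIT_PREFIX.match(word):
            continue  # prefix is a Myanmar digit
        if word.endswith(suffixes):
            continue  # suffix is a noun or verb particle
        if word in seen:
            continue  # duplicate topic word
        seen.add(word)
        if word in dictionary:
            continue  # ordinary word already in the existing dictionary
        candidates.append(word)
    return candidates
```

Calling `filter_topic_words(topic_words, dictionary)` with the topic words of all LDA topics concatenated and a set of dictionary entries returns the surviving words in their first-seen order. Keeping the order stable is a design choice made here so that words ranked highly within a topic stay near the front of the candidate list.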