A new method of finding provisional boundaries of "bunsetsu" using 2nd-order Markov model

Proceedings of 1993 2nd IEEE International Workshop on Robot and Human Communication. Pub Date: 1993-11-03. DOI: 10.1109/ROMAN.1993.367738

T. Araki, S. Ikehara, J. Tuchihase


Citations: 1

Abstract



Because Japanese sentences are usually written with thousands of different characters, especially "kanji" characters, it is not easy to input them into computer files. There has been much research on methods that translate non-segmented "kana" sentences into "kanji-kana" sentences. However, the amount of computer memory required for this translation processing explodes, because the number of combinations of candidate "kanji-kana" words grows rapidly as the length of the sentence increases. The memory explosion can be prevented if a sentence is separated into "bunsetsu". This paper proposes a new method of finding provisional boundaries of "bunsetsu" in non-segmented "kana" sentences using 2nd-order Markov chain probabilities. The "Relevance factor" P and "Recall factor" R for the provisional boundaries of "bunsetsu" determined by this method were evaluated by experiment using statistical data from 70 issues of a daily Japanese newspaper.
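The abstract does not state the decision rule or the exact definitions of P and R, so the following is only a minimal sketch under stated assumptions, not the authors' method. It assumes that a 2nd-order Markov model means character trigram statistics, that a provisional boundary is guessed wherever the conditional probability of a character given the two preceding characters falls below a threshold, and that P and R correspond to the usual precision and recall over boundary positions. The function names, the threshold value, and the romanized toy strings standing in for "kana" text are all hypothetical.

```python
from collections import Counter


def train_trigram(corpus):
    """Count character trigrams and their two-character contexts."""
    tri, ctx = Counter(), Counter()
    for sent in corpus:
        for i in range(2, len(sent)):
            tri[sent[i - 2:i + 1]] += 1   # c_{i-2} c_{i-1} c_i
            ctx[sent[i - 2:i]] += 1       # c_{i-2} c_{i-1}
    return tri, ctx


def provisional_boundaries(sent, tri, ctx, threshold=0.6):
    """Return the positions i where a boundary is guessed before sent[i].

    Assumed rule (not taken from the paper): a low estimate of
    P(c_i | c_{i-2}, c_{i-1}) suggests that c_i starts a new unit.
    """
    cuts = set()
    for i in range(2, len(sent)):
        seen = ctx[sent[i - 2:i]]
        prob = tri[sent[i - 2:i + 1]] / seen if seen else 0.0
        if prob < threshold:
            cuts.add(i)
    return cuts


def precision_recall(predicted, gold):
    """Precision and recall of predicted boundary positions against gold ones."""
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall


# Toy data: romanized strings standing in for non-segmented "kana" sentences.
tri, ctx = train_trigram(["abcabc", "abcxyz"])
cuts = provisional_boundaries("abcxyz", tri, ctx)
p, r = precision_recall(cuts, gold={3})
```

In this toy run the context "bc" is followed by "a" once and by "x" once, so the probability of "x" after "bc" is 0.5 and position 3 is the only guessed boundary. A real system would need smoothing for unseen contexts and a threshold tuned on held-out text.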
