剑鱼:一种基于无监督神经网络的形态学分析方法

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval Pub Date : 2006-08-06 DOI:10.1145/1148170.1148303

Christopher T. Jordan, J. Healy, Vlado Keselj

{"title":"剑鱼:一种基于无监督神经网络的形态学分析方法","authors":"Christopher T. Jordan, J. Healy, Vlado Keselj","doi":"10.1145/1148170.1148303","DOIUrl":null,"url":null,"abstract":"Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.","PeriodicalId":433366,"journal":{"name":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2006-08-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"7","resultStr":"{\"title\":\"Swordfish: an unsupervised Ngram based approach to morphological analysis\",\"authors\":\"Christopher T. Jordan, J. Healy, Vlado Keselj\",\"doi\":\"10.1145/1148170.1148303\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.\",\"PeriodicalId\":433366,\"journal\":{\"name\":\"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval\",\"volume\":\"1 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2006-08-06\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"7\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/1148170.1148303\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/1148170.1148303","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

引用次数: 7

摘要

从单词中提取语素是一项不平凡的任务。基于规则的词干提取方法，如波特的算法，已经取得了一些成功，但是它们受到识别有限数量词缀的能力的限制，并且依赖于语言。在处理具有许多词缀的语言时，基于规则的方法通常需要更多的规则来处理所有可能的单词形式。得出这些规则需要语言学家付出更大的努力，在某些情况下可能根本不切实际。我们提出了一种基于无监督图像图的方法，命名为“剑鱼”。使用语料库中的ngram概率，识别可能的语素。我们研究了识别候选语素的两种可能方法，一种是使用两个图之间的联合概率，另一种是基于前缀概率之间的对数赔率。初步结果表明，联合概率方法对英语更好，而前缀比例方法对芬兰语和土耳其语更好。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

Swordfish: an unsupervised Ngram based approach to morphological analysis

Extracting morphemes from words is a nontrivial task. Rule based stemming approaches such as Porter's algorithm have encountered some success, however they are restricted by their ability to identify a limited number of affixes and are language dependent. When dealing with languages with many affixes, rule based approaches generally require many more rules to deal with all the possible word forms. Deriving these rules requires a larger effort on the part of linguists and in some instances can be simply impractical. We propose an unsupervised ngram based approach, named Swordfish. Using ngram probabilities in the corpus, possible morphemes are identified. We look at two possible methods for identifying candidate morphemes, one using joint probabilities between two ngrams, and the second based on log odds between prefix probabilities. Initial results indicate the joint probability approach to be better for English while the prefix ratio approach is better for Finnish and Turkish.

求助全文

通过发布文献求助，成功后即可免费获取论文全文。去求助

来源期刊

Proceedings of the 29th annual international ACM SIGIR conference on Research and development in information retrieval

自引率

0.00%

发文量

期刊最新文献

Strict and vague interpretation of XML-retrieval queries AggregateRank: bringing order to web sites Text clustering with extended user feedback Improving personalized web search using result diversification High accuracy retrieval with multiple nested ranker