WonHee Yu, Kinam Park, Soonyoung Jung, Heuiseok Lim
{"title":"Acquiring Korean Lexical Entry from a Raw Corpus","authors":"WonHee Yu, Kinam Park, Soonyoung Jung, Heuiseok Lim","doi":"10.1109/ITCS.2010.5581289","DOIUrl":null,"url":null,"abstract":"This paper proposes a computational lexical entry acquisition model based on a representation model of the mental lexicon. The proposed model acquires lexical entries from a raw corpus by unsupervised learning like human. The model is composed of full-form and morpheme acquisition modules. In the full-from acquisition module, core full-forms are automatically acquired according to the frequency and recency thresholds. In the morpheme acquisition module, a repeatedly occurring substring in different full-forms is chosen as a candidate morpheme. Then, the candidate is corroborated as a morpheme by using the entropy measure of syllables in the string. The experimental results with a Korean corpus of which size is about 16 million full-forms show that the model successively acquires major full-forms and morphemes with the precision of 100% and 99.04%, respectively.","PeriodicalId":166169,"journal":{"name":"2010 2nd International Conference on Information Technology Convergence and Services","volume":"1 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2010-09-23","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2010 2nd International Conference on Information Technology Convergence and Services","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ITCS.2010.5581289","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0
Abstract
This paper proposes a computational lexical entry acquisition model based on a representation model of the mental lexicon. The proposed model acquires lexical entries from a raw corpus by unsupervised learning like human. The model is composed of full-form and morpheme acquisition modules. In the full-from acquisition module, core full-forms are automatically acquired according to the frequency and recency thresholds. In the morpheme acquisition module, a repeatedly occurring substring in different full-forms is chosen as a candidate morpheme. Then, the candidate is corroborated as a morpheme by using the entropy measure of syllables in the string. The experimental results with a Korean corpus of which size is about 16 million full-forms show that the model successively acquires major full-forms and morphemes with the precision of 100% and 99.04%, respectively.