Generation and Splitting of the Compound Words in Nepali Text

Multiscale multimodal medical imaging : Third International Workshop, MMMI 2022, held in conjunction with MICCAI 2022, Singapore, September 22, 2022, proceedings
Pub Date: 2022-09-19
DOI: 10.36548/jitdw.2022.3.007

Prabin Acharya, S. Shakya


Citations: 0

Abstract



In the Nepali language, compound word formation is mostly associated with inflection, derivation, and postposition attachment. Inflection occurs through suffixation, whereas derivation is driven by both prefixation and suffixation. Generating compound words by rule may produce many out-of-vocabulary words because lexical resources are limited and exceptions are numerous. A machine learning approach can therefore help to generate valid compounds and split them into valid morphemes, which can in turn serve as a resource for spelling suggestion, information retrieval, and machine translation. This research proposes a method to generate valid compounds from the corresponding compound splits (head word and prefix, suffix, or postposition). A BiLSTM-based deep learning approach was used to generate and split the valid compound words. Publicly available Nepali Brihat Shabdakosh data from the Nepal Academy and scraped news data were used for the experiments. The results were found to be considerably better than those of a rule-based approach applied to a similar task.
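The abstract does not say how compounds and their splits are represented for the BiLSTM. One common framing for the splitting direction, assumed here purely for illustration, is character-level boundary tagging: each code point of the compound is tagged `B` (begins a morpheme) or `I` (continues one), and a sequence model learns to predict the tags. The sketch below shows only that data encoding and its inverse. The function names `encode_split` and `decode_split` are hypothetical, and the example joins morphemes by plain concatenation, which ignores the sound changes and exceptions that the abstract identifies as the weakness of rule-based generation.

```python
from typing import List, Tuple


def encode_split(morphemes: List[str]) -> Tuple[str, List[str]]:
    """Concatenate morphemes and tag every code point as B or I.

    Assumption: the compound is the plain concatenation of its parts.
    Real Nepali compounds may change form at the boundary.
    """
    tags: List[str] = []
    for morpheme in morphemes:
        if not morpheme:
            raise ValueError("empty morpheme")
        # First code point opens the morpheme, the rest continue it.
        tags.extend(["B"] + ["I"] * (len(morpheme) - 1))
    return "".join(morphemes), tags


def decode_split(compound: str, tags: List[str]) -> List[str]:
    """Recover the morphemes from a compound and its B/I tags."""
    if len(compound) != len(tags):
        raise ValueError("compound and tags differ in length")
    pieces: List[str] = []
    for char, tag in zip(compound, tags):
        if tag == "B" or not pieces:
            pieces.append(char)   # start a new morpheme
        else:
            pieces[-1] += char    # extend the current morpheme
    return pieces


# Head word "घर" (house) plus the postposition "मा" (in).
compound, tags = encode_split(["घर", "मा"])
print(compound, tags)
print(decode_split(compound, tags))
```

Under this framing, a tagger's predicted tags can be passed directly to `decode_split` to obtain the proposed morphemes. Tagging at the code-point level keeps the example simple. Tagging at the level of Devanagari syllable clusters would be an equally plausible design choice.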

