Rule Based Approach for Word Normalization in Transliterated Search Queries

International Journal of Linguistics and Computational Applications. Pub Date: 2020-06-04. DOI: 10.30726/ijlca/v7.i2.2020.72002

Varsha M. Pathak, M. Joshi



Abstract

SMS-based information systems are a need of the age. Most present SMS-based information systems send one-way informative text messages generated from their respective knowledge systems. By applying information retrieval methodology with models such as the Vector Space Model, such systems can allow users to send queries according to their information needs. This makes the system more useful from the user's point of view. This paper concerns one such initiative for accessing relevant literature such as poems, phrases, rhymes, stories, abhang and much more. The mobile-based quick library access system MQuickLib allows users to access such literature by formulating transliterated queries. The Vector Space Model is used to create the system's knowledge base by processing the document terms, which are matched with the query terms while allowing for variation in spelling due to the users' transliteration styles. The matching score is assigned by devising a set of rules that determine the distance between two terms: d_k, the term from the document, and q_j, the query term. Levenshtein's original minimum edit distance algorithm is modified by applying this rule-based approach. The rules are identified by collecting SMS queries from users for a given set of known queries in Marathi (Devnagari). Experiments were carried out on a collection of Marathi and Hindi literature that mainly includes songs, gazals, powadas, bharud and other types. These documents are available in a standard transliteration form such as ITRANS (an Indic transliteration system). This paper elaborates the rule-based approach and analyses the results to select an appropriate rule-based model, which is then applied in the development of the MQuickLib system.
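The abstract does not list the rules themselves, so the following Python sketch only illustrates the general idea of modifying Levenshtein's edit distance with rules for transliteration variation. The rule table (`CHEAP_SUBSTITUTIONS`), the doubled-vowel rule, all cost values, and the function names are assumptions made for illustration; they are not the rules or scores derived in the paper.

```python
# Illustrative sketch only: the rules and costs below are hypothetical,
# not the rule set identified in the paper.

# Letter pairs assumed to be interchangeable in romanized spellings.
CHEAP_SUBSTITUTIONS = {
    frozenset(("v", "w")): 0.2,
    frozenset(("e", "i")): 0.5,
    frozenset(("o", "u")): 0.5,
}
VOWELS = set("aeiou")
DOUBLED_VOWEL_COST = 0.2  # assumed cost for "aa" vs "a" style variation


def substitution_cost(a, b):
    """Return 0 for equal letters, a reduced cost for rule pairs, else 1."""
    if a == b:
        return 0.0
    return CHEAP_SUBSTITUTIONS.get(frozenset((a, b)), 1.0)


def indel_cost(word, i):
    """Cost of inserting or deleting word[i]; cheap if it repeats a vowel."""
    ch = word[i]
    if ch in VOWELS and i > 0 and word[i - 1] == ch:
        return DOUBLED_VOWEL_COST
    return 1.0


def rule_based_distance(doc_term, query_term):
    """Weighted edit distance between a document term d_k and query term q_j."""
    d, q = doc_term.lower(), query_term.lower()
    rows, cols = len(d) + 1, len(q) + 1
    dist = [[0.0] * cols for _ in range(rows)]
    for i in range(1, rows):
        dist[i][0] = dist[i - 1][0] + indel_cost(d, i - 1)
    for j in range(1, cols):
        dist[0][j] = dist[0][j - 1] + indel_cost(q, j - 1)
    for i in range(1, rows):
        for j in range(1, cols):
            dist[i][j] = min(
                dist[i - 1][j] + indel_cost(d, i - 1),  # drop a letter of d
                dist[i][j - 1] + indel_cost(q, j - 1),  # drop a letter of q
                dist[i - 1][j - 1] + substitution_cost(d[i - 1], q[j - 1]),
            )
    return dist[-1][-1]


def best_match(query_term, vocabulary):
    """Return the vocabulary term with the smallest rule-based distance."""
    return min(vocabulary, key=lambda term: rule_based_distance(term, query_term))
```

With these assumed costs, spelling variants such as a doubled vowel or a v/w swap score much closer to the document term than they would under the unit-cost Levenshtein distance, which is the effect the rule-based modification aims for. A matching score for the Vector Space Model could then be derived from this distance, for example by normalizing it by term length, though the paper's actual scoring scheme is not given in the abstract.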



