Synonym Insensitive Searching: A Novel Synonym Weighted-Vector Space Model for Document Retrieval
Authors: Mumthaz Beegum M, A. S, Raveena Vijayan
Published in: 2023 2nd International Conference on Computational Systems and Communication (ICCSC)
Publication date: 2023-03-03
DOI: https://doi.org/10.1109/ICCSC56913.2023.10142977
Citation count: 0
Abstract
Document retrieval becomes challenging because natural languages can present the same content in different forms through synonyms, usages, and their complex combinations. Most existing information retrieval systems struggle to retrieve documents with a similar meaning and are useful only for finding documents that match the query keywords. Query expansion is a logically simple and straightforward technique for improving retrieval effectiveness in this setting. The existing statistical approach depends mainly on term frequency to generate candidate documents for the expanded or original query. Most existing works do not consider the ways in which the content of a particular document can be expressed differently while keeping the same context. This paper proposes a novel Synonym Weighted Vector Space Model (VSM) and a query expansion technique that together form a synonym-incorporated method for document retrieval. The combination of a modified Term Frequency - Inverse Document Frequency (TF-IDF) weighting and the synonym-extended VSM gave promising results in the experiments throughout the study. The proposed method is validated on two publicly available English-language datasets, CACM and CISI. The quantitative measures obtained in the experiments (mean average precision, precision, recall, and F-measure) are better for the proposed method than for the classical VSM and other baseline methods in the problem domain. The highest precision obtained was 0.83 for the CACM dataset and 0.65 for the CISI dataset.
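The general idea of combining synonym-based query expansion with a TF-IDF vector space model can be sketched as follows. This is only a toy illustration and not the authors' formulation: the abstract does not give the modified TF-IDF formula or the synonym weighting scheme, so the synonym table `SYNONYMS`, the discount `SYNONYM_WEIGHT`, the smoothed IDF, and the function names are all assumptions made for this example.

```python
import math
from collections import Counter

# Hypothetical synonym table; a real system would use a thesaurus resource.
SYNONYMS = {
    "car": ["automobile"],
    "automobile": ["car"],
    "fast": ["quick"],
    "quick": ["fast"],
}
SYNONYM_WEIGHT = 0.5  # assumed discount for synonym terms, not a value from the paper

DOCS = [
    "the automobile is quick",
    "the car is fast",
    "bananas are yellow",
]


def tokenize(text):
    return text.lower().split()


def expand(tokens):
    """Map each term to a weight: 1.0 per original occurrence,
    SYNONYM_WEIGHT per synonym contributed by an original term."""
    weights = Counter(tokens)
    for tok in tokens:
        for syn in SYNONYMS.get(tok, []):
            weights[syn] += SYNONYM_WEIGHT
    return weights


def idf_table(docs_tokens):
    """Smoothed inverse document frequency (an assumed, common variant)."""
    n = len(docs_tokens)
    df = Counter(term for doc in docs_tokens for term in set(doc))
    return {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}


def vectorize(weights, idf):
    """Sparse TF-IDF vector; terms unseen in the collection are dropped."""
    return {t: w * idf[t] for t, w in weights.items() if t in idf}


def cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def rank(query, docs, use_synonyms=True):
    """Return (document indices sorted by descending score, raw scores)."""
    docs_tokens = [tokenize(d) for d in docs]
    idf = idf_table(docs_tokens)
    doc_vecs = [vectorize(Counter(toks), idf) for toks in docs_tokens]
    q_tokens = tokenize(query)
    q_weights = expand(q_tokens) if use_synonyms else Counter(q_tokens)
    q_vec = vectorize(q_weights, idf)
    scores = [cosine(q_vec, dv) for dv in doc_vecs]
    order = sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)
    return order, scores
```

Calling `rank("fast car", DOCS)` places the exact-match document first and the synonym-only document second, whereas `rank("fast car", DOCS, use_synonyms=False)` gives the synonym-only document a score of zero, which is the keyword-matching limitation the abstract describes.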