Improved TFIDF weighting techniques in document Retrieval
Fadi Yamout, Rachad Lakkis
2018 Thirteenth International Conference on Digital Information Management (ICDIM), published 2018-09-01
DOI: 10.1109/ICDIM.2018.8847156 (https://doi.org/10.1109/ICDIM.2018.8847156)
Citations: 4
Abstract
In information retrieval, documents are usually retrieved using lexical matching, which matches the words in a user's query against the words found in a set of documents. A significant model used in information retrieval is the vector space model, in which these words are represented as a vector in space and are assigned weights using a popular weighting technique called TFIDF (Term Frequency Inverse Document Frequency). In this thesis, we devise three new weighting techniques to improve on TFIDF. The first technique, Dispersed Words Weight Augmentation (DWWA), gives more weight to words that are distributed across most of the document’s paragraphs; we consider those words more significant than words found in only a few paragraphs. The second technique, Title Weight Augmentation (TWA), gives more weight to words found in the document’s title and first paragraphs. The third technique, First Ranked Words Weight Augmentation (FRWWA), further increases the weight of the most frequent words in a document. We tested the three techniques and found that our system retrieved more relevant documents.
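The abstract does not give the formulas behind the three techniques, so the following is only a minimal sketch of how such augmentations could sit on top of a baseline TFIDF weight. Everything specific here is an assumption made for illustration and not the authors' method: the multiplicative `boost`, the `dispersion` threshold (share of paragraphs a word must appear in), the `lead` count of opening paragraphs, the `top_k` cutoff for the most frequent words, the whitespace tokenizer, and the function names themselves.

```python
import math
from collections import Counter


def tokenize(text):
    """Lowercase whitespace tokenizer; a real system would also strip punctuation."""
    return text.lower().split()


def doc_tokens(doc):
    """All tokens of a document: its title plus every paragraph."""
    tokens = tokenize(doc["title"])
    for paragraph in doc["paragraphs"]:
        tokens.extend(tokenize(paragraph))
    return tokens


def tfidf(doc, corpus):
    """Baseline TFIDF: raw term count times log(N / document frequency).

    Assumes `doc` is a member of `corpus`, so every document frequency is >= 1.
    """
    term_sets = [set(doc_tokens(d)) for d in corpus]
    weights = {}
    for term, count in Counter(doc_tokens(doc)).items():
        df = sum(1 for terms in term_sets if term in terms)
        weights[term] = count * math.log(len(corpus) / df)
    return weights


def augmented_tfidf(doc, corpus, boost=1.5, dispersion=0.5, lead=1, top_k=2):
    """Apply hypothetical DWWA, TWA and FRWWA boosts to the baseline weights."""
    weights = tfidf(doc, corpus)
    paragraphs = [set(tokenize(p)) for p in doc["paragraphs"]]

    # DWWA (assumed rule): words present in more than `dispersion` of the paragraphs.
    dispersed = {
        t for t in weights
        if sum(t in p for p in paragraphs) > dispersion * len(paragraphs)
    }
    # TWA (assumed rule): words in the title or in the first `lead` paragraphs.
    prominent = set(tokenize(doc["title"])).union(*paragraphs[:lead])
    # FRWWA (assumed rule): the `top_k` most frequent words of the document.
    top_ranked = {t for t, _ in Counter(doc_tokens(doc)).most_common(top_k)}

    for term in weights:
        for group in (dispersed, prominent, top_ranked):
            if term in group:
                weights[term] *= boost  # multiplicative boost is an assumption
    return weights


# Toy corpus, invented purely for illustration.
corpus = [
    {"title": "solar energy",
     "paragraphs": ["solar panels convert light",
                    "solar cells need light",
                    "storage uses batteries"]},
    {"title": "wind power",
     "paragraphs": ["wind turbines spin", "turbines need wind"]},
    {"title": "energy markets",
     "paragraphs": ["prices change daily"]},
]

base = tfidf(corpus[0], corpus)
augmented = augmented_tfidf(corpus[0], corpus)
for term in sorted(base, key=augmented.get, reverse=True):
    print(f"{term:10s} base={base[term]:.3f} augmented={augmented[term]:.3f}")
```

In this toy corpus, a word such as "solar" qualifies under all three assumed rules and is boosted three times, while "batteries" qualifies under none and keeps its baseline weight. Applying the boosts independently keeps each technique easy to enable or disable when comparing retrieval quality.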