Jiaul H. Paik, Yash Agrawal, Sahil Rishi, Vaishal Shah
{"title":"Truncated Models for Probabilistic Weighted Retrieval","authors":"Jiaul H. Paik, Yash Agrawal, Sahil Rishi, Vaishal Shah","doi":"10.1145/3476837","DOIUrl":null,"url":null,"abstract":"Existing probabilistic retrieval models do not restrict the domain of the random variables that they deal with. In this article, we show that the upper bound of the normalized term frequency (tf) from the relevant documents is much smaller than the upper bound of the normalized tf from the whole collection. As a result, the existing models suffer from two major problems: (i) the domain mismatch causes data modeling error, (ii) since the outliers have very large magnitude and the retrieval models follow tf hypothesis, the combination of these two factors tends to overestimate the relevance score. In an attempt to address these problems, we propose novel weighted probabilistic models based on truncated distributions. We evaluate our models on a set of large document collections. Significant performance improvement over six existing probabilistic models is demonstrated.","PeriodicalId":6934,"journal":{"name":"ACM Transactions on Information Systems (TOIS)","volume":"21 1","pages":"1 - 24"},"PeriodicalIF":0.0000,"publicationDate":"2021-12-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"2","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ACM Transactions on Information Systems (TOIS)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3476837","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 2
Abstract
Existing probabilistic retrieval models do not restrict the domain of the random variables that they deal with. In this article, we show that the upper bound of the normalized term frequency (tf) from the relevant documents is much smaller than the upper bound of the normalized tf from the whole collection. As a result, the existing models suffer from two major problems: (i) the domain mismatch causes data modeling error, (ii) since the outliers have very large magnitude and the retrieval models follow tf hypothesis, the combination of these two factors tends to overestimate the relevance score. In an attempt to address these problems, we propose novel weighted probabilistic models based on truncated distributions. We evaluate our models on a set of large document collections. Significant performance improvement over six existing probabilistic models is demonstrated.