A Weighted Freshness Metric for Maintaining Search Engine Local Repository
Jianchao Han, N. Cercone, Xiaohua Hu
IEEE/WIC/ACM International Conference on Web Intelligence (WI'04), published 2004-09-20
DOI: 10.1109/WI.2004.17 (https://doi.org/10.1109/WI.2004.17)
Citations: 4
Abstract
Current search engines maintain a local repository to improve search efficiency. A crawler periodically polls remote web pages to update the contents of the local repository. Because resources are limited, some local pages may be stale. To keep the repository fresh, the crawler is expected to revisit remote web pages in an optimized order and at an optimized frequency. The intuitive freshness metric is the fraction of up-to-date web pages in the repository. It is based solely on the repository content and, unfortunately, does not reflect the perspective of search engine users, for example how often a web page is queried. We propose a novel weighted metric of repository freshness in which the importance of each web page serves as its weight. This metric takes into account not only the local web pages themselves but also the perspectives of search engine users. We study the repository synchronization policy under this new metric, compare the metric with others, analyze its features, and discuss how web page importance is determined.
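To make the contrast between the two metrics concrete, the following is a minimal sketch rather than the paper's own formulation. The abstract does not give the exact formula, so the sketch assumes the weighted freshness is the importance-weighted share of up-to-date pages, normalized by total importance. The `Page` class, its fields, and the sample importance values are hypothetical.

```python
from dataclasses import dataclass
from typing import Iterable


@dataclass
class Page:
    """Hypothetical record for one page in the local repository."""
    url: str
    up_to_date: bool   # True if the local copy matches the remote page
    importance: float  # assumed weight, e.g. derived from query frequency


def freshness(pages: Iterable[Page]) -> float:
    """Unweighted freshness: fraction of up-to-date pages."""
    pages = list(pages)
    if not pages:
        return 0.0
    return sum(1 for p in pages if p.up_to_date) / len(pages)


def weighted_freshness(pages: Iterable[Page]) -> float:
    """Assumed weighted form: importance of fresh pages over total importance."""
    pages = list(pages)
    total = sum(p.importance for p in pages)
    if total == 0:
        return 0.0
    return sum(p.importance for p in pages if p.up_to_date) / total


# One frequently queried fresh page and two rarely queried stale pages.
repo = [
    Page("http://example.com/a", True, 8.0),
    Page("http://example.com/b", False, 1.0),
    Page("http://example.com/c", False, 1.0),
]
print(freshness(repo), weighted_freshness(repo))
```

In this sample only one of three pages is fresh, so the unweighted metric is low, while the weighted metric is high because the fresh page carries most of the assumed importance. This is the kind of user-facing difference the abstract argues the plain fraction misses.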