{"title":"A Web Page De-duplication Algorithm Based on Data Clearing","authors":"Jian-ming Lin, Dong-sheng Liu, Shi-wen Gao, Wei Chen","doi":"10.1109/JCAI.2009.181","DOIUrl":null,"url":null,"abstract":"Duplicated web pages responded by search engines not only waste valuable storage, but also aggravate burdens of users’ browse. Web page de-duplication can effectively improve the information retrieval. This paper proposes pretreatment of web pages to improve the effectiveness and efficiency of web page de-duplication based on feature code according to the principle of data clearing. This paper features that ranking feature code to reduce the comparison times of the system and space and time complexity. Experiments show that this method has a promising prospect in eliminating large-scale duplicated web pages.","PeriodicalId":154425,"journal":{"name":"2009 International Joint Conference on Artificial Intelligence","volume":"117 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2009-04-25","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2009 International Joint Conference on Artificial Intelligence","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/JCAI.2009.181","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 1
Abstract
Duplicated web pages responded by search engines not only waste valuable storage, but also aggravate burdens of users’ browse. Web page de-duplication can effectively improve the information retrieval. This paper proposes pretreatment of web pages to improve the effectiveness and efficiency of web page de-duplication based on feature code according to the principle of data clearing. This paper features that ranking feature code to reduce the comparison times of the system and space and time complexity. Experiments show that this method has a promising prospect in eliminating large-scale duplicated web pages.