{"title":"使用网页元数据增强URL规范化","authors":"Lay-Ki Soon, Sang Ho Lee","doi":"10.1109/ICCEE.2008.112","DOIUrl":null,"url":null,"abstract":"In this paper, we present our proposed method of incorporating metadata of Web pages to identify equivalent URLs in addition to the standard URL normalization methodology. The metadata considered are the page size and the body text of Web pages. These metadata can be obtained during HTML parsing in the process of crawling without incurring unnecessary cost. Our experiment shows an accuracy of up to 95.38% in identifying equivalent URLs by using the body text of Web pages.","PeriodicalId":365473,"journal":{"name":"2008 International Conference on Computer and Electrical Engineering","volume":"9 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2008-12-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"6","resultStr":"{\"title\":\"Enhancing URL Normalization Using Metadata of Web Pages\",\"authors\":\"Lay-Ki Soon, Sang Ho Lee\",\"doi\":\"10.1109/ICCEE.2008.112\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"In this paper, we present our proposed method of incorporating metadata of Web pages to identify equivalent URLs in addition to the standard URL normalization methodology. The metadata considered are the page size and the body text of Web pages. These metadata can be obtained during HTML parsing in the process of crawling without incurring unnecessary cost. Our experiment shows an accuracy of up to 95.38% in identifying equivalent URLs by using the body text of Web pages.\",\"PeriodicalId\":365473,\"journal\":{\"name\":\"2008 International Conference on Computer and Electrical Engineering\",\"volume\":\"9 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2008-12-20\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"6\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2008 International Conference on Computer and Electrical Engineering\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCEE.2008.112\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2008 International Conference on Computer and Electrical Engineering","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCEE.2008.112","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Enhancing URL Normalization Using Metadata of Web Pages
In this paper, we present our proposed method of incorporating metadata of Web pages to identify equivalent URLs in addition to the standard URL normalization methodology. The metadata considered are the page size and the body text of Web pages. These metadata can be obtained during HTML parsing in the process of crawling without incurring unnecessary cost. Our experiment shows an accuracy of up to 95.38% in identifying equivalent URLs by using the body text of Web pages.