{"title":"知识工程系统中基于URL的元爬虫除尘方法分析","authors":"Priti Chittekar, Smita Deshmukh","doi":"10.1109/ICCMC.2019.8819753","DOIUrl":null,"url":null,"abstract":"Multiple copies of URLs gathered by the web crawlers responsible for pages with similar or near about to similar content. Few pages combined by the web crawlers consist of similar content. Different URLs with Similar Text are generally known as DUST. Result of this is crawl data, to store the data and use such duplicated data results in building of less quality marking, waste of resources and destitute naive user experiences.To study with such problem, multiple studies have been observed. Previous studies focus only on URL based DUST removal .The proposed method removes content depended DUST and URL based DUST. To crawl the documents we are using a new method of metacrawler which fetches results from three Search engine. We are going to compare each website content with the other linked content to remove duplicates using k- gram paraphrased technique.","PeriodicalId":232624,"journal":{"name":"2019 3rd International Conference on Computing Methodologies and Communication (ICCMC)","volume":"161 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2019-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"1","resultStr":"{\"title\":\"Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems\",\"authors\":\"Priti Chittekar, Smita Deshmukh\",\"doi\":\"10.1109/ICCMC.2019.8819753\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Multiple copies of URLs gathered by the web crawlers responsible for pages with similar or near about to similar content. Few pages combined by the web crawlers consist of similar content. Different URLs with Similar Text are generally known as DUST. Result of this is crawl data, to store the data and use such duplicated data results in building of less quality marking, waste of resources and destitute naive user experiences.To study with such problem, multiple studies have been observed. Previous studies focus only on URL based DUST removal .The proposed method removes content depended DUST and URL based DUST. To crawl the documents we are using a new method of metacrawler which fetches results from three Search engine. 
We are going to compare each website content with the other linked content to remove duplicates using k- gram paraphrased technique.\",\"PeriodicalId\":232624,\"journal\":{\"name\":\"2019 3rd International Conference on Computing Methodologies and Communication (ICCMC)\",\"volume\":\"161 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2019-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"1\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2019 3rd International Conference on Computing Methodologies and Communication (ICCMC)\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1109/ICCMC.2019.8819753\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2019 3rd International Conference on Computing Methodologies and Communication (ICCMC)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICCMC.2019.8819753","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Analysis of Metacrawler approach for URL based DUST removal by knowledge engineering systems
Web crawlers often gather multiple URLs that point to pages with identical or near-identical content. Different URLs with Similar Text are commonly known as DUST. Crawling, storing, and reusing such duplicated data degrades ranking quality, wastes resources, and gives users a poorer experience. Several studies have addressed this problem, but earlier work focuses only on URL-based DUST removal. The proposed method removes both content-based DUST and URL-based DUST. To crawl documents, a metacrawler is used that fetches results from three search engines, and each website's content is compared with the content of the other retrieved pages to remove duplicates using a k-gram based technique.
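The abstract names two checks: collapsing URL-level duplicates and comparing page content with k-grams. Below is a minimal sketch of how such a pipeline could look, assuming standard k-gram shingling with Jaccard similarity and a simple URL normalization pass; the function names, the value of k, and the 0.9 threshold are illustrative assumptions, not details taken from the paper.

```python
# Sketch (not the authors' implementation): near-duplicate detection via
# k-gram shingling plus a simple URL normalization step.
import re
from urllib.parse import urlsplit, urlunsplit


def normalize_url(url: str) -> str:
    """Map trivially different URLs (host case, default port, trailing
    slash, fragment) to one canonical form."""
    parts = urlsplit(url.strip())
    host = parts.netloc.lower().removesuffix(":80")
    path = parts.path.rstrip("/") or "/"
    return urlunsplit((parts.scheme.lower(), host, path, parts.query, ""))


def k_gram_shingles(text: str, k: int = 5) -> set:
    """Return the set of k-word shingles of a page's visible text."""
    words = re.findall(r"\w+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def jaccard(a: set, b: set) -> float:
    """Jaccard similarity of two shingle sets."""
    if not a and not b:
        return 1.0
    return len(a & b) / len(a | b)


# Usage: treat two crawled pages as DUST if their URLs normalize to the same
# string or their shingle similarity exceeds a threshold (0.9 is illustrative).
page_a = ("http://Example.com/news/", "Breaking story about web crawlers and DUST removal.")
page_b = ("http://example.com/news", "Breaking story about web crawlers and DUST removal!")

same_url = normalize_url(page_a[0]) == normalize_url(page_b[0])
similar_text = jaccard(k_gram_shingles(page_a[1]), k_gram_shingles(page_b[1])) > 0.9
print("duplicate:", same_url or similar_text)
```

In a metacrawler setting, each result returned by the three search engines would pass through the URL check first (cheap) and only then through the content comparison, so that exact URL duplicates never reach the more expensive k-gram stage.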