Crawl Topical Vietnamese Web Pages Using Genetic Algorithm
Nguyen Quoc Nhan, Vu Tuan Son, Huynh Thi Thanh Binh, Tran Duc Khanh
2010 Second International Conference on Knowledge and Systems Engineering, 2010-10-07
DOI: 10.1109/KSE.2010.25 (https://doi.org/10.1109/KSE.2010.25)
Citations: 8
Abstract
A focused crawler traverses the web and selects pages relevant to a predefined topic. During a crawl it is difficult both to identify relevant pages and to predict which links lead to high-quality pages. In this paper, we propose a crawler system that uses a genetic algorithm to improve crawling performance. Apart from estimating the best path to follow, our system also uses the genetic algorithm to expand its initial keywords during the crawling process. To crawl Vietnamese web pages, we apply a hybrid word segmentation approach that combines automata and part-of-speech tagging techniques for the Vietnamese text classifier. We evaluate our algorithm on Vietnamese websites and report experimental results to show the efficiency of our system.
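The abstract does not specify the encoding, fitness function, or genetic operators, so the following is only a minimal sketch of how keyword expansion with a genetic algorithm could look, not the authors' implementation. It assumes pages are already segmented into word tokens (the segmenter itself is not shown), that an individual is a small set of keywords, and that fitness is the mean number of keyword hits on relevant pages minus the mean on irrelevant ones. The names `fitness` and `expand_keywords`, all parameters, and the sample pages are hypothetical.

```python
import random


def fitness(keywords, relevant_docs, irrelevant_docs):
    """Assumed fitness: mean keyword hits on relevant pages minus mean hits on irrelevant ones."""
    def mean_hits(docs):
        return sum(len(keywords & doc) for doc in docs) / len(docs)
    return mean_hits(relevant_docs) - mean_hits(irrelevant_docs)


def expand_keywords(seed_keywords, relevant_docs, irrelevant_docs,
                    size=4, pop_size=20, generations=30,
                    mutation_rate=0.3, seed=0):
    """Evolve a keyword set of at most `size` words, starting from the seed keywords.

    Assumes the vocabulary holds at least `size` words and len(seed_keywords) <= size.
    """
    rng = random.Random(seed)
    # Candidate words: the seed keywords plus every token seen on relevant pages.
    vocabulary = sorted(set().union(seed_keywords, *relevant_docs))

    def score(individual):
        return fitness(individual, relevant_docs, irrelevant_docs)

    # Initial population: the seed set itself plus random keyword sets.
    population = [frozenset(seed_keywords)]
    population += [frozenset(rng.sample(vocabulary, size))
                   for _ in range(pop_size - 1)]

    for _ in range(generations):
        population.sort(key=score, reverse=True)
        parents = population[: pop_size // 2]   # truncation selection
        children = [population[0]]              # elitism: keep the best set
        while len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(a | b)
            # Crossover: draw the child from the union of both parents.
            child = set(rng.sample(pool, min(size, len(pool))))
            # Mutation: swap one keyword for a random vocabulary word.
            if rng.random() < mutation_rate:
                child.discard(rng.choice(sorted(child)))
                child.add(rng.choice(vocabulary))
            children.append(frozenset(child))
        population = children

    return max(population, key=score)


# Hypothetical pre-segmented pages, each represented as a set of tokens.
relevant = [
    {"bóng_đá", "cầu_thủ", "trận_đấu"},
    {"bóng_đá", "huấn_luyện_viên", "trận_đấu"},
    {"cầu_thủ", "huấn_luyện_viên", "bàn_thắng"},
]
irrelevant = [
    {"thị_trường", "giá", "trận_đấu"},
    {"thị_trường", "cổ_phiếu"},
]
seed_keywords = frozenset({"bóng_đá"})
best = expand_keywords(seed_keywords, relevant, irrelevant)
```

Because the seed set is part of the initial population and the best individual is carried over unchanged in every generation, the returned set never scores below the seed keywords under this fitness. In a crawler, the expanded set could then be used to score fetched pages and prioritize their outgoing links; that part is outside this sketch.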