Authors: Li-Chun Sung, Chin-Hwa Kuo, M. Chen, Yeali S. Sun
Venue: The 2005 IEEE/WIC/ACM International Conference on Web Intelligence (WI'05)
Published: 2005-09-19
DOI: 10.1109/WI.2005.119 (https://doi.org/10.1109/WI.2005.119)
Progressive analysis scheme for Web document classification
This paper proposes the progressive analysis scheme (PAS), a Web document classification scheme that classifies HTML Web documents efficiently and effectively. When authors write a Web document, they use HTML tags to visually emphasize text related to its main concepts. PAS is designed to capture this authoring convention in terms of the contributions that nested HTML tags make to document classification. During the learning phase, PAS uses an enhanced tag sequence model to address the shortage of training samples when learning the classification contributions of HTML tag sequences. During the classification phase, PAS decomposes a Web document into regions based on its DOM tag tree and analyzes the regions in descending order of their classification contributions. PAS also provides a mechanism called emphasis degree adjustment, which defers the processing of noisy regions during classification. Simulation results show that PAS outperforms full-text classifiers (e.g., SVM) and sequential classifiers.
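
The idea of splitting a document into tag-delimited regions and visiting them in descending order of emphasis can be illustrated with a small Python sketch. This is not the authors' implementation: the names `TAG_WEIGHT`, `RegionCollector`, `emphasis`, and `ranked_regions` are hypothetical, the weights are invented for illustration (PAS learns the contributions of tag sequences from training data), and scoring a nested tag sequence as the product of per-tag weights is an assumption made only to keep the example short.

```python
from html.parser import HTMLParser

# Hypothetical emphasis weights. PAS learns the contributions of tag
# sequences from training data; these numbers are made up for illustration.
TAG_WEIGHT = {"title": 3.0, "h1": 2.5, "h2": 2.0, "b": 1.5, "strong": 1.5, "em": 1.2}

# Tags that never enclose text, so they are not tracked as open tags.
VOID_TAGS = {"br", "img", "hr", "meta", "link", "input"}


class RegionCollector(HTMLParser):
    """Collects (tag_path, text) regions while walking the document."""

    def __init__(self):
        super().__init__()
        self.stack = []    # currently open tags, outermost first
        self.regions = []  # list of (tag_path tuple, text)

    def handle_starttag(self, tag, attrs):
        if tag not in VOID_TAGS:
            self.stack.append(tag)

    def handle_endtag(self, tag):
        # Pop back to the matching open tag, tolerating unclosed children.
        if tag in self.stack:
            while self.stack.pop() != tag:
                pass

    def handle_data(self, data):
        text = data.strip()
        if text:
            self.regions.append((tuple(self.stack), text))


def emphasis(tag_path):
    """Score a nested tag sequence as the product of its tag weights (assumed rule)."""
    score = 1.0
    for tag in tag_path:
        score *= TAG_WEIGHT.get(tag, 1.0)
    return score


def ranked_regions(html):
    """Return regions sorted from most to least emphasized."""
    collector = RegionCollector()
    collector.feed(html)
    collector.close()
    return sorted(collector.regions, key=lambda r: emphasis(r[0]), reverse=True)
```

A classifier built on top of this ordering could consume regions one at a time, starting with the most emphasized text; how PAS decides when enough regions have been analyzed, and how emphasis degree adjustment reorders noisy regions, is not specified in the abstract and is left out of the sketch.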