{"title":"有效地利用本地周边页面构建网页集合","authors":"Yuxin Wang, K. Oyama","doi":"10.2201/NIIPI.2009.6.4","DOIUrl":null,"url":null,"abstract":"This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher’s homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure concept that is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering all potential researchers’ homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure were used for merging the keywords from the surrounding pages. Although a lot of noise pages are included if we use keywords in the surrounding pages without considering the page group structure, the experimental results show that our method can reduce the increase of noise pages to an allowable level and can gather a significant number of the positive pages that could not be gathered using a single-page-based method. For the second process, we propose composing a three-grade classifier using two base classifiers: precision-assured and recall-assured. It classifies the input to assured positive, assured negative, and uncertain pages, where the uncertain pages need a manual assessment, so that the collection quality required by an application can be assured. Each of the base classifiers is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). 
The SC selects likely component pages and the EC classifies the entry pages using information from both the entry page and the likely component pages. An evident performance improvement of the base classifiers by the introduction of the SC is shown through experiments. Then, the reduction of the number of uncertain pages is evaluated and the effectiveness of the proposed method is shown.","PeriodicalId":91638,"journal":{"name":"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing","volume":"12 1","pages":"27"},"PeriodicalIF":0.0000,"publicationDate":"2009-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Buildingweb page collections efficiently exploiting local surrounding pages\",\"authors\":\"Yuxin Wang, K. Oyama\",\"doi\":\"10.2201/NIIPI.2009.6.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher’s homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure concept that is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering all potential researchers’ homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure were used for merging the keywords from the surrounding pages. 
Although a lot of noise pages are included if we use keywords in the surrounding pages without considering the page group structure, the experimental results show that our method can reduce the increase of noise pages to an allowable level and can gather a significant number of the positive pages that could not be gathered using a single-page-based method. For the second process, we propose composing a three-grade classifier using two base classifiers: precision-assured and recall-assured. It classifies the input to assured positive, assured negative, and uncertain pages, where the uncertain pages need a manual assessment, so that the collection quality required by an application can be assured. Each of the base classifiers is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages and the EC classifies the entry pages using information from both the entry page and the likely component pages. An evident performance improvement of the base classifiers by the introduction of the SC is shown through experiments. Then, the reduction of the number of uncertain pages is evaluated and the effectiveness of the proposed method is shown.\",\"PeriodicalId\":91638,\"journal\":{\"name\":\"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing\",\"volume\":\"12 1\",\"pages\":\"27\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. 
IEEE International Conference on Progress in Informatics and Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2201/NIIPI.2009.6.4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2201/NIIPI.2009.6.4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Building web page collections efficiently exploiting local surrounding pages
This paper describes a method for building a high-quality web page collection, at a reduced manual assessment cost, by exploiting local surrounding pages. The effectiveness of the method is demonstrated through experiments using researchers' homepages as an example target category. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure, represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that structure. For the first process, we propose a highly efficient method for comprehensively gathering all potential researchers' homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure are used to merge keywords from the surrounding pages. Although many noise pages would be included if keywords from the surrounding pages were used without considering the page group structure, the experimental results show that our method reduces the increase in noise pages to an allowable level while gathering a significant number of positive pages that a single-page-based method could not gather. For the second process, we propose composing a three-grade classifier from two base classifiers, one precision-assured and one recall-assured. It classifies the input into assured positive, assured negative, and uncertain pages; only the uncertain pages need manual assessment, so the collection quality required by an application can be assured. Each base classifier is in turn composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages, and the EC classifies the entry pages using information from both the entry page and the likely component pages. Experiments show a clear performance improvement of the base classifiers from the introduction of the SC. Finally, the reduction in the number of uncertain pages is evaluated, demonstrating the effectiveness of the proposed method.
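The first process, rough filtering, can be sketched as follows. This is an illustrative reconstruction only: the merging rule (admit a surrounding page by its connection to the entry page and its relative URL directory level) and the property-based keyword lists are assumptions for the example, not the authors' actual PGM definitions.

```python
# Stage 1 sketch: merge keywords from surrounding pages admitted by a
# simple page-group model, then filter by property-based keyword lists.
# All names, fields, and thresholds here are hypothetical.

def merge_surrounding_keywords(entry_keywords, surrounding_pages, max_level=1):
    """Merge keywords from surrounding pages that the page-group model
    admits: here, only pages linked from the entry page and lying at most
    `max_level` URL-directory levels below it."""
    merged = set(entry_keywords)
    for page in surrounding_pages:
        if page["linked_from_entry"] and page["relative_level"] <= max_level:
            merged |= set(page["keywords"])
    return merged

def passes_rough_filter(merged_keywords, property_lists):
    """Keep a candidate only if every property-based keyword list
    (e.g. name cues, affiliation cues, research cues) is matched at
    least once by the merged keyword set."""
    return all(merged_keywords & set(props) for props in property_lists)
```

Restricting the merge by connection type and directory level is what keeps unrelated pages (and hence noise keywords) out of the merged set, which is the role the PGMs play in the paper.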
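The second process combines the two base classifiers into the three-grade decision described above. A minimal sketch, assuming the precision-assured and recall-assured base classifiers are given as callables (their internal SC/EC composition is omitted; names are illustrative, not the authors' code):

```python
# Stage 2 sketch: three-grade classification from two base classifiers.
# A precision-assured classifier's positives are trusted; a
# recall-assured classifier's negatives are trusted; everything else
# is routed to manual assessment.

def three_grade_classify(page, precision_assured, recall_assured):
    """Return one of 'assured positive', 'assured negative', or
    'uncertain' for the given page."""
    if precision_assured(page):
        return "assured positive"   # accept without manual check
    if not recall_assured(page):
        return "assured negative"   # reject without manual check
    return "uncertain"              # needs manual assessment
```

Under this scheme, the manual assessment cost is proportional to the number of `uncertain` pages, so any improvement to either base classifier (such as the SC-based selection of component pages) directly shrinks the manually reviewed set.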