{"title":"有效地利用本地周边页面构建网页集合","authors":"Yuxin Wang, K. Oyama","doi":"10.2201/NIIPI.2009.6.4","DOIUrl":null,"url":null,"abstract":"This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher’s homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure concept that is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering all potential researchers’ homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure were used for merging the keywords from the surrounding pages. Although a lot of noise pages are included if we use keywords in the surrounding pages without considering the page group structure, the experimental results show that our method can reduce the increase of noise pages to an allowable level and can gather a significant number of the positive pages that could not be gathered using a single-page-based method. For the second process, we propose composing a three-grade classifier using two base classifiers: precision-assured and recall-assured. It classifies the input to assured positive, assured negative, and uncertain pages, where the uncertain pages need a manual assessment, so that the collection quality required by an application can be assured. Each of the base classifiers is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages and the EC classifies the entry pages using information from both the entry page and the likely component pages. An evident performance improvement of the base classifiers by the introduction of the SC is shown through experiments. Then, the reduction of the number of uncertain pages is evaluated and the effectiveness of the proposed method is shown.","PeriodicalId":91638,"journal":{"name":"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing","volume":"12 1","pages":"27"},"PeriodicalIF":0.0000,"publicationDate":"2009-03-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Buildingweb page collections efficiently exploiting local surrounding pages\",\"authors\":\"Yuxin Wang, K. Oyama\",\"doi\":\"10.2201/NIIPI.2009.6.4\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher’s homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure concept that is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering all potential researchers’ homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure were used for merging the keywords from the surrounding pages. Although a lot of noise pages are included if we use keywords in the surrounding pages without considering the page group structure, the experimental results show that our method can reduce the increase of noise pages to an allowable level and can gather a significant number of the positive pages that could not be gathered using a single-page-based method. For the second process, we propose composing a three-grade classifier using two base classifiers: precision-assured and recall-assured. It classifies the input to assured positive, assured negative, and uncertain pages, where the uncertain pages need a manual assessment, so that the collection quality required by an application can be assured. Each of the base classifiers is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages and the EC classifies the entry pages using information from both the entry page and the likely component pages. An evident performance improvement of the base classifiers by the introduction of the SC is shown through experiments. Then, the reduction of the number of uncertain pages is evaluated and the effectiveness of the proposed method is shown.\",\"PeriodicalId\":91638,\"journal\":{\"name\":\"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing\",\"volume\":\"12 1\",\"pages\":\"27\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2009-03-01\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.2201/NIIPI.2009.6.4\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"... Proceedings of the ... IEEE International Conference on Progress in Informatics and Computing. IEEE International Conference on Progress in Informatics and Computing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.2201/NIIPI.2009.6.4","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

摘要

本文描述了一种利用本地周边页面构建高质量网页集的方法,减少了人工评估成本。以某研究员的主页为例,对目标分类进行了实验,证明了该方法的有效性。该方法包括粗过滤和精确分类两个过程。在这两个过程中,我们都引入了一个逻辑页面组结构概念,该概念由条目页面与其周围页面之间的关系(基于它们的连接类型和相对URL目录级别)表示,并根据该概念使用本地周围页面的内容。对于第一个过程,我们提出了一种非常有效的方法,利用基于属性的关键字列表从网络上全面收集所有潜在研究人员的主页。采用四种基于页面组结构的页面组模型(page group model, PGMs)对周边页面进行关键词合并。虽然在不考虑页面组结构的情况下使用周围页面中的关键字会产生大量的噪声页面,但实验结果表明,我们的方法可以将噪声页面的增加减少到允许的水平,并且可以收集到使用基于单页面的方法无法收集到的大量积极页面。对于第二个过程,我们建议使用两个基本分类器组成一个三级分类器:精度保证和召回保证。它将输入分类为确定的正面、确定的负面和不确定的页面,其中不确定的页面需要手动评估,以便可以确保应用程序所需的收集质量。每个基分类器进一步由一个周围页面分类器(SC)和一个条目页面分类器(EC)组成。SC选择可能的组件页面,EC使用来自条目页面和可能的组件页面的信息对条目页面进行分类。实验表明,引入SC后,基分类器的性能有了明显的提高。然后,对不确定页数的减少进行了评估,并证明了所提方法的有效性。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
Buildingweb page collections efficiently exploiting local surrounding pages
This paper describes a method for building a high-quality web page collection with a reduced manual assessment cost that exploits local surrounding pages. Effectiveness of the method is shown through experiments using a researcher’s homepage as an example of the target categories. The method consists of two processes: rough filtering and accurate classification. In both processes, we introduce a logical page group structure concept that is represented by the relation between an entry page and its surrounding pages based on their connection type and relative URL directory level, and use the contents of local surrounding pages according to that concept. For the first process, we propose a very efficient method for comprehensively gathering all potential researchers’ homepages from the web using property-based keyword lists. Four kinds of page group models (PGMs) based on the page group structure were used for merging the keywords from the surrounding pages. Although a lot of noise pages are included if we use keywords in the surrounding pages without considering the page group structure, the experimental results show that our method can reduce the increase of noise pages to an allowable level and can gather a significant number of the positive pages that could not be gathered using a single-page-based method. For the second process, we propose composing a three-grade classifier using two base classifiers: precision-assured and recall-assured. It classifies the input to assured positive, assured negative, and uncertain pages, where the uncertain pages need a manual assessment, so that the collection quality required by an application can be assured. Each of the base classifiers is further composed of a surrounding page classifier (SC) and an entry page classifier (EC). The SC selects likely component pages and the EC classifies the entry pages using information from both the entry page and the likely component pages. An evident performance improvement of the base classifiers by the introduction of the SC is shown through experiments. Then, the reduction of the number of uncertain pages is evaluated and the effectiveness of the proposed method is shown.
求助全文
通过发布文献求助,成功后即可免费获取论文全文。 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
A convolutional neural network based approach towards real-time hard hat detection Report on the analyses and the applications of a large-scale news video archive: NII TV-RECS Large-scale cross-media analysis and mining from socially curated contents Scalable Approaches for Content -based Video Retrieval 湘南会議 The future of multimedia analysis and mining
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1