Combining statistical, structural, and linguistic features for keyword extraction from web pages

H. Shah, P. Fränti
Applied Computing and Intelligence, https://doi.org/10.3934/aci.2022007
Citations: 1

Abstract

Keywords are commonly used to summarize text documents. In this paper, we perform a systematic comparison of methods for automatic keyword extraction from web pages. The methods are based on three different types of features: statistical, structural and linguistic. Statistical features are the most common, but there are other clues in web documents that can also be used. Structural features utilize styling codes like header tags and links, but also the structure of the web page. Linguistic features can be based on detecting synonyms, semantic similarity of the words and part-of-speech tagging, but also concept hierarchy or a concept graph derived from Wikipedia. We compare different types of features to find out the importance of each of them. One of the key results is that stop word removal and other pre-processing steps are the most critical. The most successful linguistic feature was a pre-constructed list of words that had no synonyms in WordNet. A new method called ACI‑rank is also compiled from the best working combination.
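
As a rough illustration of how a statistical feature (term frequency after stop word removal) might be combined with a structural one (extra weight for text inside title, header and link tags), consider the following minimal Python sketch. It is not the ACI‑rank method and does not reproduce any scoring from the paper. The stop word list, the tag weights, and the names `WeightedTextParser` and `extract_keywords` are all assumptions made for illustration.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Hypothetical, very small stop word list (illustration only).
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "the", "to", "with"}

# Hypothetical weights for structural cues; the values are assumptions.
TAG_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "a": 1.5}


class WeightedTextParser(HTMLParser):
    """Accumulates a score per word, weighted by the enclosing tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # tags currently open around the text
        self.scores = Counter()  # word -> accumulated weighted frequency

    def handle_starttag(self, tag, attrs):
        self.open_tags.append(tag)

    def handle_endtag(self, tag):
        # Remove the most recent matching tag, tolerating sloppy HTML.
        for i in range(len(self.open_tags) - 1, -1, -1):
            if self.open_tags[i] == tag:
                del self.open_tags[i]
                break

    def handle_data(self, data):
        # Pre-processing: ignore script and style content entirely.
        if "script" in self.open_tags or "style" in self.open_tags:
            return
        # Structural feature: use the highest weight among enclosing tags.
        weight = max([TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags],
                     default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            # Pre-processing: drop stop words and very short tokens.
            if word not in STOP_WORDS and len(word) > 2:
                # Statistical feature: weighted term frequency.
                self.scores[word] += weight


def extract_keywords(html, top_k=5):
    """Return the top_k words ranked by weighted frequency."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    # Sort by descending score, then alphabetically for stable ties.
    ranked = sorted(parser.scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_k]]
```

Calling `extract_keywords(page_html, top_k=3)` on a page returns its three highest-scoring words. Linguistic features such as the WordNet-based list mentioned in the abstract are deliberately left out of this sketch.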