Combining statistical, structural, and linguistic features for keyword extraction from web pages

Applied Computing and Intelligence. Pub Date: 1900-01-01. DOI: 10.3934/aci.2022007

H. Shah, P. Fränti


Citations: 1

Abstract

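Keywords are commonly used to summarize text documents. In this paper, we perform a systematic comparison of methods for automatic keyword extraction from web pages. The methods are based on three types of features: statistical, structural, and linguistic. Statistical features are the most common, but web documents contain other clues that can also be used. Structural features use styling codes such as header tags and links, as well as the structure of the web page. Linguistic features can be based on synonym detection, semantic similarity of words, and part-of-speech tagging, or on a concept hierarchy or concept graph derived from Wikipedia. We compare the different types of features to determine the importance of each. One of the key results is that stop word removal and other pre-processing steps are the most critical. The most successful linguistic feature was a pre-constructed list of words that have no synonyms in WordNet. A new method called ACI-rank is also compiled from the best working combination.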
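As a rough illustration of how the statistical and structural features described in the abstract can be combined, the sketch below removes stop words, counts term frequency, and weights words that appear inside header, title, or link tags. It is not the paper's ACI-rank method: the stop word list, the tag weights, and the names `WeightedTextParser` and `extract_keywords` are all assumptions made for this example.

```python
import re
from collections import Counter
from html.parser import HTMLParser

# Tiny illustrative stop word list (an assumption; real lists are much longer).
STOP_WORDS = {"a", "an", "and", "are", "for", "from", "in", "is",
              "of", "on", "or", "the", "to", "with"}

# Hypothetical structural weights: text inside these tags counts more.
TAG_WEIGHTS = {"title": 3.0, "h1": 3.0, "h2": 2.0, "a": 1.5}

# Content of these tags is not visible text, so it is skipped.
SKIPPED_TAGS = {"script", "style"}

TRACKED_TAGS = set(TAG_WEIGHTS) | SKIPPED_TAGS


class WeightedTextParser(HTMLParser):
    """Accumulates a score per word, weighted by the enclosing tracked tags."""

    def __init__(self):
        super().__init__()
        self.open_tags = []      # stack of currently open tracked tags
        self.scores = Counter()  # word -> weighted frequency

    def handle_starttag(self, tag, attrs):
        if tag in TRACKED_TAGS:
            self.open_tags.append(tag)

    def handle_endtag(self, tag):
        if tag in self.open_tags:
            # Drop the most recent matching tag and anything opened after it.
            idx = len(self.open_tags) - 1 - self.open_tags[::-1].index(tag)
            del self.open_tags[idx:]

    def handle_data(self, data):
        if any(t in SKIPPED_TAGS for t in self.open_tags):
            return
        # Plain body text gets weight 1.0; tracked tags raise it.
        weight = max((TAG_WEIGHTS.get(t, 1.0) for t in self.open_tags),
                     default=1.0)
        for word in re.findall(r"[a-z]+", data.lower()):
            # Pre-processing: drop stop words and very short tokens.
            if word not in STOP_WORDS and len(word) > 2:
                self.scores[word] += weight


def extract_keywords(html, top_k=5):
    """Return the top_k highest-scoring words of an HTML string."""
    parser = WeightedTextParser()
    parser.feed(html)
    parser.close()
    return [word for word, _ in parser.scores.most_common(top_k)]
```

Calling `extract_keywords(page_html, top_k=10)` on a page's HTML string returns its ten highest-scoring words. Linguistic features such as the WordNet-based word list mentioned in the abstract are left out of this sketch.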


