An empirical study of important keyword extraction techniques from documents

H. Hasan, Falguni Sanyal, Dipankar Chaki, Md. Haider Ali
Published in: 2017 1st International Conference on Intelligent Systems and Information Management (ICISIM), 2017-10-01
DOI: 10.1109/ICISIM.2017.8122154 (https://doi.org/10.1109/ICISIM.2017.8122154)
Citations: 13

Abstract

Keyword extraction is an automated process that collects a set of terms giving an overview of a document. A keyword is defined by how well it identifies the core information of a particular document. When a huge number of documents must be analyzed to find relevant information, keyword extraction is the key approach: it helps us grasp the substance of a document even before we read it. In this paper, we give an overview of the different approaches and algorithms that have been used for keyword extraction and compare them to identify the better approach for future work. We studied algorithms such as support vector machines (SVM), conditional random fields (CRF), NP-chunk, n-grams, multiple linear regression, and logistic regression for finding important keywords in a document. We found that SVM and CRF give the better results, with CRF outperforming SVM on F1 score (the balance between precision and recall). On precision, SVM performs better than CRF, but on recall, logistic regression (logit) gives the best result. We also found that two broader families of approaches have been used in keyword extraction: statistical approaches and machine learning approaches. Statistical approaches show good results with statistical data. Machine learning approaches, which use training data, provide better results than statistical approaches. Examples of statistical approaches are Expectation-Maximization, K-Nearest Neighbor, and Bayesian methods. Extractor and GenEx are examples of machine learning approaches in the keyword extraction field. Apart from these two approaches, the semantic relation between words is another key feature in keyword extraction techniques.
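
The comparison in the abstract rests on precision, recall, and F1. The sketch below shows how these scores are commonly computed for keyword extraction, assuming exact-match comparison between extracted and gold-standard keywords. The function name `keyword_prf` and the sample keyword lists are hypothetical illustrations, not taken from the paper, and the paper's own evaluation protocol may differ.

```python
def keyword_prf(extracted, gold):
    """Return (precision, recall, f1) for exact-match keyword sets.

    Hypothetical helper for illustration; not the paper's evaluation code.
    """
    extracted, gold = set(extracted), set(gold)
    true_positives = len(extracted & gold)

    # Precision: share of extracted keywords that are correct.
    precision = true_positives / len(extracted) if extracted else 0.0
    # Recall: share of gold keywords that were found.
    recall = true_positives / len(gold) if gold else 0.0
    # F1: harmonic mean of precision and recall.
    if precision + recall == 0:
        f1 = 0.0
    else:
        f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Made-up example data (assumption, for illustration only).
extracted = ["keyword extraction", "svm", "crf", "documents"]
gold = ["keyword extraction", "svm", "crf", "f1 score", "logistic regression"]

p, r, f1 = keyword_prf(extracted, gold)
print(f"precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

Because F1 is a harmonic mean, a method can lead on precision or on recall alone and still trail on F1, which is consistent with the abstract reporting different winners per metric.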