Graph-based Density Peaks Ranking Approach for Extracting KeyPhrases (GDREK)

2019 IEEE 7th Palestinian International Conference on Electrical and Computer Engineering (PICECE). Pub Date: 2019-03-01. DOI: 10.1109/PICECE.2019.8747175

Mahmoud R. Alfarra, Abdalfattah M. Alfarra, Ahmed Salahedden


Citations: 3

Abstract

A Google Scholar search returns more than 1,500,000 recently published articles on keyphrase extraction (KE), 21,000 of them from the current year alone. This large number implies that researchers still need more accurate and better-performing models for KE from text, a subtask of text mining and summarization. This paper presents a novel KE design, GDREK, which combines a graph-based representation, sentence clustering, and density-peaks ranking to extract keyphrases from single or multiple documents; it can also be used for extractive text summarization. The principle of GDREK is to represent the text with a graph model and then group and rank the sentences in a mutually dependent manner. Grouping and ranking proceed by discovering the main topics of the text and incrementally finding the central sentences of each topic. In this incremental step, the sentences are grouped with the Graph-based Growing Self-Organizing Map (G-GSOM) and, at the same time, ranked using the Density Peaks (DP) concept according to a measure of similarity between sentences. The similarity measure is based on shared phrases and the cosine function. Sentences are scored under the assumption that a sentence with more similar sentences is more important (higher density) and more representative. Finally, the most frequent words or phrases in the selected sentences are chosen as the key phrases of the text. Experimental results on two datasets show that the technique extracts most of the key phrases and words, covers most sub-topics of the text, and achieves over 75% accuracy.
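The density-based scoring and frequency-based selection described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it omits the graph representation and the G-GSOM clustering entirely, approximates the similarity measure with plain bag-of-words cosine similarity, and counts a sentence's density as the number of other sentences whose similarity reaches a cutoff. The function names, the `cutoff` value, and the tiny stopword list are all assumptions made for illustration.

```python
import math
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption, not from the paper).
STOPWORDS = frozenset({"a", "an", "the", "are", "is", "of", "and", "to", "in"})


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def cosine(a, b):
    """Cosine similarity between two term-frequency Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def density_rank(sentences, cutoff=0.3):
    """Score each sentence by how many other sentences are similar to it.

    Returns (order, densities): sentence indices sorted by decreasing
    density, and the density of each sentence in its original position.
    The cutoff is an assumed parameter, not a value from the paper.
    """
    vectors = [Counter(tokenize(s)) for s in sentences]
    densities = [
        sum(1 for j, vj in enumerate(vectors) if j != i and cosine(vi, vj) >= cutoff)
        for i, vi in enumerate(vectors)
    ]
    order = sorted(range(len(sentences)), key=lambda i: -densities[i])
    return order, densities


def top_terms(sentences, k=3):
    """Return the k most frequent non-stopword tokens across the sentences."""
    counts = Counter(
        t for s in sentences for t in tokenize(s) if t not in STOPWORDS
    )
    return [term for term, _ in counts.most_common(k)]
```

A caller would typically pass the sentences of one topic cluster to `density_rank`, keep the highest-density sentences, and feed those to `top_terms` to obtain candidate key terms. Extending `tokenize` to emit multi-word phrases would bring the sketch closer to the shared-phrase similarity the abstract mentions.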


