Automatic keyword extraction based on dependency parsing and BERT semantic weighting
Huixin Liu
Third International Seminar on Artificial Intelligence, Networking, and Information Technology
Published: 2023-02-22
DOI: 10.1117/12.2667242
Abstract
The classic TextRank algorithm struggles to differentiate the degree of association between candidate keyword nodes, and it ignores long-distance syntactic relations and topic-level semantic information between words when extracting keywords from a document. To address this, we propose an improved TextRank algorithm that exploits lexical, grammatical, and semantic features to extract objective keywords from Chinese academic text. First, after text preprocessing, we construct a word graph of candidate keywords. Second, we integrate multidimensional features of the candidate words into the calculation of the transition probability matrix: our approach mines the full text for a collection of grammatical and morphological features, such as part-of-speech, word position, long-distance dependencies, and BERT-derived dynamic semantic information. Introducing the dependency syntax of long sentences markedly improves the algorithm's ability to identify low-frequency topic keywords, and external semantic information is imported through a word embedding model. The merged feature matrix is then used to compute the influence of every candidate keyword node with the iterative PageRank formula; ranking the candidates by their comprehensive influence scores and selecting the top N yields the final keyword set. We verify the effectiveness of the proposed algorithm on public data sets: it achieves a 5.5% F-score improvement (with 4 extracted keywords) over the classic TextRank. The experimental results demonstrate that mining combined long-text features better differentiates the degree of association between nodes, and that the proposed algorithm is more robust in its extraction performance than previously studied ensemble methods.
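The pipeline the abstract describes (word graph, feature-weighted transition probabilities, PageRank iteration, top-N selection) can be sketched as follows. This is a minimal illustrative toy, not the paper's implementation: the co-occurrence counts and per-word feature scores below are placeholder assumptions standing in for the POS, position, dependency, and BERT-embedding features the paper actually computes.

```python
# Hedged sketch of a feature-weighted TextRank: transition probabilities
# out of each node are biased toward targets with higher feature scores.
def weighted_textrank(edges, feature_weight, d=0.85, iters=50):
    """edges: dict (u, v) -> co-occurrence count (treated as undirected).
    feature_weight: dict word -> combined feature importance (placeholder
    for POS/position/dependency/BERT features in the paper)."""
    nodes = sorted({w for pair in edges for w in pair})
    def w(u, v):
        # Edge weight = co-occurrence count scaled by the target's features.
        c = edges.get((u, v), 0) + edges.get((v, u), 0)
        return c * feature_weight.get(v, 1.0)
    score = {n: 1.0 / len(nodes) for n in nodes}
    for _ in range(iters):
        new = {}
        for v in nodes:
            s = 0.0
            for u in nodes:
                if u == v:
                    continue
                out = sum(w(u, x) for x in nodes if x != u)
                if out > 0:
                    s += w(u, v) / out * score[u]
            # Standard damped PageRank update over the weighted graph.
            new[v] = (1 - d) / len(nodes) + d * s
        score = new
    return sorted(score, key=score.get, reverse=True)

# Toy graph: "parsing" gets a boosted feature score (e.g. noun + early position).
edges = {("dependency", "parsing"): 3, ("parsing", "bert"): 2,
         ("bert", "semantic"): 2, ("semantic", "dependency"): 1}
feats = {"parsing": 1.5, "bert": 1.2, "dependency": 1.0, "semantic": 1.0}
top2 = weighted_textrank(edges, feats)[:2]   # select the final top-N keywords
print(top2)
```

Because the feature score multiplies each incoming edge, a low-frequency word with strong syntactic or semantic features can still outrank a frequent but uninformative one, which is the behavior the abstract attributes to the dependency and BERT weighting.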