{"title":"A Four-Feature Keyword Extraction Algorithm Based on Word Length Priority ratio","authors":"Hui Kang, Lingfeng Lu, H. Su","doi":"10.1109/TrustCom50675.2020.00203","DOIUrl":null,"url":null,"abstract":"With the rapid development of Internet technology and the advent of the information age, it has become a research hotspot to obtain key information from numerous data. Due to the diversity and irregularity of network data, it is difficult for people to find the literature they want, especially knowledge scholars working in the frontier field of science and technology, who have higher requirements on the accuracy and efficiency of literature keyword extraction than ordinary people. The feature values selected by the current keyword extraction algorithm are usually limited to word frequency and word length, which is incomplete and affects the accuracy of the algorithm. Given this phenomenon, this paper, by comparing with TF-IDF and KEA algorithm, define the concept of word length priority ratio, and applies this concept to the calculation of word length-weight, proposes a four-feature keyword extraction algorithm (WPR-TOC algorithm) based on word frequency, word length, word position and the degree of association between words. 
Through experiments, compared with the KEA algorithm, KEA++ algorithm, and four features extraction algorithm, the precision of the WPR-TOC algorithm is improved by 40%, 30%, and 10% respectively, and the recall rate is also increased by 40%, 30%, and 10% respectively.","PeriodicalId":221956,"journal":{"name":"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)","volume":"48 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2020-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"2020 IEEE 19th International Conference on Trust, Security and Privacy in Computing and Communications (TrustCom)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/TrustCom50675.2020.00203","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
With the rapid development of Internet technology and the advent of the information age, extracting key information from large volumes of data has become a research hotspot. Because network data is diverse and irregular, it is difficult for people to find the literature they want. This is especially true for scholars working at the frontier of science and technology, who place higher demands on the accuracy and efficiency of literature keyword extraction than ordinary users. The features selected by current keyword extraction algorithms are usually limited to word frequency and word length, which is incomplete and reduces the accuracy of the algorithms. To address this, the paper compares against the TF-IDF and KEA algorithms, defines the concept of the word length priority ratio, and applies this concept to the calculation of the word-length weight. It proposes a four-feature keyword extraction algorithm (the WPR-TOC algorithm) based on word frequency, word length, word position, and the degree of association between words. In experiments, compared with the KEA algorithm, the KEA++ algorithm, and the four-feature extraction algorithm, the WPR-TOC algorithm improves precision by 40%, 30%, and 10% respectively, and increases recall by 40%, 30%, and 10% respectively.
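The abstract names four features (word frequency, word length, word position, and association between words) but does not give the formulas, so the following is only a rough sketch of how such a combination could look, not the WPR-TOC algorithm itself. Everything in it is an assumption made for illustration: the function names `score_keywords` and `precision_recall`, the weights, the use of relative word length in place of the paper's word length priority ratio, first-occurrence position as the position feature, and sentence-level co-occurrence as the association measure. The second function shows the standard set-based precision and recall that an evaluation of this kind typically reports.

```python
import re
from collections import Counter
from itertools import combinations


def tokenize(text):
    """Lowercase the text and split it into sentences of alphabetic tokens."""
    sentences = re.split(r"[.!?]+", text.lower())
    return [re.findall(r"[a-z]+", s) for s in sentences if s.strip()]


def score_keywords(text, weights=(0.4, 0.2, 0.2, 0.2), stopwords=frozenset()):
    """Rank words by a weighted sum of four normalized features.

    The weights and feature definitions are hypothetical placeholders,
    not the formulas from the paper.
    """
    sentences = [[w for w in s if w not in stopwords] for s in tokenize(text)]
    words = [w for s in sentences for w in s]
    if not words:
        return []

    freq = Counter(words)

    # Position feature: index of the first occurrence of each word.
    first_pos = {}
    for i, w in enumerate(words):
        first_pos.setdefault(w, i)

    # Association feature: distinct words sharing a sentence with w.
    neighbours = {w: set() for w in freq}
    for s in sentences:
        for a, b in combinations(set(s), 2):
            neighbours[a].add(b)
            neighbours[b].add(a)

    max_freq = max(freq.values())
    max_len = max(len(w) for w in freq)
    max_assoc = max(len(nb) for nb in neighbours.values()) or 1
    total = len(words)

    w_freq, w_len, w_pos, w_assoc = weights
    scores = {}
    for w in freq:
        f = freq[w] / max_freq               # relative frequency
        l = len(w) / max_len                 # relative length (stand-in only)
        p = 1.0 - first_pos[w] / total       # earlier words score higher
        a = len(neighbours[w]) / max_assoc   # relative co-occurrence breadth
        scores[w] = w_freq * f + w_len * l + w_pos * p + w_assoc * a

    # Highest score first; ties broken alphabetically for determinism.
    return sorted(scores, key=lambda w: (-scores[w], w))


def precision_recall(predicted, gold):
    """Set-based precision and recall of predicted keywords against gold keywords."""
    predicted, gold = set(predicted), set(gold)
    hits = len(predicted & gold)
    precision = hits / len(predicted) if predicted else 0.0
    recall = hits / len(gold) if gold else 0.0
    return precision, recall
```

A typical use would be to take the top-k words from `score_keywords` and pass them, together with author-assigned keywords, to `precision_recall`. A realistic system would also need phrase candidates, stemming, and a stopword list, none of which this sketch attempts.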