Comparison of text-based and linked-based metrics in terms of estimating the similarity of articles
M. Goltaji, Javad Abbaspour, A. Jowkar, S. M. Fakhrahmad
Journal of Librarianship and Information Science (impact factor 1.4; JCR Q2, Information Science & Library Science)
Published: 2023-04-29
DOI: 10.1177/09610006231165759 (https://doi.org/10.1177/09610006231165759)
Citations: 0
Abstract
This study examines how well text-based metrics (Cosine and Lucene similarity), linked-based metrics (co-citation, bibliographic coupling, Amsler, PageRank, and HITS), and their combination estimate the similarity between articles. The experiments were conducted on a test collection of 26,262 articles from the PubMed Central Open Access Subset (PMC OAS) of CITREC, for which both linked-based metrics and the full text needed to calculate text-based metrics were available. Thirty articles were selected as primary articles, and the articles related to each of them were retrieved using the MeSH similarity metric. The similarity of the retrieved documents was then also extracted according to the text-based and linked-based metrics. Next, the text-based, linked-based, and hybrid metrics were entered into a generalized regression model to estimate article similarity and so determine the power of each metric; finally, the performance of the models was compared using mean squared error and correlation. In the text-based model, both Cosine and Lucene similarity were included. Among the linked-based metrics, HITS (hub), HITS (authority), PageRank, and co-citation had the highest power, in that order, whereas bibliographic coupling and Amsler did not enter the model. Overall, the comparison indicated that the linked-based model estimates similarity between articles better than the text-based model, and that combining text-based and linked-based metrics yields little further improvement in estimation power. Despite the importance and application of text-based and linked-based metrics for measuring article similarity, no previous study was found that examines their power, alone and in comparison with each other, in estimating the similarity of articles.
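As a rough illustration of some of the measures named in the abstract, the sketch below computes cosine similarity over raw term-frequency vectors, plus co-citation, bibliographic coupling, and an Amsler score on a tiny citation graph. It is not the authors' or CITREC's implementation: the toy graph, the function names, the whitespace tokenization, and the Jaccard-style Amsler normalization are all assumptions chosen for brevity, and PageRank, HITS, and Lucene scoring are left out.

```python
import math
from collections import Counter


def cosine_similarity(text_a, text_b):
    """Cosine of the angle between raw term-frequency vectors (no IDF weighting)."""
    tf_a = Counter(text_a.lower().split())
    tf_b = Counter(text_b.lower().split())
    dot = sum(tf_a[t] * tf_b[t] for t in tf_a.keys() & tf_b.keys())
    norm_a = math.sqrt(sum(v * v for v in tf_a.values()))
    norm_b = math.sqrt(sum(v * v for v in tf_b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical toy citation graph: paper -> set of papers it cites.
CITES = {
    "A": {"X", "Y"},
    "B": {"Y", "Z"},
    "C": {"A", "B"},
    "D": {"A", "B", "X"},
}


def references(paper):
    """Papers that `paper` cites."""
    return CITES.get(paper, set())


def cited_by(paper):
    """Papers that cite `paper`."""
    return {p for p, refs in CITES.items() if paper in refs}


def co_citation(a, b):
    """Number of papers that cite both a and b."""
    return len(cited_by(a) & cited_by(b))


def bibliographic_coupling(a, b):
    """Number of references shared by a and b."""
    return len(references(a) & references(b))


def amsler(a, b):
    """Assumed Jaccard-style Amsler score over citing and cited neighbours."""
    near_a = cited_by(a) | references(a)
    near_b = cited_by(b) | references(b)
    union = near_a | near_b
    return len(near_a & near_b) / len(union) if union else 0.0


if __name__ == "__main__":
    print(cosine_similarity("gene expression in cancer", "gene expression in yeast"))
    print(co_citation("A", "B"), bibliographic_coupling("A", "B"), amsler("A", "B"))
```

In this toy graph, papers A and B are cited together by C and D and share the single reference Y, so the three linked-based scores can be checked by hand. In a study like the one described, such scores would then serve as predictors in the regression models that are compared by mean squared error and correlation.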
About the journal:
Journal of Librarianship and Information Science is the peer-reviewed international quarterly journal for librarians, information scientists, specialists, managers and educators interested in keeping up to date with the most recent issues and developments in the field. The Journal provides a forum for the publication of research and practical developments as well as for discussion papers and viewpoints on topical concerns in a profession facing many challenges.