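Comparing Research Topics through Metatags Analysis: A Multi-module Machine Algorithm Approaches Using Real World Data on Digital Humanities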

Journal of Scientometric Research · IF 0.6 · Q3, Information Science & Library Science · Pub Date: 2024-04-15 · DOI: 10.5530/jscires.13.1.5

Bhaskar Mukherjee, Debasis Majhi, Priya Tiwari, Saloni Chaudhary

Citations: 0

Abstract

The present study extracts, maps and compares the lexical and semantic similarity between author-provided keywords and machine-extracted terms and topics from the titles and abstracts of articles in an interdisciplinary field, ‘digital humanities’. Author-provided terms (keywords) were first extracted and mapped with visualization software such as Gephi, and were then compared with terms extracted from the titles and abstracts of the research articles through NLP-based statistical modules. The interdisciplinarity of significant topics was also measured with the Brillouin index. A set of 7483 articles on digital humanities and its associated fields, downloaded from the Scopus database, was used for this purpose. We observed that research on digital humanities is spread over a considerable number of concepts, such as ‘Industry 4.0’, ‘topic modelling’ and ‘open science’. Further, the machine algorithm-based extraction identified greater lexical similarity between author-provided keywords and title-extracted keywords than between author-provided keywords and abstract-extracted keywords. The Jaccard similarity of all author keywords with machine-extracted title keywords was 0.83, and the SBERT BiEncoder_score was 0.7374. The top research areas extracted from titles through an unsupervised term-extraction approach included topics such as ‘digital humanities approach’ and ‘digital humanities visualization’, indicating a strong connection to the discipline of digital humanities. The average interdisciplinarity index of the top significant topics ranged from 1.217 to 1.284, with the highest value for ‘computational digital humanities’. Because this study is based on real-world data, it helps show how far machine algorithm-based text extraction can support the information retrieval process.
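The two measures named in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline: the helper names (`normalize`, `jaccard`, `brillouin`), the lowercase-and-trim normalization, the natural-log base, and the sample terms and counts are all assumptions made for illustration. Jaccard similarity is the size of the intersection of two term sets divided by the size of their union, and the Brillouin index is computed here as H = (ln N! − Σ ln nᵢ!) / N over category counts nᵢ that sum to N.

```python
import math


def normalize(terms):
    """Lowercase and trim terms so trivial spelling variants compare equal."""
    return {t.strip().lower() for t in terms if t.strip()}


def jaccard(a, b):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two term collections."""
    set_a, set_b = normalize(a), normalize(b)
    union = set_a | set_b
    if not union:
        return 0.0
    return len(set_a & set_b) / len(union)


def brillouin(counts):
    """Brillouin index H = (ln N! - sum(ln n_i!)) / N for category counts."""
    counts = [c for c in counts if c > 0]
    total = sum(counts)
    if total == 0:
        return 0.0
    # math.lgamma(n + 1) equals ln(n!) and avoids computing huge factorials.
    log_total = math.lgamma(total + 1)
    log_parts = sum(math.lgamma(c + 1) for c in counts)
    return (log_total - log_parts) / total


# Hypothetical sample data, not taken from the study.
author_keywords = ["Digital Humanities", "topic modelling", "open science"]
title_terms = ["digital humanities", "open science", "visualization"]
print(jaccard(author_keywords, title_terms))  # 2 shared / 4 distinct = 0.5

# Hypothetical counts of one topic's articles across three subject areas.
print(brillouin([5, 3, 2]))
```

A topic whose articles all fall in one subject area scores 0 on this index, and the score rises as articles spread more evenly over more areas. The abstract does not state the log base or the categories used, so values from this sketch are not directly comparable with the reported 1.217 to 1.284 range.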

Source journal

Journal of Scientometric Research (Information Science & Library Science)

CiteScore

1.30

Self-citation rate

12.50%

Publications