Research Citations Building Trust in Wikipedia
Michael Taylor, Carlos Areia, Kath Burton, Charles Watkinson
The use of Wikipedia citations in scholarly research has been the topic of much inquiry over the past decade. A cross-publisher study (Taylor & Francis and University of Michigan Press) convened by Digital Science was established in late 2022 to explore author sentiment towards Wikipedia as a trusted source of information. A short survey was designed to poll published authors about their views and uses of Wikipedia and to explore how the increased addition of research citations in Wikipedia might help combat misinformation in the context of increasing public engagement with, and access to, validated research sources. With 21,854 surveys sent, targeting 40,402 papers mentioned in Wikipedia, a total of 750 complete surveys from 60 countries were included in this analysis. In general, responses revealed a positive sentiment towards the citation of research in Wikipedia and towards researcher engagement practices. However, our sub-analysis revealed statistically significant differences when comparing articles with books and across disciplines, but not between open and closed access. This study opens the door to further research and will deepen our understanding of authors' perceived trustworthiness of the representation of their research in Wikipedia.
{"title":"Research Citations Building Trust in Wikipedia","authors":"Michael Taylor, Carlos Areia, Kath Burton, Charles Watkinson","doi":"arxiv-2409.11948","DOIUrl":"https://doi.org/arxiv-2409.11948","url":null,"abstract":"The use of Wikipedia citations in scholarly research has been the topic of\u0000much inquiry over the past decade. A cross-publisher study (Taylor & Francis\u0000and University of Michigan Press) convened by Digital Science was established\u0000in late 2022 to explore author sentiment towards Wikipedia as a trusted source\u0000of information. A short survey was designed to poll published authors about\u0000views and uses of Wikipedia and explore how the increased addition of research\u0000citations in Wikipedia might help combat misinformation in the context of\u0000increasing public engagement with and access to validated research sources.\u0000With 21,854 surveys sent, targeting 40,402 papers mentioned in Wikipedia, a\u0000total of 750 complete surveys from 60 countries were included in this analysis.\u0000In general, responses revealed a positive sentiment towards research citation\u0000in Wikipedia and the researcher engagement practices. However, our sub analysis\u0000revealed statistically significant differences when comparison articles vs\u0000books and across disciplines, but not open vs closed access. This study will\u0000open the door to further research and deepen our understanding of authors\u0000perceived trustworthiness of the representation of their research in Wikipedia.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"1 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251647","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Publishing Instincts: An Exploration-Exploitation Framework for Studying Academic Publishing Behavior and "Home Venues"
Teddy Lazebnik, Shir Aviv-Reuven, Ariel Rosenfeld
Scholarly communication is vital to scientific advancement, enabling the exchange of ideas and knowledge. When selecting publication venues, scholars consider various factors, such as journal relevance, reputation, outreach, and editorial standards and practices. However, some of these factors are inconspicuous or inconsistent across venues and individual publications. This study proposes that scholars' decision-making process can be conceptualized and explored through the biologically inspired exploration-exploitation (EE) framework, which posits that scholars balance between familiar and under-explored publication venues. Building on the EE framework, we introduce a grounded definition for "Home Venues" (HVs) - an informal concept used to describe the set of venues where a scholar consistently publishes - and investigate their emergence and key characteristics. Our analysis reveals that the publication patterns of roughly three-quarters of computer science scholars align with the expectations of the EE framework. For these scholars, HVs typically emerge and stabilize after approximately 15-20 publications. Additionally, scholars with higher h-indexes or a greater number of publications tend to have higher-ranking journals as their HVs.
{"title":"Publishing Instincts: An Exploration-Exploitation Framework for Studying Academic Publishing Behavior and \"Home Venues\"","authors":"Teddy Lazebnik, Shir Aviv-Reuven, Ariel Rosenfeld","doi":"arxiv-2409.12158","DOIUrl":"https://doi.org/arxiv-2409.12158","url":null,"abstract":"Scholarly communication is vital to scientific advancement, enabling the\u0000exchange of ideas and knowledge. When selecting publication venues, scholars\u0000consider various factors, such as journal relevance, reputation, outreach, and\u0000editorial standards and practices. However, some of these factors are\u0000inconspicuous or inconsistent across venues and individual publications. This\u0000study proposes that scholars' decision-making process can be conceptualized and\u0000explored through the biologically inspired exploration-exploitation (EE)\u0000framework, which posits that scholars balance between familiar and\u0000under-explored publication venues. Building on the EE framework, we introduce a\u0000grounded definition for \"Home Venues\" (HVs) - an informal concept used to\u0000describe the set of venues where a scholar consistently publishes - and\u0000investigate their emergence and key characteristics. Our analysis reveals that\u0000the publication patterns of roughly three-quarters of computer science scholars\u0000align with the expectations of the EE framework. For these scholars, HVs\u0000typically emerge and stabilize after approximately 15-20 publications.\u0000Additionally, scholars with higher h-indexes or a greater number of\u0000publications, tend to have higher-ranking journals as their HVs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"51 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251645","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness
Lucía Céspedes, Diego Kozlowski, Carolina Pradier, Maxime Holmberg Sainte-Marie, Natsumi Solange Shokida, Pierre Benz, Constance Poitras, Anton Boudreau Ninkov, Saeideh Ebrahimy, Philips Ayeni, Sarra Filali, Bing Li, Vincent Larivière
Clarivate's Web of Science (WoS) and Elsevier's Scopus have for decades been the main sources of bibliometric information. Although highly curated, these closed, proprietary databases are largely biased towards English-language publications, underestimating the use of other languages in research dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive, and open-source research information. While it is already in use by scholars and research institutions, the quality of its metadata is currently being assessed. This paper contributes to this literature by assessing the completeness and accuracy of its language metadata through a comparison with WoS, as well as an in-depth manual validation of a sample of 6,836 articles. Results show that OpenAlex exhibits a far more balanced linguistic coverage than WoS. However, language metadata is not always accurate, which leads OpenAlex to overestimate the place of English while underestimating that of other languages. If used critically, OpenAlex can provide comprehensive and representative analyses of the languages used for scholarly publishing. However, more work is needed at the infrastructure level to ensure the quality of language metadata.
{"title":"Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness","authors":"Lucía Céspedes, Diego Kozlowski, Carolina Pradier, Maxime Holmberg Sainte-Marie, Natsumi Solange Shokida, Pierre Benz, Constance Poitras, Anton Boudreau Ninkov, Saeideh Ebrahimy, Philips Ayeni, Sarra Filali, Bing Li, Vincent Larivière","doi":"arxiv-2409.10633","DOIUrl":"https://doi.org/arxiv-2409.10633","url":null,"abstract":"Clarivate's Web of Science (WoS) and Elsevier's Scopus have been for decades\u0000the main sources of bibliometric information. Although highly curated, these\u0000closed, proprietary databases are largely biased towards English-language\u0000publications, underestimating the use of other languages in research\u0000dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive,\u0000and open-source research information. While already in use by scholars and\u0000research institutions, the quality of its metadata is currently being assessed.\u0000This paper contributes to this literature by assessing the completeness and\u0000accuracy of its metadata related to language, through a comparison with WoS, as\u0000well as an in-depth manual validation of a sample of 6,836 articles. Results\u0000show that OpenAlex exhibits a far more balanced linguistic coverage than WoS.\u0000However, language metadata is not always accurate, which leads OpenAlex to\u0000overestimate the place of English while underestimating that of other\u0000languages. If used critically, OpenAlex can provide comprehensive and\u0000representative analyses of languages used for scholarly publishing. However,\u0000more work is needed at infrastructural level to ensure the quality of metadata\u0000on language.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"22 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-16","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251648","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Towards understanding evolution of science through language model series
Junjie Dong, Zhuoqi Lyu, Qing Ke
We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenization and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published until 2008 and a collection of models progressively trained on arXiv papers on an annual basis. We demonstrate the effectiveness of the AnnualBERT models by showing that they not only achieve comparable performance on standard tasks but also reach state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then use probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of models is available at https://huggingface.co/jd445/AnnualBERTs.
{"title":"Towards understanding evolution of science through language model series","authors":"Junjie Dong, Zhuoqi Lyu, Qing Ke","doi":"arxiv-2409.09636","DOIUrl":"https://doi.org/arxiv-2409.09636","url":null,"abstract":"We introduce AnnualBERT, a series of language models designed specifically to\u0000capture the temporal evolution of scientific text. Deviating from the\u0000prevailing paradigms of subword tokenizations and \"one model to rule them all\",\u0000AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model\u0000pretrained from scratch on the full-text of 1.7 million arXiv papers published\u0000until 2008 and a collection of progressively trained models on arXiv papers at\u0000an annual basis. We demonstrate the effectiveness of AnnualBERT models by\u0000showing that they not only have comparable performances in standard tasks but\u0000also achieve state-of-the-art performances on domain-specific NLP tasks as well\u0000as link prediction tasks in the arXiv citation network. We then utilize probing\u0000tasks to quantify the models' behavior in terms of representation learning and\u0000forgetting as time progresses. Our approach enables the pretrained models to\u0000not only improve performances on scientific text processing tasks but also to\u0000provide insights into the development of scientific discourse over time. The\u0000series of the models is available at https://huggingface.co/jd445/AnnualBERTs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251649","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Ensuring Adherence to Standards in Experiment-Related Metadata Entered Via Spreadsheets
Martin J. O'Connor, Josef Hardi, Marcos Martínez-Romero, Sowmya Somasundaram, Brendan Honick, Stephen A. Fisher, Ajay Pillai, Mark A. Musen
Scientists increasingly recognize the importance of providing rich, standards-adherent metadata to describe their experimental results. Despite the availability of sophisticated tools to assist in the process of data annotation, investigators generally seem to prefer spreadsheets when supplying metadata, even though spreadsheets offer limited support for ensuring metadata consistency and compliance with formal specifications. In this paper, we describe an end-to-end approach that supports spreadsheet-based entry of metadata while ensuring rigorous adherence to community-based metadata standards and providing quality control. Our methods employ several key components: customizable templates that capture metadata standards and that can inform the spreadsheets investigators use to author metadata; controlled terminologies and ontologies for defining metadata values that can be accessed directly from a spreadsheet; and an interactive Web-based tool that allows users to rapidly identify and fix errors in their spreadsheet-based metadata. We demonstrate how this approach is being deployed in a biomedical consortium known as HuBMAP to define and collect metadata about a wide range of biological assays.
{"title":"Ensuring Adherence to Standards in Experiment-Related Metadata Entered Via Spreadsheets","authors":"Martin J. O'Connor, Josef Hardi, Marcos Martínez-Romero, Sowmya Somasundaram, Brendan Honick, Stephen A. Fisher, Ajay Pillai, Mark A. Musen","doi":"arxiv-2409.08897","DOIUrl":"https://doi.org/arxiv-2409.08897","url":null,"abstract":"Scientists increasingly recognize the importance of providing rich,\u0000standards-adherent metadata to describe their experimental results. Despite the\u0000availability of sophisticated tools to assist in the process of data\u0000annotation, investigators generally seem to prefer to use spreadsheets when\u0000supplying metadata, despite the limitations of spreadsheets in ensuring\u0000metadata consistency and compliance with formal specifications. In this paper,\u0000we describe an end-to-end approach that supports spreadsheet-based entry of\u0000metadata, while ensuring rigorous adherence to community-based metadata\u0000standards and providing quality control. Our methods employ several key\u0000components, including customizable templates that capture metadata standards\u0000and that can inform the spreadsheets that investigators use to author metadata,\u0000controlled terminologies and ontologies for defining metadata values that can\u0000be accessed directly from a spreadsheet, and an interactive Web-based tool that\u0000allows users to rapidly identify and fix errors in their spreadsheet-based\u0000metadata. We demonstrate how this approach is being deployed in a biomedical\u0000consortium known as HuBMAP to define and collect metadata about a wide range of\u0000biological assays.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"7 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251651","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Intelligent Innovation Dataset on Scientific Research Outcomes and Patents
Xinran Wu, Hui Zou, Yidan Xing, Jingjing Qu, Qiongxiu Li, Renxia Xue, Xiaoming Fu
Various stakeholders, such as researchers, government agencies, businesses, and laboratories, require reliable scientific research outcomes and patent data to support their work. These data are crucial for advancing scientific research, conducting business evaluations, and performing policy analysis. However, collecting such data is often a time-consuming and laborious task, so many users turn to openly accessible data for their research. These open data releases, however, may suffer from a lack of linkage between different data sources or from limited temporal coverage. In this context, we present a new Intelligent Innovation Dataset (IIDS dataset), which comprises six inter-related datasets spanning nearly 120 years, encompassing paper information, paper citation relationships, patent details, patent legal statuses, funding information, and funding relationships. The extensive contextual and temporal coverage of the IIDS dataset will provide researchers with comprehensive data support, enabling them to carry out in-depth scientific research and thorough data analysis.
{"title":"Intelligent Innovation Dataset on Scientific Research Outcomes and Patents","authors":"Xinran Wu, Hui Zou, Yidan Xing, Jingjing Qu, Qiongxiu Li, Renxia Xue, Xiaoming Fu","doi":"arxiv-2409.06936","DOIUrl":"https://doi.org/arxiv-2409.06936","url":null,"abstract":"Various stakeholders, such as researchers, government agencies, businesses,\u0000and laboratories require reliable scientific research outcomes and patent data\u0000to support their work. These data are crucial for advancing scientific\u0000research, conducting business evaluations, and policy analysis. However,\u0000collecting such data is often a time-consuming and laborious task.\u0000Consequently, many users turn to using openly accessible data for their\u0000research. However, these open data releases may suffer from lack of\u0000relationship between different data sources or limited temporal coverage. In\u0000this context, we present a new Intelligent Innovation Dataset (IIDS dataset),\u0000which comprises six inter-related datasets spanning nearly 120 years,\u0000encompassing paper information, paper citation relationships, patent details,\u0000patent legal statuses, funding information and funding relationship. The\u0000extensive contextual and extensive temporal coverage of the IIDS dataset will\u0000provide researchers with comprehensive data support, enabling them to delve\u0000into in-depth scientific research and conduct thorough data analysis.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"53 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219098","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
An Evaluation of GPT-4V for Transcribing the Urban Renewal Hand-Written Collection
Myeong Lee, Julia H. P. Hsu
Between 1960 and 1980, urban renewal transformed many cities, creating vast handwritten records. These documents posed a significant challenge for researchers due to their volume and handwritten nature. The launch of GPT-4V in November 2023 offered a breakthrough, enabling large-scale, efficient transcription and analysis of these historical urban renewal documents.
{"title":"An Evaluation of GPT-4V for Transcribing the Urban Renewal Hand-Written Collection","authors":"Myeong Lee, Julia H. P. Hsu","doi":"arxiv-2409.09090","DOIUrl":"https://doi.org/arxiv-2409.09090","url":null,"abstract":"Between 1960 and 1980, urban renewal transformed many cities, creating vast\u0000handwritten records. These documents posed a significant challenge for\u0000researchers due to their volume and handwritten nature. The launch of GPT-4V in\u0000November 2023 offered a breakthrough, enabling large-scale, efficient\u0000transcription and analysis of these historical urban renewal documents.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"56 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142251652","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The existence of stealth corrections in scientific literature -- a threat to scientific integrity
Rene Aquarius, Floris Schoeters, Nick Wise, Alex Glynn, Guillaume Cabanac
Introduction: Thorough maintenance of the scientific record is needed to ensure the trustworthiness of its content. This can be undermined by a stealth correction: at least one post-publication change made to a scientific article without a correction note or any other indicator that the publication was temporarily or permanently altered. In this paper we provide several examples of stealth corrections in order to demonstrate that they exist within the scientific literature. As far as we are aware, no documentation of such stealth corrections has previously been reported in the scientific literature. Methods: We identified stealth corrections ourselves, or found already-reported ones on the public database pubpeer.com or through social media accounts of known science sleuths. Results: In total we report 131 articles that were affected by stealth corrections and were published between 2005 and 2024. These stealth corrections were found across multiple publishers and scientific fields. Conclusion and recommendations: Stealth corrections exist in the scientific literature. This needs to end immediately as it threatens scientific integrity. We recommend the following: 1) tracking all changes to the published record by all publishers in an open, uniform and transparent manner, preferably through online submission systems that log every change publicly, making stealth corrections impossible; 2) clear definitions and guidelines on all types of corrections; 3) sustained vigilance by the scientific community, with public registration of stealth corrections.
{"title":"The existence of stealth corrections in scientific literature -- a threat to scientific integrity","authors":"Rene Aquarius, Floris Schoeters, Nick Wise, Alex Glynn, Guillaume Cabanac","doi":"arxiv-2409.06852","DOIUrl":"https://doi.org/arxiv-2409.06852","url":null,"abstract":"Introduction: Thorough maintenance of the scientific record is needed to\u0000ensure the trustworthiness of its content. This can be undermined by a stealth\u0000correction, which is at least one post-publication change made to a scientific\u0000article, without providing a correction note or any other indicator that the\u0000publication was temporarily or permanently altered. In this paper we provide\u0000several examples of stealth corrections in order to demonstrate that these\u0000exist within the scientific literature. As far as we are aware, no\u0000documentation of such stealth corrections was previously reported in the\u0000scientific literature. Methods: We identified stealth corrections ourselves, or found already\u0000reported ones on the public database pubpeer.com or through social media\u0000accounts of known science sleuths. Results: In total we report 131 articles that were affected by stealth\u0000corrections and were published between 2005 and 2024. These stealth corrections\u0000were found among multiple publishers and scientific fields. Conclusion: and recommendations Stealth corrections exist in the scientific\u0000literature. This needs to end immediately as it threatens scientific integrity.\u0000We recommend the following: 1) Tracking all changes to the published record by\u0000all publishers in an open, uniform and transparent manner, preferably by online\u0000submission systems that log every change publicly, making stealth corrections\u0000impossible; 2) Clear definitions and guidelines on all types of corrections; 3)\u0000Support sustained vigilance of the scientific community to publicly register\u0000stealth corrections.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"110 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219094","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fine-tuning and Prompt Engineering with Cognitive Knowledge Graphs for Scholarly Knowledge Organization
Gollam Rabby, Sören Auer, Jennifer D'Souza, Allard Oelen
The increasing number of published scholarly articles, exceeding 2.5 million yearly, makes it challenging for researchers to follow scientific progress. Integrating the contributions from scholarly articles into a novel type of cognitive knowledge graph (CKG) will be a crucial element for accessing and organizing scholarly knowledge, surpassing the insights provided by titles and abstracts. This research focuses on effectively conveying structured scholarly knowledge by utilizing large language models (LLMs) to categorize scholarly articles and describe their contributions in a structured and comparable manner. While previous studies explored language models within specific research domains, the extensive domain-independent knowledge captured by LLMs offers a substantial opportunity for generating structured contribution descriptions as CKGs. Additionally, LLMs offer customizable pathways through prompt engineering or fine-tuning, thus facilitating the use of smaller LLMs known for their efficiency, cost-effectiveness, and environmental considerations. Our methodology involves harnessing LLM knowledge and complementing it with domain-expert-verified scholarly data sourced from a CKG. This strategic fusion significantly enhances LLM performance, especially in tasks like scholarly article categorization and predicate recommendation. Our method involves fine-tuning LLMs with CKG knowledge and, additionally, injecting knowledge from a CKG with a novel prompting technique, significantly increasing the accuracy of scholarly knowledge extraction. We integrated our approach into the Open Research Knowledge Graph (ORKG), thus enabling precise access to organized scholarly knowledge, crucially benefiting domain-independent scholarly knowledge exchange and dissemination among policymakers, industrial practitioners, and the general public.
{"title":"Fine-tuning and Prompt Engineering with Cognitive Knowledge Graphs for Scholarly Knowledge Organization","authors":"Gollam Rabby, Sören Auer, Jennifer D'Souza, Allard Oelen","doi":"arxiv-2409.06433","DOIUrl":"https://doi.org/arxiv-2409.06433","url":null,"abstract":"The increasing amount of published scholarly articles, exceeding 2.5 million\u0000yearly, raises the challenge for researchers in following scientific progress.\u0000Integrating the contributions from scholarly articles into a novel type of\u0000cognitive knowledge graph (CKG) will be a crucial element for accessing and\u0000organizing scholarly knowledge, surpassing the insights provided by titles and\u0000abstracts. This research focuses on effectively conveying structured scholarly\u0000knowledge by utilizing large language models (LLMs) to categorize scholarly\u0000articles and describe their contributions in a structured and comparable\u0000manner. While previous studies explored language models within specific\u0000research domains, the extensive domain-independent knowledge captured by LLMs\u0000offers a substantial opportunity for generating structured contribution\u0000descriptions as CKGs. Additionally, LLMs offer customizable pathways through\u0000prompt engineering or fine-tuning, thus facilitating to leveraging of smaller\u0000LLMs known for their efficiency, cost-effectiveness, and environmental\u0000considerations. Our methodology involves harnessing LLM knowledge, and\u0000complementing it with domain expert-verified scholarly data sourced from a CKG.\u0000This strategic fusion significantly enhances LLM performance, especially in\u0000tasks like scholarly article categorization and predicate recommendation. Our\u0000method involves fine-tuning LLMs with CKG knowledge and additionally injecting\u0000knowledge from a CKG with a novel prompting technique significantly increasing\u0000the accuracy of scholarly knowledge extraction. We integrated our approach in\u0000the Open Research Knowledge Graph (ORKG), thus enabling precise access to\u0000organized scholarly knowledge, crucially benefiting domain-independent\u0000scholarly knowledge exchange and dissemination among policymakers, industrial\u0000practitioners, and the general public.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219104","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review
Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert
Objective: This study aims to summarize the use of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of review stages that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses OpenAI's gpt-4o model. ChatGPT was used to clean the extracted data and generate code for the figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. GPT-based LLMs, including ChatGPT, emerged as the dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most citations focused on the automation of a particular review stage, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision of 83.0% (SD=10.4) and recall of 86.0% (SD=9.8), while being slightly less accurate at the title and abstract screening stage (mean accuracy = 77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results look promising, and we anticipate that LLMs will, in the near future, change the way scientific reviews are conducted.
{"title":"The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review","authors":"Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert","doi":"arxiv-2409.04600","DOIUrl":"https://doi.org/arxiv-2409.04600","url":null,"abstract":"Objective: This study aims to summarize the usage of Large Language Models\u0000(LLMs) in the process of creating a scientific review. We look at the range of\u0000stages in a review that can be automated and assess the current\u0000state-of-the-art research projects in the field. Materials and Methods: The\u0000search was conducted in June 2024 in PubMed, Scopus, Dimensions, and Google\u0000Scholar databases by human reviewers. Screening and extraction process took\u0000place in Covidence with the help of LLM add-on which uses OpenAI gpt-4o model.\u0000ChatGPT was used to clean extracted data and generate code for figures in this\u0000manuscript, ChatGPT and Scite.ai were used in drafting all components of the\u0000manuscript, except the methods and discussion sections. Results: 3,788 articles\u0000were retrieved, and 172 studies were deemed eligible for the final review.\u0000ChatGPT and GPT-based LLM emerged as the most dominant architecture for review\u0000automation (n=126, 73.2%). A significant number of review automation projects\u0000were found, but only a limited number of papers (n=26, 15.1%) were actual\u0000reviews that used LLM during their creation. Most citations focused on\u0000automation of a particular stage of review, such as Searching for publications\u0000(n=60, 34.9%), and Data extraction (n=54, 31.4%). When comparing pooled\u0000performance of GPT-based and BERT-based models, the former were better in data\u0000extraction with mean precision 83.0% (SD=10.4), and recall 86.0% (SD=9.8),\u0000while being slightly less accurate in title and abstract screening stage\u0000(Maccuracy=77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic\u0000review revealed a significant number of research projects related to review\u0000automation using LLMs. The results looked promising, and we anticipate that\u0000LLMs will change in the near future the way the scientific reviews are\u0000conducted.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"8 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142219097","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}