
Latest publications in arXiv - CS - Digital Libraries

Research Citations Building Trust in Wikipedia
Pub Date : 2024-09-18 DOI: arxiv-2409.11948
Michael Taylor, Carlos Areia, Kath Burton, Charles Watkinson
The use of Wikipedia citations in scholarly research has been the topic of much inquiry over the past decade. A cross-publisher study (Taylor & Francis and University of Michigan Press) convened by Digital Science was established in late 2022 to explore author sentiment towards Wikipedia as a trusted source of information. A short survey was designed to poll published authors about their views and uses of Wikipedia and to explore how the increased addition of research citations in Wikipedia might help combat misinformation in the context of increasing public engagement with, and access to, validated research sources. With 21,854 surveys sent, targeting 40,402 papers mentioned in Wikipedia, a total of 750 complete surveys from 60 countries were included in this analysis. In general, responses revealed a positive sentiment towards research citation in Wikipedia and towards researcher engagement practices. However, our sub-analysis revealed statistically significant differences when comparing articles vs. books and across disciplines, but not open vs. closed access. This study will open the door to further research and deepen our understanding of authors' perceived trustworthiness of the representation of their research in Wikipedia.
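The abstract reports statistically significant article-vs-book differences without publishing the underlying counts. As a minimal sketch of the kind of test involved, here is a Pearson chi-square of independence on a 2x2 table; the counts below are entirely hypothetical, invented only to illustrate the mechanics.

```python
# Sketch of a chi-square test of independence for a 2x2 contingency table.
# The counts are hypothetical; the paper does not publish its tables.

def chi_square_2x2(table):
    """Pearson chi-square statistic for a 2x2 table of observed counts."""
    (a, b), (c, d) = table
    n = a + b + c + d
    row_totals = [a + b, c + d]
    col_totals = [a + c, b + d]
    stat = 0.0
    for i, observed_row in enumerate(table):
        for j, observed in enumerate(observed_row):
            # expected count under independence of row and column factors
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    return stat

# Hypothetical sentiment counts: positive vs. other, article vs. book authors.
table = [[400, 100],
         [150, 100]]
print(round(chi_square_2x2(table), 2))  # compare against chi2 with df = 1
```

A statistic this far above the df=1 critical value (3.84 at p=0.05) is what "statistically significant difference" means operationally here.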
Citations: 0
Publishing Instincts: An Exploration-Exploitation Framework for Studying Academic Publishing Behavior and "Home Venues"
Pub Date : 2024-09-18 DOI: arxiv-2409.12158
Teddy Lazebnik, Shir Aviv-Reuven, Ariel Rosenfeld
Scholarly communication is vital to scientific advancement, enabling the exchange of ideas and knowledge. When selecting publication venues, scholars consider various factors, such as journal relevance, reputation, outreach, and editorial standards and practices. However, some of these factors are inconspicuous or inconsistent across venues and individual publications. This study proposes that scholars' decision-making process can be conceptualized and explored through the biologically inspired exploration-exploitation (EE) framework, which posits that scholars balance between familiar and under-explored publication venues. Building on the EE framework, we introduce a grounded definition for "Home Venues" (HVs), an informal concept used to describe the set of venues where a scholar consistently publishes, and investigate their emergence and key characteristics. Our analysis reveals that the publication patterns of roughly three-quarters of computer science scholars align with the expectations of the EE framework. For these scholars, HVs typically emerge and stabilize after approximately 15-20 publications. Additionally, scholars with higher h-indexes or a greater number of publications tend to have higher-ranking journals as their HVs.
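The EE framework is only described verbally in the abstract. As an illustration of the familiar-vs-under-explored trade-off it refers to, here is a toy epsilon-greedy venue chooser; the venue names, the payoff-free selection rule, and all parameters are invented for the sketch and are not the authors' model.

```python
import random

def choose_venue(history, venues, epsilon=0.2, rng=random):
    """With probability epsilon, explore a randomly chosen venue; otherwise
    exploit the venue published in most often so far (the emerging HV)."""
    if not history or rng.random() < epsilon:
        return rng.choice(venues)
    return max(venues, key=history.count)

# Simulate 20 publications; a dominant venue typically emerges, echoing the
# ~15-20 publication stabilization point reported in the abstract.
rng = random.Random(0)
venues = ["VenueA", "VenueB", "VenueC"]
history = []
for _ in range(20):
    history.append(choose_venue(history, venues, rng=rng))
home_venue = max(venues, key=history.count)
print(home_venue, history.count(home_venue))
```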
Citations: 0
Evaluating the Linguistic Coverage of OpenAlex: An Assessment of Metadata Accuracy and Completeness
Pub Date : 2024-09-16 DOI: arxiv-2409.10633
Lucía Céspedes, Diego Kozlowski, Carolina Pradier, Maxime Holmberg Sainte-Marie, Natsumi Solange Shokida, Pierre Benz, Constance Poitras, Anton Boudreau Ninkov, Saeideh Ebrahimy, Philips Ayeni, Sarra Filali, Bing Li, Vincent Larivière
Clarivate's Web of Science (WoS) and Elsevier's Scopus have for decades been the main sources of bibliometric information. Although highly curated, these closed, proprietary databases are largely biased towards English-language publications, underestimating the use of other languages in research dissemination. Launched in 2022, OpenAlex promised comprehensive, inclusive, and open-source research information. While it is already in use by scholars and research institutions, the quality of its metadata is currently being assessed. This paper contributes to this literature by assessing the completeness and accuracy of its language-related metadata, through a comparison with WoS as well as an in-depth manual validation of a sample of 6,836 articles. Results show that OpenAlex exhibits a far more balanced linguistic coverage than WoS. However, language metadata is not always accurate, which leads OpenAlex to overestimate the place of English while underestimating that of other languages. If used critically, OpenAlex can provide comprehensive and representative analyses of the languages used for scholarly publishing. However, more work is needed at the infrastructural level to ensure the quality of metadata on language.
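The OpenAlex works API exposes the `language` field the paper audits, and supports a `filter=language:<iso-code>` query parameter. A validation pass might sample works per claimed language and re-check them manually or with language detection. The helper below only builds the query URL (no network call); the audit logic itself is an assumption, not the paper's method.

```python
# Sketch: constructing OpenAlex API queries to sample works by claimed
# language, as a starting point for auditing the `language` metadata field.

OPENALEX_WORKS = "https://api.openalex.org/works"

def language_sample_url(iso_code, per_page=25):
    """URL for one page of works whose metadata claims the given language."""
    return f"{OPENALEX_WORKS}?filter=language:{iso_code}&per-page={per_page}"

print(language_sample_url("es"))
# A validation pass would fetch each page, run language detection on the
# title/abstract, and count disagreements with the stored metadata.
```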
Citations: 0
Towards understanding evolution of science through language model series
Pub Date : 2024-09-15 DOI: arxiv-2409.09636
Junjie Dong, Zhuoqi Lyu, Qing Ke
We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenization and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published up to 2008, together with a collection of models progressively trained on arXiv papers on an annual basis. We demonstrate the effectiveness of the AnnualBERT models by showing that they not only achieve comparable performance on standard tasks but also achieve state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then utilize probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of models is available at https://huggingface.co/jd445/AnnualBERTs.
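As a minimal illustration of the whole-word tokenization choice described above, the sketch below lowercases text, splits on word characters, and maps out-of-vocabulary words to an `[UNK]` token. This is a generic sketch, not the authors' preprocessing code; the regex and `[UNK]` convention are assumptions.

```python
import re

def whole_word_tokens(text, vocab=None):
    """Lowercased whole-word tokens; out-of-vocabulary words become [UNK].

    Contrast with subword tokenizers, which would split an unknown word
    into smaller known pieces instead of replacing it wholesale.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    if vocab is None:
        return words
    return [w if w in vocab else "[UNK]" for w in words]

vocab = {"language", "models", "evolve"}
print(whole_word_tokens("Language models evolve yearly", vocab))
# -> ['language', 'models', 'evolve', '[UNK]']
```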
Citations: 0
Ensuring Adherence to Standards in Experiment-Related Metadata Entered Via Spreadsheets
Pub Date : 2024-09-13 DOI: arxiv-2409.08897
Martin J. O'Connor, Josef Hardi, Marcos Martínez-Romero, Sowmya Somasundaram, Brendan Honick, Stephen A. Fisher, Ajay Pillai, Mark A. Musen
Scientists increasingly recognize the importance of providing rich, standards-adherent metadata to describe their experimental results. Despite the availability of sophisticated tools to assist in the process of data annotation, investigators generally seem to prefer to use spreadsheets when supplying metadata, despite the limitations of spreadsheets in ensuring metadata consistency and compliance with formal specifications. In this paper, we describe an end-to-end approach that supports spreadsheet-based entry of metadata, while ensuring rigorous adherence to community-based metadata standards and providing quality control. Our methods employ several key components, including customizable templates that capture metadata standards and that can inform the spreadsheets that investigators use to author metadata, controlled terminologies and ontologies for defining metadata values that can be accessed directly from a spreadsheet, and an interactive Web-based tool that allows users to rapidly identify and fix errors in their spreadsheet-based metadata. We demonstrate how this approach is being deployed in a biomedical consortium known as HuBMAP to define and collect metadata about a wide range of biological assays.
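The shape of such spreadsheet-level quality control, checking cell values against a controlled terminology, can be sketched as follows. The column name, allowed terms, and TSV layout are invented placeholders, not the actual HuBMAP templates or ontologies.

```python
import csv
import io

# Hypothetical controlled terminology; a real deployment would pull these
# values from ontologies referenced by a community metadata template.
ALLOWED = {"assay_type": {"RNA-seq", "ATAC-seq", "CODEX"}}

def validate_rows(tsv_text):
    """Return (row_number, column, bad_value) for each non-compliant cell."""
    errors = []
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    for i, row in enumerate(reader, start=2):  # row 1 is the header
        for column, allowed in ALLOWED.items():
            if row.get(column) not in allowed:
                errors.append((i, column, row.get(column)))
    return errors

sheet = "assay_type\tdonor_id\nRNA-seq\tD1\nRNAseq\tD2\n"
print(validate_rows(sheet))
# -> [(3, 'assay_type', 'RNAseq')]
```

An interactive tool like the one described would surface these tuples to the user as fixable errors instead of silently accepting the spreadsheet.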
Citations: 0
Intelligent Innovation Dataset on Scientific Research Outcomes and Patents
Pub Date : 2024-09-11 DOI: arxiv-2409.06936
Xinran Wu, Hui Zou, Yidan Xing, Jingjing Qu, Qiongxiu Li, Renxia Xue, Xiaoming Fu
Various stakeholders, such as researchers, government agencies, businesses, and laboratories, require reliable scientific research outcomes and patent data to support their work. These data are crucial for advancing scientific research, conducting business evaluations, and policy analysis. However, collecting such data is often a time-consuming and laborious task. Consequently, many users turn to openly accessible data for their research. However, these open data releases may suffer from a lack of relationships between different data sources or from limited temporal coverage. In this context, we present a new Intelligent Innovation Dataset (IIDS dataset), which comprises six inter-related datasets spanning nearly 120 years, encompassing paper information, paper citation relationships, patent details, patent legal statuses, funding information, and funding relationships. The extensive contextual and temporal coverage of the IIDS dataset will provide researchers with comprehensive data support, enabling them to delve into in-depth scientific research and conduct thorough data analysis.
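The value of inter-related datasets is that records can be joined across them, for instance linking papers to patents through a shared funding source. The toy tables below illustrate that idea only; every field name is hypothetical, and nothing here reflects the IIDS schema, which the abstract does not describe.

```python
# Hypothetical sketch of joining two of the six inter-related tables
# (papers and patents) through a shared funding identifier.

papers = [{"paper_id": "p1", "grant_id": "g1"},
          {"paper_id": "p2", "grant_id": "g2"}]
patents = [{"patent_id": "t1", "grant_id": "g1", "legal_status": "granted"}]

def patents_for_paper(paper_id):
    """Patents sharing a funding source with the given paper."""
    grants = {p["grant_id"] for p in papers if p["paper_id"] == paper_id}
    return [t["patent_id"] for t in patents if t["grant_id"] in grants]

print(patents_for_paper("p1"))
# -> ['t1']
```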
Citations: 0
An Evaluation of GPT-4V for Transcribing the Urban Renewal Hand-Written Collection
Pub Date : 2024-09-11 DOI: arxiv-2409.09090
Myeong Lee, Julia H. P. Hsu
Between 1960 and 1980, urban renewal transformed many cities, creating vast handwritten records. These documents posed a significant challenge for researchers due to their volume and handwritten nature. The launch of GPT-4V in November 2023 offered a breakthrough, enabling large-scale, efficient transcription and analysis of these historical urban renewal documents.
Citations: 0
The existence of stealth corrections in scientific literature -- a threat to scientific integrity
Pub Date : 2024-09-10 DOI: arxiv-2409.06852
Rene Aquarius, Floris Schoeters, Nick Wise, Alex Glynn, Guillaume Cabanac
Introduction: Thorough maintenance of the scientific record is needed to ensure the trustworthiness of its content. This can be undermined by a stealth correction: at least one post-publication change made to a scientific article without a correction note or any other indicator that the publication was temporarily or permanently altered. In this paper we provide several examples of stealth corrections in order to demonstrate that these exist within the scientific literature. As far as we are aware, no documentation of such stealth corrections has previously been reported in the scientific literature.

Methods: We identified stealth corrections ourselves, or found already-reported ones on the public database pubpeer.com or through social media accounts of known science sleuths.

Results: In total we report 131 articles that were affected by stealth corrections and were published between 2005 and 2024. These stealth corrections were found across multiple publishers and scientific fields.

Conclusion and recommendations: Stealth corrections exist in the scientific literature.
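The recommendations that follow amount to making every post-publication change publicly diffable. A minimal sketch of that kind of check, diffing two crawled snapshots of the same article with the standard library, is shown below; the snapshot text is invented.

```python
import difflib

def silent_changes(old, new):
    """Lines that differ between two snapshots of a published article."""
    diff = difflib.unified_diff(old.splitlines(), new.splitlines(),
                                lineterm="")
    # Keep only added/removed content lines, dropping the diff headers.
    return [line for line in diff
            if line.startswith(("+", "-"))
            and not line.startswith(("+++", "---"))]

v1 = "Methods\nWe enrolled 40 patients.\n"
v2 = "Methods\nWe enrolled 24 patients.\n"
print(silent_changes(v1, v2))
# -> ['-We enrolled 40 patients.', '+We enrolled 24 patients.']
```

A submission system that logged such diffs publicly, as recommendation 1 proposes, would make a change like this impossible to hide.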
Citations: 0
Fine-tuning and Prompt Engineering with Cognitive Knowledge Graphs for Scholarly Knowledge Organization
Pub Date : 2024-09-10 DOI: arxiv-2409.06433
Gollam Rabby, Sören Auer, Jennifer D'Souza, Allard Oelen
The increasing number of published scholarly articles, exceeding 2.5 million yearly, raises the challenge for researchers of following scientific progress. Integrating the contributions from scholarly articles into a novel type of cognitive knowledge graph (CKG) will be a crucial element for accessing and organizing scholarly knowledge, surpassing the insights provided by titles and abstracts. This research focuses on effectively conveying structured scholarly knowledge by utilizing large language models (LLMs) to categorize scholarly articles and describe their contributions in a structured and comparable manner. While previous studies explored language models within specific research domains, the extensive domain-independent knowledge captured by LLMs offers a substantial opportunity for generating structured contribution descriptions as CKGs. Additionally, LLMs offer customizable pathways through prompt engineering or fine-tuning, thus facilitating the use of smaller LLMs known for their efficiency, cost-effectiveness, and environmental considerations. Our methodology involves harnessing LLM knowledge and complementing it with domain-expert-verified scholarly data sourced from a CKG. This strategic fusion significantly enhances LLM performance, especially in tasks like scholarly article categorization and predicate recommendation. Our method involves fine-tuning LLMs with CKG knowledge and additionally injecting knowledge from a CKG with a novel prompting technique, significantly increasing the accuracy of scholarly knowledge extraction. We integrated our approach into the Open Research Knowledge Graph (ORKG), thus enabling precise access to organized scholarly knowledge, crucially benefiting domain-independent scholarly knowledge exchange and dissemination among policymakers, industrial practitioners, and the general public.
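The abstract does not spell out the novel prompting technique. As a loose sketch of the general idea of injecting CKG-verified facts into a prompt before categorization, the template and triples below are invented placeholders, not ORKG predicates or the authors' prompts.

```python
# Hypothetical sketch of a CKG-augmented prompt: verified triples are
# rendered as context ahead of the article to be categorized.

def ckg_prompt(title, abstract, ckg_triples):
    """Assemble a categorization prompt grounded in CKG triples."""
    facts = "\n".join(f"- {s} {p} {o}" for s, p, o in ckg_triples)
    return (
        "Classify the research field of the article and suggest predicates.\n"
        f"Verified knowledge-graph context:\n{facts}\n"
        f"Title: {title}\nAbstract: {abstract}\n"
    )

prompt = ckg_prompt(
    "An example article",
    "We study citation networks.",
    [("ExampleArticle", "hasResearchField", "Scientometrics")],
)
print(prompt)
```

The assembled string would then be sent to a fine-tuned or off-the-shelf LLM; the grounding triples are what the "strategic fusion" above contributes.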
The emergence of Large Language Models (LLM) as a tool in literature reviews: an LLM automated systematic review
Pub Date : 2024-09-06 DOI: arxiv-2409.04600
Dmitry Scherbakov, Nina Hubig, Vinita Jansari, Alexander Bakumenko, Leslie A. Lenert
Objective: This study aims to summarize the usage of Large Language Models (LLMs) in the process of creating a scientific review. We look at the range of stages in a review that can be automated and assess the current state-of-the-art research projects in the field. Materials and Methods: The search was conducted in June 2024 in the PubMed, Scopus, Dimensions, and Google Scholar databases by human reviewers. The screening and extraction process took place in Covidence with the help of an LLM add-on that uses the OpenAI gpt-4o model. ChatGPT was used to clean extracted data and generate code for figures in this manuscript; ChatGPT and Scite.ai were used in drafting all components of the manuscript except the methods and discussion sections. Results: 3,788 articles were retrieved, and 172 studies were deemed eligible for the final review. ChatGPT and GPT-based LLMs emerged as the most dominant architecture for review automation (n=126, 73.2%). A significant number of review automation projects were found, but only a limited number of papers (n=26, 15.1%) were actual reviews that used an LLM during their creation. Most citations focused on automation of a particular stage of review, such as searching for publications (n=60, 34.9%) and data extraction (n=54, 31.4%). When comparing the pooled performance of GPT-based and BERT-based models, the former were better at data extraction, with mean precision 83.0% (SD=10.4) and recall 86.0% (SD=9.8), while being slightly less accurate in the title and abstract screening stage (mean accuracy=77.3%, SD=13.0). Discussion/Conclusion: Our LLM-assisted systematic review revealed a significant number of research projects related to review automation using LLMs. The results looked promising, and we anticipate that LLMs will change in the near future the way scientific reviews are conducted.
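The pooled-performance metrics reported here (mean and SD of a score across studies) are straightforward to compute with a small helper. The per-study precision values below are invented placeholders chosen only to illustrate the calculation; they are not data from the review.

```python
import statistics

def pooled(scores: list[float]) -> tuple[float, float]:
    """Mean and sample standard deviation across per-study scores."""
    return statistics.mean(scores), statistics.stdev(scores)

# Hypothetical per-study precision values, for illustration only.
gpt_precision = [0.90, 0.75, 0.84, 0.83]
mean_p, sd_p = pooled(gpt_precision)
print(f"mean precision {mean_p:.1%} (SD={sd_p:.1%})")
```

Note that `statistics.stdev` is the sample (n-1) standard deviation; a review pooling a fixed set of studies might instead report the population SD (`statistics.pstdev`), so which variant a given paper uses is worth checking.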