
arXiv - CS - Digital Libraries: Latest Publications

Science cited in policy documents: Evidence from the Overton database
Pub Date : 2024-07-13 DOI: arxiv-2407.09854
Zhichao Fang, Jonathan Dudek, Ed Noyons, Rodrigo Costas
To reflect the extent to which science is cited in policy documents, this paper explores the presence of policy document citations for over 18 million Web of Science-indexed publications published between 2010 and 2019. Enabled by the policy document citation data provided by Overton, a searchable index of policy documents worldwide, the results show that 3.9% of publications in the dataset are cited at least once by policy documents. Policy document citations exhibit a citation delay for newly published publications and are more prevalent for the document types review and article. Based on the Overton database, publications in the field of Social Sciences and Humanities have the highest relative presence in policy document citations, followed by Life and Earth Sciences and Biomedical and Health Sciences. Our findings shed light not only on the impact of scientific knowledge on the policy-making process, but also on the particular focus of the policy documents indexed by Overton on specific research areas.
{"title":"Science cited in policy documents: Evidence from the Overton database","authors":"Zhichao Fang, Jonathan Dudek, Ed Noyons, Rodrigo Costas","doi":"arxiv-2407.09854","DOIUrl":"https://doi.org/arxiv-2407.09854","url":null,"abstract":"To reflect the extent to which science is cited in policy documents, this\u0000paper explores the presence of policy document citations for over 18 million\u0000Web of Science-indexed publications published between 2010 and 2019. Enabled by\u0000the policy document citation data provided by Overton, a searchable index of\u0000policy documents worldwide, the results show that there are 3.9% of\u0000publications in the dataset cited at least once by policy documents. Policy\u0000document citations present a citation delay towards newly published\u0000publications and show a stronger predominance to the document types of review\u0000and article. Based on the Overton database, publications in the field of Social\u0000Sciences and Humanities have the highest relative presence in policy document\u0000citations, followed by Life and Earth Sciences and Biomedical and Health\u0000Sciences. Our findings shed light not only on the impact of scientific\u0000knowledge on the policy-making process, but also on the particular focus of\u0000policy documents indexed by Overton on specific research areas.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721056","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
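The headline figure above — 3.9% of roughly 18 million publications cited at least once in policy documents — is a simple presence ratio. A minimal sketch of how such a share could be computed from per-publication policy citation counts (the function and its input are illustrative, not Overton's actual schema or the authors' code):

```python
def policy_citation_presence(policy_citation_counts):
    """Share of publications cited at least once by a policy document."""
    cited = sum(1 for c in policy_citation_counts if c >= 1)
    return cited / len(policy_citation_counts)

# Toy example: 4 publications, 1 of them cited by a policy document.
share = policy_citation_presence([0, 0, 3, 0])  # 0.25
```

At the scale of the paper, the same ratio would be computed over the full Web of Science-indexed set rather than a toy list.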
Cool URIs for FAIR Knowledge Graphs
Pub Date : 2024-07-12 DOI: arxiv-2407.09237
Andreas Thalhammer
This guide is for everyone who seeks advice for creating stable, secure, and persistent Uniform Resource Identifiers (URIs) in order to publish their data in accordance with the FAIR principles. The use case does not matter: it could range from publishing the results of a small research project to a large knowledge graph at a big corporation. The FAIR principles apply equally, which is why it is important to put extra thought into the URI selection process. The title aims to extend the tradition of "Cool URIs don't change" and "Cool URIs for the Semantic Web". Much has changed since the publication of these works, and we would like to revisit some of the principles. Many still hold today, some had to be reworked, and we could also identify new ones.
{"title":"Cool URIs for FAIR Knowledge Graphs","authors":"Andreas Thalhammer","doi":"arxiv-2407.09237","DOIUrl":"https://doi.org/arxiv-2407.09237","url":null,"abstract":"This guide is for everyone who seeks advice for creating stable, secure, and\u0000persistent Uniform Resource Identifiers (URIs) in order to publish their data\u0000in accordance to the FAIR principles. The use case does not matter. It could\u0000range from publishing the results of a small research project to a large\u0000knowledge graph at a big corporation. The FAIR principles apply equally and\u0000this is why it is important to put extra thought into the URI selection\u0000process. The title aims to extend the tradition of \"Cool URIs don't change\" and\u0000\"Cool URIs for the Semantic Web\". Much has changed since the publication of\u0000these works and we would like to revisit some of the principles. Many still\u0000hold today, some had to be reworked, and we could also identify new ones","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"65 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721058","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Structuring Authenticity Assessments on Historical Documents using LLMs
Pub Date : 2024-07-12 DOI: arxiv-2407.09290
Andrea Schimmenti, Valentina Pasqual, Francesca Tomasi, Fabio Vitali, Marieke van Erp
Given the wide use of forgery throughout history, scholars have been, and continuously are, engaged in assessing the authenticity of historical documents. However, online catalogues merely offer descriptive metadata for these documents, relegating discussions about their authenticity to free-text formats, making it difficult to study these assessments at scale. This study explores the generation of structured data about documents' authenticity assessment from natural language texts. Our pipeline exploits Large Language Models (LLMs) to select, extract, and classify relevant claims about the topic without the need for training, and Semantic Web technologies to structure and type-validate the LLM's results. The final output is a catalogue of documents whose authenticity has been debated, along with scholars' opinions on their authenticity. This process can serve as a valuable resource for integration into catalogues, allowing room for more intricate queries and analyses of the evolution of these debates over centuries.
{"title":"Structuring Authenticity Assessments on Historical Documents using LLMs","authors":"Andrea Schimmenti, Valentina Pasqual, Francesca Tomasi, Fabio Vitali, Marieke van Erp","doi":"arxiv-2407.09290","DOIUrl":"https://doi.org/arxiv-2407.09290","url":null,"abstract":"Given the wide use of forgery throughout history, scholars have and are\u0000continuously engaged in assessing the authenticity of historical documents.\u0000However, online catalogues merely offer descriptive metadata for these\u0000documents, relegating discussions about their authenticity to free-text\u0000formats, making it difficult to study these assessments at scale. This study\u0000explores the generation of structured data about documents' authenticity\u0000assessment from natural language texts. Our pipeline exploits Large Language\u0000Models (LLMs) to select, extract and classify relevant claims about the topic\u0000without the need for training, and Semantic Web technologies to structure and\u0000type-validate the LLM's results. The final output is a catalogue of documents\u0000whose authenticity has been debated, along with scholars' opinions on their\u0000authenticity. This process can serve as a valuable resource for integration\u0000into catalogues, allowing room for more intricate queries and analyses on the\u0000evolution of these debates over centuries.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"13 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141721057","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Advancing Manuscript Metadata: Work in Progress at the Jagiellonian University
Pub Date : 2024-07-09 DOI: arxiv-2407.06976
Luiz do Valle Miranda, Krzysztof Kutt, Grzegorz J. Nalepa
As part of ongoing research projects, three Jagiellonian University units -- the Jagiellonian University Museum, the Jagiellonian University Archives, and the Jagiellonian Library -- are collaborating to digitize cultural heritage documents, describe them in detail, and then integrate these descriptions into a linked data cloud. Achieving this goal requires, as a first step, the development of a metadata model that, firstly, complies with existing standards, secondly, allows interoperability with other systems, and thirdly, captures all the elements of description established by the curators of the collections. In this paper, we present a report on the current status of the work, in which we outline the most important requirements for the data model under development and then make a detailed comparison with the two standards that are most relevant from the point of view of the collections: the Europeana Data Model used in Europeana and the Encoded Archival Description used in Kalliope.
{"title":"Advancing Manuscript Metadata: Work in Progress at the Jagiellonian University","authors":"Luiz do Valle Miranda, Krzysztof Kutt, Grzegorz J. Nalepa","doi":"arxiv-2407.06976","DOIUrl":"https://doi.org/arxiv-2407.06976","url":null,"abstract":"As part of ongoing research projects, three Jagiellonian University units --\u0000the Jagiellonian University Museum, the Jagiellonian University Archives, and\u0000the Jagiellonian Library -- are collaborating to digitize cultural heritage\u0000documents, describe them in detail, and then integrate these descriptions into\u0000a linked data cloud. Achieving this goal requires, as a first step, the\u0000development of a metadata model that, on the one hand, complies with existing\u0000standards, on the other hand, allows interoperability with other systems, and\u0000on the third, captures all the elements of description established by the\u0000curators of the collections. In this paper, we present a report on the current\u0000status of the work, in which we outline the most important requirements for the\u0000data model under development and then make a detailed comparison with the two\u0000standards that are the most relevant from the point of view of collections:\u0000Europeana Data Model used in Europeana and Encoded Archival Description used in\u0000Kalliope.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"31 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking
Pub Date : 2024-07-08 DOI: arxiv-2407.06292
Pedro Ruas, Fernando Gallego, Francisco J. Veredas, Francisco M. Couto
State-of-the-art deep learning entity linking methods rely on extensive human-labelled data, which is costly to acquire. Current datasets are limited in size, leading to inadequate coverage of biomedical concepts and diminished performance when applied to new data. In this work, we propose to automatically generate data to create large-scale training datasets, which allows the exploration of approaches originally developed for the task of extreme multi-label ranking in the biomedical entity linking task. We propose the hybrid X-Linker pipeline, which includes different modules to link disease and chemical entity mentions to concepts in the MEDIC and the CTD-Chemical vocabularies, respectively. X-Linker was evaluated on several biomedical datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical, BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969, 0.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker demonstrated superior performance on three datasets: BC5CDR-Disease, NCBI-Disease, and BioRED-Chemical. In contrast, SapBERT outperformed X-Linker on the remaining three datasets. Both models rely only on the mention string for their operations. The source code of X-Linker and its associated data are publicly available for performing biomedical entity linking without requiring pre-labelled entities with identifiers from specific knowledge organization systems.
{"title":"Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking","authors":"Pedro Ruas, Fernando Gallego, Francisco J. Veredas, Francisco M. Couto","doi":"arxiv-2407.06292","DOIUrl":"https://doi.org/arxiv-2407.06292","url":null,"abstract":"State-of-the-art deep learning entity linking methods rely on extensive\u0000human-labelled data, which is costly to acquire. Current datasets are limited\u0000in size, leading to inadequate coverage of biomedical concepts and diminished\u0000performance when applied to new data. In this work, we propose to automatically\u0000generate data to create large-scale training datasets, which allows the\u0000exploration of approaches originally developed for the task of extreme\u0000multi-label ranking in the biomedical entity linking task. We propose the\u0000hybrid X-Linker pipeline that includes different modules to link disease and\u0000chemical entity mentions to concepts in the MEDIC and the CTD-Chemical\u0000vocabularies, respectively. X-Linker was evaluated on several biomedical\u0000datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical,\u0000BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969,\u00000.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker demonstrated\u0000superior performance in three datasets: BC5CDR-Disease, NCBI-Disease, and\u0000BioRED-Chemical. In contrast, SapBERT outperformed X-Linker in the remaining\u0000three datasets. Both models rely only on the mention string for their\u0000operations. 
The source code of X-Linker and its associated data are publicly\u0000available for performing biomedical entity linking without requiring\u0000pre-labelled entities with identifiers from specific knowledge organization\u0000systems.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"24 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-08","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141577011","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
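Top-1 accuracy, the metric reported above for X-Linker, is the fraction of entity mentions whose highest-ranked candidate matches the gold concept identifier. A minimal sketch of the computation (illustrative only, not the authors' evaluation code; the MeSH identifiers are made-up examples):

```python
def top1_accuracy(predictions, gold):
    """predictions: one ranked list of candidate concept IDs per mention;
    gold: the correct concept ID for each mention."""
    hits = sum(1 for ranked, g in zip(predictions, gold) if ranked and ranked[0] == g)
    return hits / len(gold)

preds = [["MESH:D003920", "MESH:D003924"],  # first mention: correct ID ranked first
         ["MESH:D001943"]]                  # second mention: wrong ID ranked first
gold = ["MESH:D003920", "MESH:D009369"]
acc = top1_accuracy(preds, gold)  # 0.5
```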
Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Pub Date : 2024-07-04 DOI: arxiv-2407.12838
Laura Manrique-Gómez, Tony Montes, Rubén Manrique
This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region. Second, it introduces a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is specifically applied to the newly created dataset.
{"title":"Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction","authors":"Laura Manrique-Gómez, Tony Montes, Rubén Manrique","doi":"arxiv-2407.12838","DOIUrl":"https://doi.org/arxiv-2407.12838","url":null,"abstract":"This paper presents two significant contributions: first, a novel dataset of\u000019th-century Latin American press texts, which addresses the lack of\u0000specialized corpora for historical and linguistic analysis in this region.\u0000Second, it introduces a framework for OCR error correction and linguistic\u0000surface form detection in digitized corpora, utilizing a Large Language Model.\u0000This framework is adaptable to various contexts and, in this paper, is\u0000specifically applied to the newly created dataset.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"37 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141737143","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
CiteAssist: A System for Automated Preprint Citation and BibTeX Generation
Pub Date : 2024-07-03 DOI: arxiv-2407.03192
Lars Benedikt Kaesberg, Terry Ruas, Jan Philip Wahle, Bela Gipp
We present CiteAssist, a system to automate the generation of BibTeX entries for preprints, streamlining the process of bibliographic annotation. Our system extracts metadata, such as author names, titles, publication dates, and keywords, to create standardized annotations within the document. CiteAssist automatically attaches the BibTeX citation to the end of a PDF and links it on the first page of the document, so other researchers gain immediate access to the correct citation of the article. This method promotes platform flexibility by ensuring that annotations remain accessible regardless of the repository used to publish or access the preprint. The annotations remain available even if the preprint is viewed outside CiteAssist. Additionally, the system adds relevant related papers, based on extracted keywords, to the preprint, providing researchers with additional publications besides those in related work for further reading. Researchers can enhance their preprint organization and reference management workflows through a free and publicly available web interface.
{"title":"CiteAssist: A System for Automated Preprint Citation and BibTeX Generation","authors":"Lars Benedikt Kaesberg, Terry Ruas, Jan Philip Wahle, Bela Gipp","doi":"arxiv-2407.03192","DOIUrl":"https://doi.org/arxiv-2407.03192","url":null,"abstract":"We present CiteAssist, a system to automate the generation of BibTeX entries\u0000for preprints, streamlining the process of bibliographic annotation. Our system\u0000extracts metadata, such as author names, titles, publication dates, and\u0000keywords, to create standardized annotations within the document. CiteAssist\u0000automatically attaches the BibTeX citation to the end of a PDF and links it on\u0000the first page of the document so other researchers gain immediate access to\u0000the correct citation of the article. This method promotes platform flexibility\u0000by ensuring that annotations remain accessible regardless of the repository\u0000used to publish or access the preprint. The annotations remain available even\u0000if the preprint is viewed externally to CiteAssist. Additionally, the system\u0000adds relevant related papers based on extracted keywords to the preprint,\u0000providing researchers with additional publications besides those in related\u0000work for further reading. Researchers can enhance their preprints organization\u0000and reference management workflows through a free and publicly available web\u0000interface.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"29 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141547361","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
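The core step CiteAssist automates — turning extracted metadata into a standardized BibTeX entry — can be sketched as follows (a simplified illustration under assumed metadata fields, not CiteAssist's actual implementation or entry format):

```python
def make_bibtex(meta):
    """Build a BibTeX @misc entry for an arXiv preprint from a metadata dict."""
    # Citation key: first author's surname + year, a common convention.
    key = meta["authors"][0].split()[-1].lower() + meta["year"]
    fields = {
        "title": meta["title"],
        "author": " and ".join(meta["authors"]),
        "year": meta["year"],
        "eprint": meta["arxiv_id"],
        "archivePrefix": "arXiv",
    }
    body = ",\n".join(f"  {k} = {{{v}}}" for k, v in fields.items())
    return f"@misc{{{key},\n{body}\n}}"

entry = make_bibtex({
    "title": "CiteAssist: A System for Automated Preprint Citation and BibTeX Generation",
    "authors": ["Lars Benedikt Kaesberg", "Terry Ruas", "Jan Philip Wahle", "Bela Gipp"],
    "year": "2024",
    "arxiv_id": "2407.03192",
})
```

The real system additionally embeds this entry into the PDF and links it from the first page, which a sketch like this does not attempt.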
Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence
Pub Date : 2024-06-28 DOI: arxiv-2407.00135
Mike Thelwall
This book critically analyses the value of citation data, altmetrics, and artificial intelligence for supporting the research evaluation of articles, scholars, departments, universities, countries, and funders. It introduces and discusses indicators that can support research evaluation and analyses their strengths and weaknesses, as well as the generic strengths and weaknesses of the use of indicators for research assessment. The book includes evidence of the comparative value of citations and altmetrics in all broad academic fields, primarily through comparisons against article-level human expert judgements from the UK Research Excellence Framework 2021. It also discusses the potential applications of traditional artificial intelligence and large language models for research evaluation, with large-scale evidence for the former. The book concludes that citation data can be informative and helpful in some research fields for some research evaluation purposes, but that indicators are never accurate enough to be described as research quality measures. It also argues that AI may be helpful in limited circumstances for some types of research evaluation.
{"title":"Quantitative Methods in Research Evaluation Citation Indicators, Altmetrics, and Artificial Intelligence","authors":"Mike Thelwall","doi":"arxiv-2407.00135","DOIUrl":"https://doi.org/arxiv-2407.00135","url":null,"abstract":"This book critically analyses the value of citation data, altmetrics, and\u0000artificial intelligence to support the research evaluation of articles,\u0000scholars, departments, universities, countries, and funders. It introduces and\u0000discusses indicators that can support research evaluation and analyses their\u0000strengths and weaknesses as well as the generic strengths and weaknesses of the\u0000use of indicators for research assessment. The book includes evidence of the\u0000comparative value of citations and altmetrics in all broad academic fields\u0000primarily through comparisons against article level human expert judgements\u0000from the UK Research Excellence Framework 2021. It also discusses the potential\u0000applications of traditional artificial intelligence and large language models\u0000for research evaluation, with large scale evidence for the former. The book\u0000concludes that citation data can be informative and helpful in some research\u0000fields for some research evaluation purposes but that indicators are never\u0000accurate enough to be described as research quality measures. 
It also argues\u0000that AI may be helpful in limited circumstances for some types of research\u0000evaluation.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"6 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia
Pub Date : 2024-06-27 DOI: arxiv-2406.19291
Natallia Kokash, Giovanni Colavizza
Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project that focuses on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in a cloud-based setting. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages, so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.
{"title":"Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia","authors":"Natallia Kokash, Giovanni Colavizza","doi":"arxiv-2406.19291","DOIUrl":"https://doi.org/arxiv-2406.19291","url":null,"abstract":"Wikipedia is an essential component of the open science ecosystem, yet it is\u0000poorly integrated with academic open science initiatives. Wikipedia Citations\u0000is a project that focuses on extracting and releasing comprehensive datasets of\u0000citations from Wikipedia. A total of 29.3 million citations were extracted from\u0000English Wikipedia in May 2020. Following this one-off research project, we\u0000designed a reproducible pipeline that can process any given Wikipedia dump in\u0000the cloud-based settings. To demonstrate its usability, we extracted 40.6\u0000million citations in February 2023 and 44.7 million citations in February 2024.\u0000Furthermore, we equipped the pipeline with an adapted Wikipedia citation\u0000template translation module to process multilingual Wikipedia articles in 15\u0000European languages so that they are parsed and mapped into a generic structured\u0000citation template. This paper presents our open-source software pipeline to\u0000retrieve, classify, and disambiguate citations on demand from a given Wikipedia\u0000dump.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"25 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141508297","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Citations: 0
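Citation extraction of this kind starts from the citation templates embedded in Wikipedia's wikitext. A minimal, regex-based sketch of pulling key-value fields out of a single `{{cite ...}}` template (the actual pipeline is far more robust, handling nested templates and many per-language template variants):

```python
import re

def parse_cite_template(wikitext):
    """Extract key=value fields from the first {{cite ...}} template found."""
    m = re.search(r"\{\{cite [^|}]+\|([^{}]*)\}\}", wikitext, re.IGNORECASE)
    if not m:
        return {}
    fields = {}
    for part in m.group(1).split("|"):
        if "=" in part:
            key, value = part.split("=", 1)
            fields[key.strip()] = value.strip()
    return fields

cite = "{{cite journal |title=Example article |journal=Nature |year=2020 |doi=10.1000/xyz}}"
fields = parse_cite_template(cite)
# fields["journal"] == "Nature", fields["doi"] == "10.1000/xyz"
```

Mapping fields like these from 15 languages' template vocabularies into one generic structured citation template is what the paper's translation module adds on top of basic parsing.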
Metrics to Detect Small-Scale and Large-Scale Citation Orchestration
Pub Date : 2024-06-27 DOI: arxiv-2406.19219
Iakovos Evdaimon, John P. A. Ioannidis, Giannis Nikolentzos, Michail Chatzianastasis, George Panagopoulos, Michalis Vazirgiannis
Citation counts and related metrics have pervasive uses and misuses in academia and research appraisal, serving as measures of scholarly influence and recognition. Hence, comprehending the citation patterns exhibited by authors is essential for assessing their research impact and contributions within their respective fields. Although the h-index, introduced by Hirsch in 2005, has emerged as a popular bibliometric indicator, it fails to account for the intricate relationships between authors and their citation patterns. This limitation becomes particularly relevant in cases where citations are strategically employed to boost the perceived influence of certain individuals or groups, a phenomenon that we term "orchestration". Orchestrated citations can introduce biases in citation rankings and therefore necessitate the identification of such patterns. Here, we use Scopus data to investigate orchestration of citations across all scientific disciplines. Orchestration could be small-scale, when the author him/herself and/or a small number of other authors use citations strategically to boost citation metrics like the h-index; or large-scale, where extensive collaborations among many co-authors lead to a high h-index for many or all of them. We propose three orchestration indicators: extremely low values in the ratio of citations over the square of the h-index (indicative of small-scale orchestration); an extremely small number of authors who can explain at least 50% of an author's total citations (indicative of either small-scale or large-scale orchestration); and an extremely large number of co-authors with more than 50 co-authored papers (indicative of large-scale orchestration). The distributions, potential thresholds based on 1% (and 5%) percentiles, and insights from these indicators are explored and put into perspective across science.
{"title":"Metrics to Detect Small-Scale and Large-Scale Citation Orchestration","authors":"Iakovos Evdaimon, John P. A. Ioannidis, Giannis Nikolentzos, Michail Chatzianastasis, George Panagopoulos, Michalis Vazirgiannis","doi":"arxiv-2406.19219","DOIUrl":"https://doi.org/arxiv-2406.19219","url":null,"abstract":"Citation counts and related metrics have pervasive uses and misuses in academia and research appraisal, serving as scholarly influence and recognition measures. Hence, comprehending the citation patterns exhibited by authors is essential for assessing their research impact and contributions within their respective fields. Although the h-index, introduced by Hirsch in 2005, has emerged as a popular bibliometric indicator, it fails to account for the intricate relationships between authors and their citation patterns. This limitation becomes particularly relevant in cases where citations are strategically employed to boost the perceived influence of certain individuals or groups, a phenomenon that we term \"orchestration\". Orchestrated citations can introduce biases in citation rankings and therefore necessitate the identification of such patterns. Here, we use Scopus data to investigate orchestration of citations across all scientific disciplines. Orchestration could be small-scale, when the author him/herself and/or a small number of other authors use citations strategically to boost citation metrics like h-index; or large-scale, where extensive collaborations among many co-authors lead to high h-index for many/all of them. We propose three orchestration indicators: extremely low values in the ratio of citations over the square of the h-index (indicative of small-scale orchestration); extremely small number of authors who can explain at least 50% of an author's total citations (indicative of either small-scale or large-scale orchestration); and extremely large number of co-authors with more than 50 co-authored papers (indicative of large-scale orchestration). The distributions, potential thresholds based on 1% (and 5%) percentiles, and insights from these indicators are explored and put into perspective across science.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"23 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141527515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
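The three orchestration indicators described in the abstract can be computed directly from per-author citation data. The sketch below is illustrative only: the input structures (lists of per-paper citation counts, dicts of per-citer and per-co-author counts) are assumptions for demonstration, not the authors' actual Scopus pipeline, and no thresholds from the paper are reproduced.

```python
# Illustrative sketch of the three orchestration indicators from the
# abstract above. Data structures here are hypothetical stand-ins for
# what would be extracted from a bibliographic database such as Scopus.

def h_index(citations):
    """h-index: largest h such that the author has h papers with >= h citations."""
    h = 0
    for i, c in enumerate(sorted(citations, reverse=True), start=1):
        if c >= i:
            h = i
    return h

def citations_over_h_squared(citations):
    """Indicator 1: total citations / h^2.
    Extremely low values are flagged as possible small-scale orchestration."""
    h = h_index(citations)
    return sum(citations) / (h * h) if h else float("inf")

def min_citers_for_half(citer_counts):
    """Indicator 2: smallest number of citing authors accounting for at least
    50% of the author's total citations (small values suggest orchestration)."""
    total = sum(citer_counts.values())
    running, n = 0, 0
    for c in sorted(citer_counts.values(), reverse=True):
        running += c
        n += 1
        if running >= total / 2:
            break
    return n

def heavy_coauthors(coauthor_paper_counts, threshold=50):
    """Indicator 3: number of co-authors with more than `threshold`
    co-authored papers (large values suggest large-scale orchestration)."""
    return sum(1 for c in coauthor_paper_counts.values() if c > threshold)
```

For example, an author whose papers have citation counts `[10, 8, 5, 4, 3]` has h-index 4 and a citations-over-h² ratio of 30/16; whether that ratio is "extremely low" would be judged against the 1% (or 5%) percentile of the discipline-wide distribution, as the paper proposes.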