Science cited in policy documents: Evidence from the Overton database
Zhichao Fang, Jonathan Dudek, Ed Noyons, Rodrigo Costas
arXiv:2407.09854 (2024-07-13)

To reflect the extent to which science is cited in policy documents, this paper explores the presence of policy document citations for over 18 million Web of Science-indexed publications published between 2010 and 2019. Drawing on the policy document citation data provided by Overton, a searchable index of policy documents worldwide, we find that 3.9% of the publications in the dataset were cited at least once by policy documents. Policy document citations accrue with a delay for newly published work and are concentrated in the document types review and article. Based on the Overton database, publications in the Social Sciences and Humanities have the highest relative presence in policy document citations, followed by the Life and Earth Sciences and the Biomedical and Health Sciences. Our findings shed light not only on the impact of scientific knowledge on the policy-making process, but also on the particular focus of the policy documents indexed by Overton on specific research areas.
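The headline figure (3.9% of publications cited at least once by policy documents) is a simple share over per-publication citation counts. A minimal sketch, using invented toy counts rather than the actual Web of Science/Overton data:

```python
# Illustrative computation of the share of publications cited at least
# once by policy documents. The counts below are made-up toy data, not
# the actual 18-million-publication dataset from the paper.

def policy_cited_share(policy_citation_counts):
    """Fraction of publications with at least one policy document citation."""
    total = len(policy_citation_counts)
    if total == 0:
        return 0.0
    cited = sum(1 for c in policy_citation_counts if c > 0)
    return cited / total

counts = [0, 0, 2, 0, 1, 0, 0, 0, 0, 0]  # toy data: 2 of 10 cited
print(policy_cited_share(counts))  # 0.2
```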
Cool URIs for FAIR Knowledge Graphs
Andreas Thalhammer
arXiv:2407.09237 (2024-07-12)

This guide is for everyone who seeks advice on creating stable, secure, and persistent Uniform Resource Identifiers (URIs) in order to publish their data in accordance with the FAIR principles. The use case does not matter: it could range from publishing the results of a small research project to a large knowledge graph at a big corporation. The FAIR principles apply equally, which is why it is important to put extra thought into the URI selection process. The title aims to extend the tradition of "Cool URIs don't change" and "Cool URIs for the Semantic Web". Much has changed since the publication of these works, and we would like to revisit some of their principles. Many still hold today, some had to be reworked, and we could also identify new ones.
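The guide's central concern, stable URIs that outlive their hosting, can be sketched minimally: publish an opaque, technology-neutral identifier and let a redirect mapping absorb any change of serving location. The domains and paths below are invented for illustration, not recommendations from the guide:

```python
# Hypothetical sketch of a "cool URI" scheme: the published identifier is
# opaque and stable, while the actual hosting location can change behind a
# redirect. Domains and paths are invented for illustration.

STABLE_BASE = "https://id.example.org/dataset/"   # never changes
CURRENT_HOST = "https://data-v2.example.org/kg/"  # may change over time

def mint_uri(local_id: str) -> str:
    """Mint a persistent URI: no file extension, no technology hint."""
    return STABLE_BASE + local_id

def resolve(uri: str) -> str:
    """Redirect target: only this mapping is updated when hosting moves."""
    return uri.replace(STABLE_BASE, CURRENT_HOST)

uri = mint_uri("abc123")
print(uri)           # https://id.example.org/dataset/abc123
print(resolve(uri))  # https://data-v2.example.org/kg/abc123
```

The point of the separation is that consumers only ever cite the stable base; the `resolve` mapping is the single place that changes when infrastructure does.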
Structuring Authenticity Assessments on Historical Documents using LLMs
Andrea Schimmenti, Valentina Pasqual, Francesca Tomasi, Fabio Vitali, Marieke van Erp
arXiv:2407.09290 (2024-07-12)

Given the wide use of forgery throughout history, scholars have long been, and continue to be, engaged in assessing the authenticity of historical documents. However, online catalogues merely offer descriptive metadata for these documents, relegating discussions about their authenticity to free-text formats and making it difficult to study these assessments at scale. This study explores the generation of structured data about documents' authenticity assessments from natural language texts. Our pipeline exploits Large Language Models (LLMs) to select, extract, and classify relevant claims about the topic without the need for training, and Semantic Web technologies to structure and type-validate the LLM's results. The final output is a catalogue of documents whose authenticity has been debated, along with scholars' opinions on their authenticity. This resource can be integrated into catalogues, allowing room for more intricate queries and analyses of the evolution of these debates over the centuries.
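The two-stage shape of such a pipeline, LLM extraction followed by type validation before anything enters the catalogue, can be sketched as follows. The LLM call is stubbed, and the schema and field names are invented for illustration (they are not the paper's actual data model):

```python
# Hedged sketch of the pipeline shape described above: an LLM extracts a
# structured claim about a document's authenticity, and the result is
# type- and value-validated before entering a catalogue. The LLM call is
# stubbed; schema and field names are invented.

SCHEMA = {"document": str, "scholar": str, "assessment": str}
ALLOWED_ASSESSMENTS = {"authentic", "forgery", "disputed"}

def llm_extract(text: str) -> dict:
    """Stand-in for an LLM extraction call returning a structured claim."""
    return {"document": "Donation of Constantine",
            "scholar": "Lorenzo Valla",
            "assessment": "forgery"}

def validate(claim: dict) -> bool:
    """Check field types and allowed values before catalogue ingestion."""
    for field, ftype in SCHEMA.items():
        if not isinstance(claim.get(field), ftype):
            return False
    return claim["assessment"] in ALLOWED_ASSESSMENTS

claim = llm_extract("Valla argued the Donation of Constantine was forged.")
print(validate(claim))  # True
```

Validating against a closed schema is what lets free-text scholarly opinions become queryable catalogue records.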
Advancing Manuscript Metadata: Work in Progress at the Jagiellonian University
Luiz do Valle Miranda, Krzysztof Kutt, Grzegorz J. Nalepa
arXiv:2407.06976 (2024-07-09)

As part of ongoing research projects, three Jagiellonian University units -- the Jagiellonian University Museum, the Jagiellonian University Archives, and the Jagiellonian Library -- are collaborating to digitize cultural heritage documents, describe them in detail, and then integrate these descriptions into a linked data cloud. Achieving this goal requires, as a first step, the development of a metadata model that complies with existing standards, allows interoperability with other systems, and captures all the elements of description established by the curators of the collections. In this paper, we report on the current status of the work: we outline the most important requirements for the data model under development and then make a detailed comparison with the two standards most relevant from the point of view of the collections: the Europeana Data Model used in Europeana and the Encoded Archival Description used in Kalliope.
Hybrid X-Linker: Automated Data Generation and Extreme Multi-label Ranking for Biomedical Entity Linking
Pedro Ruas, Fernando Gallego, Francisco J. Veredas, Francisco M. Couto
arXiv:2407.06292 (2024-07-08)

State-of-the-art deep learning entity linking methods rely on extensive human-labelled data, which is costly to acquire. Current datasets are limited in size, leading to inadequate coverage of biomedical concepts and diminished performance when applied to new data. In this work, we propose to automatically generate data to create large-scale training datasets, which allows the exploration of approaches originally developed for extreme multi-label ranking in the biomedical entity linking task. We propose the hybrid X-Linker pipeline, which includes different modules to link disease and chemical entity mentions to concepts in the MEDIC and CTD-Chemical vocabularies, respectively. X-Linker was evaluated on several biomedical datasets: BC5CDR-Disease, BioRED-Disease, NCBI-Disease, BC5CDR-Chemical, BioRED-Chemical, and NLM-Chem, achieving top-1 accuracies of 0.8307, 0.7969, 0.8271, 0.9511, 0.9248, and 0.7895, respectively. X-Linker demonstrated superior performance on three datasets (BC5CDR-Disease, NCBI-Disease, and BioRED-Chemical), while SapBERT outperformed X-Linker on the remaining three. Both models rely only on the mention string for their operations. The source code of X-Linker and its associated data are publicly available for performing biomedical entity linking without requiring pre-labelled entities with identifiers from specific knowledge organization systems.
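The core task, mapping a raw mention string to a vocabulary concept, can be illustrated with a toy ranker. Real systems like X-Linker and SapBERT use learned representations; the token-overlap score and the tiny vocabulary below (with MeSH-style identifiers) are stand-ins for illustration only:

```python
# Toy illustration of mention-string entity linking: rank vocabulary
# concepts against a mention with a simple token-overlap (Jaccard) score.
# The vocabulary is a tiny invented sample with MeSH-style IDs; real
# pipelines rank over entire vocabularies with learned scorers.

VOCAB = {
    "D003924": "diabetes mellitus, type 2",
    "D001943": "breast neoplasms",
    "D006973": "hypertension",
}

def tokens(text: str) -> set:
    return {t.strip(",.").lower() for t in text.split()}

def jaccard(a: set, b: set) -> float:
    return len(a & b) / len(a | b)

def link(mention: str) -> str:
    """Return the vocabulary ID whose term best matches the mention."""
    m = tokens(mention)
    return max(VOCAB, key=lambda cid: jaccard(m, tokens(VOCAB[cid])))

print(link("type 2 diabetes"))  # D003924
```

Top-1 accuracy, the metric reported above, is then simply the fraction of mentions whose best-ranked concept is the gold one.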
Historical Ink: 19th Century Latin American Spanish Newspaper Corpus with LLM OCR Correction
Laura Manrique-Gómez, Tony Montes, Rubén Manrique
arXiv:2407.12838 (2024-07-04)

This paper presents two significant contributions: first, a novel dataset of 19th-century Latin American press texts, which addresses the lack of specialized corpora for historical and linguistic analysis in this region; second, a framework for OCR error correction and linguistic surface form detection in digitized corpora, utilizing a Large Language Model. This framework is adaptable to various contexts and, in this paper, is applied specifically to the newly created dataset.
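OCR post-correction of the kind this framework performs can be sketched with the model call stubbed out. In a real pipeline the correction step would be an LLM prompted to fix recognition errors while preserving historical spellings; here a lookup of a few classic OCR confusions stands in so the example runs standalone, and the sample text is invented:

```python
# Minimal sketch of OCR post-correction. The "model" is stubbed with a
# handful of classic OCR confusions (rn->m, 1->l, vv->w); a real pipeline
# would prompt an LLM instead. Sample text is invented.

OCR_CONFUSIONS = {"rn": "m", "vv": "w", "1os": "los"}  # stand-in model

def correct_ocr(text: str) -> str:
    """Stub for an LLM call: fix OCR errors, keep historical spelling."""
    for wrong, right in OCR_CONFUSIONS.items():
        text = text.replace(wrong, right)
    return text

noisy = "En 1os dias de la rnadre patria"
print(correct_ocr(noisy))  # En los dias de la madre patria
```

Note that naive substring replacement would also corrupt legitimate "rn" sequences; this is exactly the context sensitivity that motivates using an LLM rather than rules.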
CiteAssist: A System for Automated Preprint Citation and BibTeX Generation
Lars Benedikt Kaesberg, Terry Ruas, Jan Philip Wahle, Bela Gipp
arXiv:2407.03192 (2024-07-03)

We present CiteAssist, a system that automates the generation of BibTeX entries for preprints, streamlining the process of bibliographic annotation. Our system extracts metadata, such as author names, titles, publication dates, and keywords, to create standardized annotations within the document. CiteAssist automatically attaches the BibTeX citation to the end of a PDF and links it on the first page of the document, so other researchers gain immediate access to the correct citation of the article. This method promotes platform flexibility by ensuring that annotations remain accessible regardless of the repository used to publish or access the preprint, even if the preprint is viewed outside of CiteAssist. Additionally, the system adds related papers, selected using the extracted keywords, to the preprint, providing researchers with further reading beyond the related-work section. Researchers can enhance their preprints' organization and reference-management workflows through a free and publicly available web interface.
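The metadata-to-BibTeX step can be sketched in a few lines. The `@misc` entry type, field selection, and citation-key scheme below are illustrative choices, not necessarily the ones CiteAssist uses:

```python
# Sketch of BibTeX generation from extracted preprint metadata. Entry
# type, fields, and key scheme (surname + year) are illustrative.

def make_bibtex(meta: dict) -> str:
    """Render a standardized @misc BibTeX entry from a metadata dict."""
    key = meta["authors"][0].split()[-1].lower() + str(meta["year"])
    return (f"@misc{{{key},\n"
            f"  title  = {{{meta['title']}}},\n"
            f"  author = {{{' and '.join(meta['authors'])}}},\n"
            f"  year   = {{{meta['year']}}},\n"
            f"  eprint = {{{meta['eprint']}}}\n"
            f"}}")

meta = {"title": "CiteAssist: A System for Automated Preprint Citation "
                 "and BibTeX Generation",
        "authors": ["Lars Benedikt Kaesberg", "Terry Ruas",
                    "Jan Philip Wahle", "Bela Gipp"],
        "year": 2024,
        "eprint": "2407.03192"}
print(make_bibtex(meta))
```

Attaching such a rendered block to the PDF itself is what keeps the citation available regardless of which repository serves the preprint.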
Quantitative Methods in Research Evaluation: Citation Indicators, Altmetrics, and Artificial Intelligence
Mike Thelwall
arXiv:2407.00135 (2024-06-28)

This book critically analyses the value of citation data, altmetrics, and artificial intelligence for supporting the research evaluation of articles, scholars, departments, universities, countries, and funders. It introduces and discusses indicators that can support research evaluation, analysing their strengths and weaknesses as well as the generic strengths and weaknesses of using indicators for research assessment. The book includes evidence of the comparative value of citations and altmetrics across all broad academic fields, primarily through comparisons against article-level human expert judgements from the UK Research Excellence Framework 2021. It also discusses the potential applications of traditional artificial intelligence and large language models for research evaluation, with large-scale evidence for the former. The book concludes that citation data can be informative and helpful in some research fields for some research evaluation purposes, but that indicators are never accurate enough to be described as research quality measures. It also argues that AI may be helpful, in limited circumstances, for some types of research evaluation.
Wikipedia Citations: Reproducible Citation Extraction from Multilingual Wikipedia
Natallia Kokash, Giovanni Colavizza
arXiv:2406.19291 (2024-06-27)

Wikipedia is an essential component of the open science ecosystem, yet it is poorly integrated with academic open science initiatives. Wikipedia Citations is a project focused on extracting and releasing comprehensive datasets of citations from Wikipedia. A total of 29.3 million citations were extracted from English Wikipedia in May 2020. Following this one-off research project, we designed a reproducible pipeline that can process any given Wikipedia dump in a cloud-based setting. To demonstrate its usability, we extracted 40.6 million citations in February 2023 and 44.7 million citations in February 2024. Furthermore, we equipped the pipeline with an adapted Wikipedia citation template translation module to process multilingual Wikipedia articles in 15 European languages, so that they are parsed and mapped into a generic structured citation template. This paper presents our open-source software pipeline to retrieve, classify, and disambiguate citations on demand from a given Wikipedia dump.
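The parsing step at the heart of such a pipeline, pulling citation templates out of raw wikitext, can be illustrated with a regex sketch. The real pipeline handles many template variants across 15 languages and nested markup; the pattern below covers only the simplest flat `{{cite ...}}` form:

```python
# Toy citation-template extraction from raw wikitext. Handles only flat
# {{cite <type> |key=value |...}} templates; real extraction must cope
# with nested templates and per-language template names.

import re

TEMPLATE = re.compile(r"\{\{cite \w+([^}]*)\}\}", re.IGNORECASE)

def extract_citations(wikitext: str) -> list[dict]:
    """Return one field dict per citation template found."""
    citations = []
    for match in TEMPLATE.finditer(wikitext):
        fields = {}
        for part in match.group(1).split("|"):
            if "=" in part:
                key, value = part.split("=", 1)
                fields[key.strip()] = value.strip()
        citations.append(fields)
    return citations

text = "Some claim.{{cite journal |title=Open Science |year=2020}} More text."
print(extract_citations(text))  # [{'title': 'Open Science', 'year': '2020'}]
```

Mapping the many language-specific template keys onto one generic structured template is then a translation table applied on top of this parse.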
Metrics to Detect Small-Scale and Large-Scale Citation Orchestration
Iakovos Evdaimon, John P. A. Ioannidis, Giannis Nikolentzos, Michail Chatzianastasis, George Panagopoulos, Michalis Vazirgiannis
arXiv:2406.19219 (2024-06-27)

Citation counts and related metrics are pervasively used, and misused, in academia and research appraisal as measures of scholarly influence and recognition. Hence, comprehending the citation patterns exhibited by authors is essential for assessing their research impact and contributions within their respective fields. Although the h-index, introduced by Hirsch in 2005, has emerged as a popular bibliometric indicator, it fails to account for the intricate relationships between authors and their citation patterns. This limitation becomes particularly relevant when citations are strategically employed to boost the perceived influence of certain individuals or groups, a phenomenon that we term "orchestration". Orchestrated citations can bias citation rankings and therefore need to be identified. Here, we use Scopus data to investigate the orchestration of citations across all scientific disciplines. Orchestration can be small-scale, when the authors themselves and/or a small number of other authors use citations strategically to boost citation metrics such as the h-index, or large-scale, when extensive collaborations among many co-authors lead to high h-indices for many or all of them. We propose three orchestration indicators: extremely low values of the ratio of citations over the square of the h-index (indicative of small-scale orchestration); an extremely small number of authors who can explain at least 50% of an author's total citations (indicative of either small-scale or large-scale orchestration); and an extremely large number of co-authors with more than 50 co-authored papers (indicative of large-scale orchestration). The distributions of these indicators, potential thresholds based on the 1% (and 5%) percentiles, and the insights they yield are explored and put into perspective across science.
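The three indicators translate directly into code. The profile data below is invented for illustration; in the paper, flagging thresholds come from the 1% (or 5%) percentiles of the empirical Scopus distributions, not from fixed values:

```python
# The three proposed orchestration indicators, sketched from the abstract.
# Input data is invented; real thresholds come from empirical percentiles.

def citations_over_h_squared(total_citations: int, h_index: int) -> float:
    """Indicator 1: extremely low values suggest small-scale orchestration."""
    return total_citations / (h_index ** 2)

def min_authors_for_half(citing_author_counts: dict) -> int:
    """Indicator 2: smallest number of citing authors that together account
    for at least 50% of an author's total citations."""
    total = sum(citing_author_counts.values())
    accumulated, n = 0, 0
    for count in sorted(citing_author_counts.values(), reverse=True):
        accumulated += count
        n += 1
        if accumulated >= total / 2:
            return n
    return n

def heavy_coauthors(coauthor_paper_counts: dict, threshold: int = 50) -> int:
    """Indicator 3: number of co-authors with more than `threshold`
    co-authored papers; extreme values suggest large-scale orchestration."""
    return sum(1 for c in coauthor_paper_counts.values() if c > threshold)

print(citations_over_h_squared(1200, 30))                              # ~1.33
print(min_authors_for_half({"A": 600, "B": 300, "C": 200, "D": 100}))  # 1
print(heavy_coauthors({"X": 80, "Y": 12, "Z": 55}))                    # 2
```

The second indicator uses a greedy cumulative sum over citing authors sorted by contribution, which by construction yields the minimum number of authors needed to cover half the citations.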