Mladen Popović, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar
Determining the chronology of ancient handwritten manuscripts is essential for reconstructing the evolution of ideas. For the Dead Sea Scrolls, this is particularly important. However, there is an almost complete lack of date-bearing manuscripts evenly distributed across the timeline and written in similar scripts available for palaeographic comparison. Here, we present Enoch, a state-of-the-art AI-based date-prediction model, trained on the basis of new radiocarbon-dated samples of the scrolls. Enoch uses established handwriting-style descriptors and applies Bayesian ridge regression. The challenge of this study is that the number of radiocarbon-dated manuscripts is small, while current machine learning requires an abundance of training data. We show that by using combined angular and allographic writing style feature vectors and applying Bayesian ridge regression, Enoch could predict the radiocarbon-based dates from style, supported by leave-one-out validation, with mean absolute errors (MAEs) ranging from 27.9 to 30.7 years relative to the radiocarbon dating. Enoch was then used to estimate the dates of 135 unseen manuscripts; 79 per cent of these estimates were considered 'realistic' upon palaeographic post-hoc evaluation. We present a new chronology of the scrolls. The radiocarbon ranges and Enoch's style-based predictions are often older than the traditionally assumed palaeographic estimates. In the range of 300-50 BCE, Enoch's date prediction provides an improved granularity. The study is in line with current developments in multimodal machine-learning techniques, and the methods can be used for date prediction in other partially-dated manuscript collections. This research shows how Enoch's quantitative, probability-based approach can be a tool for palaeographers and historians, re-dating ancient Jewish key texts and contributing to current debates on Jewish and Christian origins.
{"title":"Dating ancient manuscripts using radiocarbon and AI-based writing style analysis","authors":"Mladen Popović, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar","doi":"arxiv-2407.12013","DOIUrl":"https://doi.org/arxiv-2407.12013","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-26"}
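The validation scheme described in the Enoch abstract above (fit on all radiocarbon-dated samples but one, predict the held-out date, average the absolute errors) can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: a single hypothetical style score stands in for the combined angular and allographic feature vectors, a closed-form ridge fit with a fixed penalty stands in for Bayesian ridge regression (which would estimate the penalty from the data), and all numbers are synthetic.

```python
from statistics import mean


def fit_ridge_1d(xs, ys, penalty=1.0):
    """Closed-form ridge fit for one feature; the intercept is not penalised."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    slope = sxy / (sxx + penalty)  # the penalty shrinks the slope towards zero
    return slope, y_bar - slope * x_bar


def leave_one_out_mae(xs, ys, penalty=1.0):
    """Hold out each sample once, fit on the rest, and average |prediction - target|."""
    errors = []
    for i in range(len(xs)):
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        slope, intercept = fit_ridge_1d(train_x, train_y, penalty)
        errors.append(abs(slope * xs[i] + intercept - ys[i]))
    return mean(errors)


# Synthetic stand-ins (assumptions): one style score per manuscript and a
# date in years, with negative values meaning BCE.
style_scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
dates = [-300.0, -250.0, -200.0, -150.0, -100.0, -50.0]

print("LOO MAE in years:", leave_one_out_mae(style_scores, dates, penalty=1.0))
```

With a penalty of zero and perfectly linear synthetic data the held-out error vanishes; a positive penalty shrinks the slope and the error grows. Bayesian ridge regression treats that penalty as something to be estimated rather than fixed by hand.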
Stefan Arnold, Dilara Yesilbas, Rene Gröbner, Dominik Riedelbauch, Maik Horn, Sven Weinzierl
Artificial Intelligence (AI) faces persistent challenges in terms of transparency and accountability, which require rigorous documentation. Through a literature review on documentation practices, we provide an overview of prevailing trends, persistent issues, and the multifaceted interplay of factors influencing documentation. Our examination of key characteristics such as scope, target audiences, support for multimodality, and level of automation highlights a dynamic evolution in documentation practices, underscored by a shift towards more holistic, engaging, and automated documentation.
{"title":"Documentation Practices of Artificial Intelligence","authors":"Stefan Arnold, Dilara Yesilbas, Rene Gröbner, Dominik Riedelbauch, Maik Horn, Sven Weinzierl","doi":"arxiv-2406.18620","DOIUrl":"https://doi.org/arxiv-2406.18620","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-26"}
Simon van Bellen, Juan Pablo Alperin, Vincent Larivière
Global scholarly publishing has been dominated by a small number of publishers for several decades. We aimed to revisit the debate on corporate control of scholarly publishing by analyzing the relative shares of major publishers and smaller, independent publishers. Comparing the Web of Science, Dimensions and OpenAlex, we retrieved twice as many articles indexed in Dimensions and OpenAlex as in the rather selective Web of Science. Because the Web of Science excludes smaller publishers, the 'oligopoly' of scholarly publishers persists, at least in appearance, in that database. However, the inclusive indexing of both Dimensions and OpenAlex revealed that the share of smaller publishers has been growing rapidly, especially since the onset of large-scale online publishing around 2000, resulting in a current cumulative dominance of smaller publishers. While the expansion of small publishers was most pronounced in the social sciences and humanities, the natural and medical sciences showed a similar trend. A major geographical divergence is also revealed: some countries, mostly Anglo-Saxon and/or located in northwestern Europe, rely heavily on major publishers for the dissemination of their research, while others, such as those in Latin America, northern Africa, eastern Europe and parts of Asia, are relatively independent of the oligopoly. The emergence of digital publishing, the reduction of expenses for printing and distribution and open-source journal management tools may have contributed to the emergence of small publishers, while the development of inclusive bibliometric databases has allowed for the effective indexing of journals and articles. We conclude that enhanced visibility for recently created, independent journals may favour their growth and stimulate global scholarly bibliodiversity.
{"title":"The oligopoly of academic publishers persists in exclusive database","authors":"Simon van Bellen, Juan Pablo Alperin, Vincent Larivière","doi":"arxiv-2406.17893","DOIUrl":"https://doi.org/arxiv-2406.17893","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-25"}
In this paper, we describe the extraction of all the location entries from a prominent Swedish encyclopedia from the early 20th century, the Nordisk Familjebok ('Nordic Family Book'). We focused on the second edition, called Uggleupplagan, which comprises 38 volumes and over 182,000 articles. This makes it one of the most extensive Swedish encyclopedias. Using a classifier, we first determined the category of the entries. We found that approximately 22 percent of them were locations. We applied named entity recognition to these entries and linked them to Wikidata. Wikidata enabled us to extract their precise geographic locations, resulting in almost 18,000 valid coordinates. We then analyzed the distribution of these locations and the entry selection process. The analysis showed a higher density within Sweden, Germany, and the United Kingdom. The paper sheds light on the selection and representation of geographic information in the Nordisk Familjebok, providing insights into historical and societal perspectives. It also paves the way for future investigations into entry selection in different time periods and comparative analyses among various encyclopedias.
{"title":"Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata","authors":"Axel Ahlin, Alfred Myrne, Pierre Nugues","doi":"arxiv-2406.17903","DOIUrl":"https://doi.org/arxiv-2406.17903","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-25"}
Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Mara Airoldi, Eleanor Carter, Rob Procter
The sheer number of research outputs published every year makes systematic reviewing increasingly time- and resource-intensive. This paper explores the use of machine learning (ML) techniques to help navigate the systematic review process. ML has previously been used to reliably 'screen' articles for review, that is, to identify relevant articles based on reviewers' inclusion criteria. The application of ML techniques to subsequent stages of a review, however, such as data extraction and evidence mapping, is in its infancy. We therefore set out to develop a series of tools that would assist in the profiling and analysis of 1,952 publications on the theme of 'outcomes-based contracting'. Tools were developed for the following tasks: assigning publications to 'policy area' categories; identifying and extracting key information for evidence mapping, such as organisations, laws, and geographical information; connecting the evidence base to an existing dataset on the same topic; and identifying subgroups of articles that may share thematic content. An interactive tool using these techniques and a public dataset with their outputs have been released. Our results demonstrate the utility of ML techniques to enhance evidence accessibility and analysis within the systematic review process. These efforts show promise in potentially yielding substantial efficiencies for future systematic reviews and in broadening their analytical scope. Our work suggests that there may be implications for the ease with which policymakers and practitioners can access evidence. While ML techniques seem poised to play a significant role in bridging the gap between research and policy by offering innovative ways of gathering, accessing, and analysing data from systematic reviews, we also highlight their current limitations and the need to exercise caution in their application, particularly given the potential for errors and biases.
{"title":"SyROCCo: Enhancing Systematic Reviews using Machine Learning","authors":"Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Mara Airoldi, Eleanor Carter, Rob Procter","doi":"arxiv-2406.16527","DOIUrl":"https://doi.org/arxiv-2406.16527","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-24"}
We approach productivity in science in a longitudinal fashion: We track careers over time, up to 40 years. We first allocate scientists to decile-based publishing productivity classes, from the bottom 10% to the top 10%. Then, we seek patterns of mobility between the classes in two career stages: assistant professorship and associate professorship. Our findings confirm that radically changing publishing productivity levels (upward or downward) almost never happens. Scientists with a very weak past track record in publications emerge as having marginal chances of becoming scientists with a very strong future track record across all science, technology, engineering, mathematics, and medicine (STEMM) fields. Hence, our research shows a long-term character of careers in science, with publishing productivity during the apprenticeship period of assistant professorship heavily influencing productivity during the more independent period of associate professorship. We use individual-level microdata on academic careers (from a national registry of scientists) and individual-level metadata on publications (from the Scopus raw dataset). Polish associate professors tend to be stuck in their productivity classes for years: High performers tend to remain high performers, and low performers tend to remain low performers over their careers. Logistic regression analysis strongly supports our two-dimensional results. We examine all internationally visible Polish associate professors in five STEMM fields (N = 4,165, with N_art = 71,841 articles).
{"title":"Are Scientists Changing their Research Productivity Classes When They Move Up the Academic Ladder?","authors":"Marek Kwiek, Wojciech Roszka","doi":"arxiv-2407.04200","DOIUrl":"https://doi.org/arxiv-2407.04200","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-23"}
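The two steps named in the abstract above, allocating scientists to decile-based productivity classes and then looking at mobility between classes across career stages, can be sketched as follows. This is an illustration under stated assumptions, not the authors' procedure: the productivity values are synthetic, ties are broken by input order, and the function names are hypothetical.

```python
from collections import Counter


def decile_classes(productivity):
    """Map each scientist to a class from 1 (bottom 10%) to 10 (top 10%) by rank."""
    ranked = sorted(productivity, key=productivity.get)  # lowest productivity first
    n = len(ranked)
    return {name: (rank * 10) // n + 1 for rank, name in enumerate(ranked)}


def mobility_counts(early, late):
    """Count (early class, late class) pairs for scientists present in both stages."""
    return Counter((early[s], late[s]) for s in early if s in late)


def share_staying(counts):
    """Fraction of scientists whose class is the same in both stages."""
    total = sum(counts.values())
    same = sum(v for (a, b), v in counts.items() if a == b)
    return same / total


# Synthetic productivity per career stage (assumption): 20 hypothetical scientists.
assistant_stage = {f"s{i}": float(i) for i in range(20)}
associate_stage = {f"s{i}": float(i) + (3.0 if i % 5 == 0 else 0.0) for i in range(20)}

early = decile_classes(assistant_stage)
late = decile_classes(associate_stage)
transitions = mobility_counts(early, late)
print("share staying in the same class:", share_staying(transitions))
```

The off-diagonal entries of `transitions` are the mobility cases; the abstract's finding corresponds to very few counts far from the diagonal, such as a move from class 1 to class 10.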
Nick Haupka, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, Philipp Mayr
This study compares and analyses publication and document types in the following bibliographic databases: OpenAlex, Scopus, Web of Science, Semantic Scholar and PubMed. The results demonstrate that typologies can differ considerably between individual database providers. Moreover, the distinction between research and non-research texts, which is required to identify relevant documents for bibliometric analysis, can vary depending on the data source because publications are classified differently in the respective databases. The focus of this study, in addition to the cross-database comparison, is primarily on the coverage and analysis of the publication and document types contained in OpenAlex, as OpenAlex is becoming increasingly important as a free alternative to established proprietary providers for bibliometric analyses at libraries and universities.
{"title":"Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, Pubmed and Semantic Scholar","authors":"Nick Haupka, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, Philipp Mayr","doi":"arxiv-2406.15154","DOIUrl":"https://doi.org/arxiv-2406.15154","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-21"}
Following Funk and Owen-Smith (2017), Wu et al. (2019) proposed the disruption index (DI1) as a bibliometric indicator that measures disruptive and consolidating research. When we summarized the literature on the disruption index for our recently published review article (Leibel & Bornmann, 2024), we noticed that the calculation of disruption scores comes with numerous (hidden) degrees of freedom. In this Letter to the Editor, we explain why this analytical flexibility endangers the credibility of bibliometric research based on the DI1 (and its variants) and advocate the application of multiverse-style methods to increase the transparency of the research.
{"title":"The disruption index in the multiverse: The calculation of scores comes with numerous (hidden) degrees of freedom","authors":"Christian Leibel, Lutz Bornmann","doi":"arxiv-2406.13367","DOIUrl":"https://doi.org/arxiv-2406.13367","journal":{"name":"arXiv - CS - Digital Libraries"},"publicationDate":"2024-06-19"}
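To make the point about analytical flexibility in the disruption index abstract above concrete, the sketch below computes the index in its commonly stated form, DI = (N_F - N_B) / (N_F + N_B + N_R), where N_F counts papers citing the focal paper but not its references, N_B counts papers citing both, and N_R counts papers citing only the references. The `min_shared_refs` parameter is a hypothetical knob added here to illustrate one degree of freedom; it is not taken from the Letter and is not claimed to reproduce any specific published variant. The citation data are synthetic.

```python
def disruption_index(focal, references, citations, min_shared_refs=1):
    """Compute (N_F - N_B) / (N_F + N_B + N_R) for one focal paper.

    citations maps each citing paper to the set of papers it cites.
    min_shared_refs (illustrative assumption) is how many of the focal
    paper's references a citing paper must also cite to count as N_B.
    """
    refs = set(references)
    n_f = n_b = n_r = 0
    for citing, cited in citations.items():
        if citing == focal:
            continue
        shared = len(refs & cited)
        if focal in cited:
            if shared >= min_shared_refs:
                n_b += 1  # cites the focal paper and enough of its references
            else:
                n_f += 1  # cites the focal paper only
        elif shared >= 1:
            n_r += 1      # cites at least one reference but not the focal paper
    total = n_f + n_b + n_r
    return (n_f - n_b) / total if total else None


# Synthetic citation network (assumption): focal paper F with references R1, R2.
citations = {
    "A": {"F"},
    "B": {"F", "R1"},
    "C": {"F", "R1", "R2"},
    "D": {"R1"},
    "E": {"R2", "X"},
    "G": {"X"},  # cites neither F nor its references, so it is ignored
}

print(disruption_index("F", ["R1", "R2"], citations, min_shared_refs=1))
print(disruption_index("F", ["R1", "R2"], citations, min_shared_refs=2))
```

On this toy network the score changes sign when the threshold moves from one shared reference to two, which is the kind of sensitivity a multiverse-style analysis would report across all defensible settings rather than for one chosen setting.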
Regulating AI has emerged as a key societal challenge, but which methods of regulation are effective is unclear. Here, we measure the effectiveness of restricting AI services geographically using the case of ChatGPT and science. OpenAI prohibits access to ChatGPT from several countries including China and Russia. If the restrictions are effective, there should be minimal use of ChatGPT in prohibited countries. Drawing on the finding that early versions of ChatGPT overrepresented distinctive words like "delve," we developed a simple ensemble classifier by training it on abstracts before and after ChatGPT "polishing". Testing on held-out abstracts and on abstracts whose authors self-declared having used AI for writing shows that our classifier substantially outperforms off-the-shelf LLM detectors like GPTZero and ZeroGPT. Applying the classifier to preprints from arXiv, bioRxiv, and medRxiv reveals that ChatGPT was used in approximately 12.6% of preprints by August 2023 and that use was 7.7% higher in countries without legal access. Crucially, these patterns appeared before the first major legal LLM became widely available in China, the largest restricted-country preprint producer. ChatGPT use was associated with higher views and downloads, but not citations or journal placement. Overall, restricting ChatGPT geographically has proven ineffective in science and possibly other domains, likely due to widespread workarounds.
Where there's a will there's a way: ChatGPT is used more for science in countries where it is prohibited. Honglin Bao, Mengyi Sun, Misha Teplitskiy. arXiv - CS - Digital Libraries, 2024-06-17. doi: arxiv-2406.11583, https://doi.org/arxiv-2406.11583
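The word-frequency idea behind the classifier in the abstract above can be illustrated with a minimal sketch. This is not the authors' ensemble classifier: it is a single-feature threshold rule, and the marker list, the function names `marker_rate`, `fit_threshold` and `looks_polished`, and the midpoint rule are all assumptions made for illustration. Only the word "delve" comes from the abstract.

```python
import re

# Hypothetical marker list: only "delve" is named in the abstract;
# the inflected forms are an assumption added for illustration.
MARKER_WORDS = {"delve", "delves", "delving"}


def marker_rate(text, markers=MARKER_WORDS):
    """Return the fraction of word tokens in `text` that are marker words."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(token in markers for token in tokens) / len(tokens)


def fit_threshold(original, polished):
    """Pick the midpoint between the mean marker rates of two labelled sets.

    `original` holds abstracts before polishing, `polished` the same
    abstracts after polishing. Both lists must be non-empty.
    """
    mean_original = sum(marker_rate(t) for t in original) / len(original)
    mean_polished = sum(marker_rate(t) for t in polished) / len(polished)
    return (mean_original + mean_polished) / 2


def looks_polished(text, threshold):
    """Flag `text` when its marker rate exceeds the fitted threshold."""
    return marker_rate(text) > threshold


if __name__ == "__main__":
    before = ["We study citation networks in physics."]
    after = ["We delve into citation networks in physics."]
    threshold = fit_threshold(before, after)
    print(looks_polished("We delve into the data.", threshold))
```

A realistic detector would combine many word-level features and validate on held-out abstracts, as the abstract describes; the sketch only shows how paired before-and-after texts can yield a decision rule.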
Differences between the impacts of Open Access (OA) and non-OA research have been observed over a wide range of citation and altmetric indicators, with studies usually finding an Open Access Advantage (OAA) within specific fields. However, science-wide analyses covering multiple years, indicators and disciplines are lacking. Using citation counts and six altmetrics for 38.7M articles published 2011-21, we compare OA and non-OA papers. The results show that there is no universal OAA across all disciplines or impact indicators: the OAA for citations tends to be lower for more recent papers; the OAAs for news, blogs and Twitter are consistent across years and unrelated to the volume of OA publications; and the OAAs for Wikipedia, patents and policy citations are more complex. These results support different hypotheses for different subjects and indicators. The evidence is consistent with OA accelerating research impact in the Medical & Health Sciences, Life Sciences and the Humanities; with increased visibility or discoverability being a factor in promoting the translation of research into socio-economic impact; and with OA being a factor in growing online engagement with research in some disciplines. OAAs are therefore complex, dynamic and multi-factorial, and require considerable analysis to understand.
Evaluating Open Access Advantages for Citations and Altmetrics (2011-21): A Dynamic and Evolving Relationship. Michael Taylor. arXiv - CS - Digital Libraries, 2024-06-15. doi: arxiv-2406.10535, https://doi.org/arxiv-2406.10535
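The abstract above does not say how an OAA is computed. As a rough illustration only, the sketch below assumes one common formulation: the ratio of the mean indicator count for OA papers to the mean for non-OA papers, grouped by publication year. The function name `oaa_by_year` and the record layout `(year, is_oa, count)` are hypothetical.

```python
from collections import defaultdict


def oaa_by_year(records):
    """Return {year: mean OA count / mean non-OA count}.

    `records` is an iterable of (year, is_oa, count) tuples, where `count`
    is any impact indicator such as citations. Years that lack either
    group, or whose non-OA mean is zero, are left out of the result.
    """
    # Per year: [oa_sum, oa_n, non_oa_sum, non_oa_n]
    totals = defaultdict(lambda: [0.0, 0, 0.0, 0])
    for year, is_oa, count in records:
        bucket = totals[year]
        if is_oa:
            bucket[0] += count
            bucket[1] += 1
        else:
            bucket[2] += count
            bucket[3] += 1

    ratios = {}
    for year, (oa_sum, oa_n, non_sum, non_n) in totals.items():
        if oa_n == 0 or non_n == 0 or non_sum == 0:
            continue
        ratios[year] = (oa_sum / oa_n) / (non_sum / non_n)
    return ratios


if __name__ == "__main__":
    toy = [
        (2011, True, 10), (2011, True, 6), (2011, False, 4),
        (2021, True, 3), (2021, False, 3), (2021, False, 1),
    ]
    print(oaa_by_year(toy))
```

With toy data like this, a ratio above 1 indicates an advantage for OA papers in that year. The numbers are invented for the example and say nothing about the study's findings.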