
arXiv - CS - Digital Libraries: Latest Papers

Dating ancient manuscripts using radiocarbon and AI-based writing style analysis
Pub Date : 2024-06-26 DOI: arxiv-2407.12013
Mladen Popović, Maruf A. Dhali, Lambert Schomaker, Johannes van der Plicht, Kaare Lund Rasmussen, Jacopo La Nasa, Ilaria Degano, Maria Perla Colombini, Eibert Tigchelaar
Determining the chronology of ancient handwritten manuscripts is essential for reconstructing the evolution of ideas. For the Dead Sea Scrolls, this is particularly important. However, there is an almost complete lack of date-bearing manuscripts evenly distributed across the timeline and written in similar scripts available for palaeographic comparison. Here, we present Enoch, a state-of-the-art AI-based date-prediction model, trained on the basis of new radiocarbon-dated samples of the scrolls. Enoch uses established handwriting-style descriptors and applies Bayesian ridge regression. The challenge of this study is that the number of radiocarbon-dated manuscripts is small, while current machine learning requires an abundance of training data. We show that by using combined angular and allographic writing style feature vectors and applying Bayesian ridge regression, Enoch could predict the radiocarbon-based dates from style, supported by leave-one-out validation, with varied MAEs of 27.9 to 30.7 years relative to the radiocarbon dating. Enoch was then used to estimate the dates of 135 unseen manuscripts, revealing that 79 per cent of the samples were considered 'realistic' upon palaeographic post-hoc evaluation. We present a new chronology of the scrolls. The radiocarbon ranges and Enoch's style-based predictions are often older than the traditionally assumed palaeographic estimates. In the range of 300-50 BCE, Enoch's date prediction provides an improved granularity. The study is in line with current developments in multimodal machine-learning techniques, and the methods can be used for date prediction in other partially-dated manuscript collections. This research shows how Enoch's quantitative, probability-based approach can be a tool for palaeographers and historians, re-dating ancient Jewish key texts and contributing to current debates on Jewish and Christian origins.
Citations: 0
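The validation scheme named in the Enoch abstract above (leave-one-out validation scored by mean absolute error) can be illustrated with a minimal sketch. Everything here is an assumption for illustration: the single scalar style feature, the toy dates, and the use of plain ridge regression with a fixed penalty `lam` in place of the Bayesian ridge regression the authors report. It is not the Enoch implementation.

```python
def fit_ridge_1d(xs, ys, lam):
    """Fit y = w * x + b with an L2 penalty `lam` on w (intercept unpenalized)."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxx = sum((x - x_mean) ** 2 for x in xs)
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    w = sxy / (sxx + lam)
    b = y_mean - w * x_mean
    return w, b


def loo_mae(xs, ys, lam=1.0):
    """Leave-one-out mean absolute error: hold out each sample exactly once."""
    errors = []
    for i in range(len(xs)):
        # Train on every sample except the i-th, then predict the i-th.
        train_x = xs[:i] + xs[i + 1:]
        train_y = ys[:i] + ys[i + 1:]
        w, b = fit_ridge_1d(train_x, train_y, lam)
        errors.append(abs(w * xs[i] + b - ys[i]))
    return sum(errors) / len(errors)


# Hypothetical data: one style feature per manuscript and a made-up date
# (negative values stand for years BCE). These are not values from the paper.
style_feature = [0.1, 0.4, 0.5, 0.8, 0.9]
toy_dates = [-250.0, -190.0, -160.0, -100.0, -60.0]
print(round(loo_mae(style_feature, toy_dates, lam=0.01), 1))
```

With few dated samples, leave-one-out uses every sample for testing while still training on all the others, which is presumably why it suits a small radiocarbon-dated set.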
Documentation Practices of Artificial Intelligence
Pub Date : 2024-06-26 DOI: arxiv-2406.18620
Stefan Arnold, Dilara Yesilbas, Rene Gröbner, Dominik Riedelbauch, Maik Horn, Sven Weinzierl
Artificial Intelligence (AI) faces persistent challenges in terms of transparency and accountability, which requires rigorous documentation. Through a literature review on documentation practices, we provide an overview of prevailing trends, persistent issues, and the multifaceted interplay of factors influencing the documentation. Our examination of key characteristics such as scope, target audiences, support for multimodality, and level of automation, highlights a dynamic evolution in documentation practices, underscored by a shift towards a more holistic, engaging, and automated documentation.
Citations: 0
The oligopoly of academic publishers persists in exclusive database
Pub Date : 2024-06-25 DOI: arxiv-2406.17893
Simon van Bellen, Juan Pablo Alperin, Vincent Larivière
Global scholarly publishing has been dominated by a small number of publishers for several decades. We aimed to revisit the debate on corporate control of scholarly publishing by analyzing the relative shares of major publishers and smaller, independent publishers. Using the Web of Science, Dimensions and OpenAlex, we managed to retrieve twice as many articles indexed in Dimensions and OpenAlex, compared to the rather selective Web of Science. As a result of excluding smaller publishers, the 'oligopoly' of scholarly publishers persists, at least in appearance, according to the Web of Science. However, both Dimensions' and OpenAlex' inclusive indexing revealed the share of smaller publishers has been growing rapidly, especially since the onset of large-scale online publishing around 2000, resulting in a current cumulative dominance of smaller publishers. While the expansion of small publishers was most pronounced in the social sciences and humanities, the natural and medical sciences showed a similar trend. A major geographical divergence is also revealed, with some countries, mostly Anglo-Saxon and/or located in northwestern Europe, relying heavily on major publishers for the dissemination of their research, while others being relatively independent of the oligopoly, such as those in Latin America, northern Africa, eastern Europe and parts of Asia. The emergence of digital publishing, the reduction of expenses for printing and distribution and open-source journal management tools may have contributed to the emergence of small publishers, while the development of inclusive bibliometric databases has allowed for the effective indexing of journals and articles. We conclude that enhanced visibility to recently created, independent journals may favour their growth and stimulate global scholarly bibliodiversity.
Citations: 0
Mapping the Past: Geographically Linking an Early 20th Century Swedish Encyclopedia with Wikidata
Pub Date : 2024-06-25 DOI: arxiv-2406.17903
Axel Ahlin, Alfred Myrne, Pierre Nugues
In this paper, we describe the extraction of all the location entries from a prominent Swedish encyclopedia from the early 20th century, the Nordisk Familjebok 'Nordic Family Book.' We focused on the second edition called Uggleupplagan, which comprises 38 volumes and over 182,000 articles. This makes it one of the most extensive Swedish encyclopedias. Using a classifier, we first determined the category of the entries. We found that approximately 22 percent of them were locations. We applied a named entity recognition to these entries and we linked them to Wikidata. Wikidata enabled us to extract their precise geographic locations resulting in almost 18,000 valid coordinates. We then analyzed the distribution of these locations and the entry selection process. It showed a higher density within Sweden, Germany, and the United Kingdom. The paper sheds light on the selection and representation of geographic information in the Nordisk Familjebok, providing insights into historical and societal perspectives. It also paves the way for future investigations into entry selection in different time periods and comparative analyses among various encyclopedias.
Citations: 0
SyROCCo: Enhancing Systematic Reviews using Machine Learning
Pub Date : 2024-06-24 DOI: arxiv-2406.16527
Zheng Fang, Miguel Arana-Catania, Felix-Anselm van Lier, Juliana Outes Velarde, Harry Bregazzi, Mara Airoldi, Eleanor Carter, Rob Procter
The sheer number of research outputs published every year makes systematic reviewing increasingly time- and resource-intensive. This paper explores the use of machine learning techniques to help navigate the systematic review process. ML has previously been used to reliably 'screen' articles for review - that is, identify relevant articles based on reviewers' inclusion criteria. The application of ML techniques to subsequent stages of a review, however, such as data extraction and evidence mapping, is in its infancy. We therefore set out to develop a series of tools that would assist in the profiling and analysis of 1,952 publications on the theme of 'outcomes-based contracting'. Tools were developed for the following tasks: assign publications into 'policy area' categories; identify and extract key information for evidence mapping, such as organisations, laws, and geographical information; connect the evidence base to an existing dataset on the same topic; and identify subgroups of articles that may share thematic content. An interactive tool using these techniques and a public dataset with their outputs have been released. Our results demonstrate the utility of ML techniques to enhance evidence accessibility and analysis within the systematic review processes. These efforts show promise in potentially yielding substantial efficiencies for future systematic reviewing and for broadening their analytical scope. Our work suggests that there may be implications for the ease with which policymakers and practitioners can access evidence. While ML techniques seem poised to play a significant role in bridging the gap between research and policy by offering innovative ways of gathering, accessing, and analysing data from systematic reviews, we also highlight their current limitations and the need to exercise caution in their application, particularly given the potential for errors and biases.
Citations: 0
Are Scientists Changing their Research Productivity Classes When They Move Up the Academic Ladder?
Pub Date : 2024-06-23 DOI: arxiv-2407.04200
Marek Kwiek, Wojciech Roszka
We approach productivity in science in a longitudinal fashion: We track careers over time, up to 40 years. We first allocate scientists to decile-based publishing productivity classes, from the bottom 10% to the top 10%. Then, we seek patterns of mobility between the classes in two career stages: assistant professorship and associate professorship. Our findings confirm that radically changing publishing productivity levels (upward or downward) almost never happens. Scientists with a very weak past track record in publications emerge as having marginal chances of becoming scientists with a very strong future track record across all science, technology, engineering, mathematics, and medicine (STEMM) fields. Hence, our research shows a long-term character of careers in science, with publishing productivity during the apprenticeship period of assistant professorship heavily influencing productivity during the more independent period of associate professorship. We use individual-level microdata on academic careers (from a national registry of scientists) and individual-level metadata on publications (from the Scopus raw dataset). Polish associate professors tend to be stuck in their productivity classes for years: High performers tend to remain high performers, and low performers tend to remain low performers over their careers. Logistic regression analysis powerfully supports our two-dimensional results. We examine all internationally visible Polish associate professors in five fields of science in STEMM fields (N = 4,165 with N art = 71,841 articles).
Citations: 0
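The two steps described in the Kwiek and Roszka abstract above (allocating scientists to decile-based classes, then looking at mobility between classes across two career stages) can be sketched as follows. This is a simplified illustration under stated assumptions: the function names, the raw publication counts, and the rank-based tie handling are hypothetical, and the authors' actual productivity measure and within-field procedure are not specified in the abstract.

```python
from collections import Counter


def decile_classes(pub_counts):
    """Map each scientist to class 1 (bottom 10%) .. 10 (top 10%) by rank.

    Ties are broken by input order, which is an arbitrary choice made here.
    """
    n = len(pub_counts)
    order = sorted(range(n), key=lambda i: pub_counts[i])
    classes = [0] * n
    for rank, idx in enumerate(order):
        classes[idx] = rank * 10 // n + 1
    return classes


def transitions(before, after):
    """Count (class at stage 1, class at stage 2) pairs across scientists."""
    return Counter(zip(before, after))


# Hypothetical publication counts for ten scientists at two career stages.
assistant_stage = [1, 3, 2, 8, 5, 13, 21, 4, 9, 6]
associate_stage = [2, 4, 3, 15, 6, 20, 30, 5, 12, 8]

moves = transitions(decile_classes(assistant_stage),
                    decile_classes(associate_stage))
stayed = sum(count for (a, b), count in moves.items() if a == b)
print(stayed, "of", len(assistant_stage), "stayed in the same class")
```

The resulting `Counter` is in effect a sparse 10 x 10 transition matrix; mass concentrated on the diagonal corresponds to the "stuck in their productivity classes" pattern the abstract reports.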
Analysis of the Publication and Document Types in OpenAlex, Web of Science, Scopus, Pubmed and Semantic Scholar
Pub Date : 2024-06-21 DOI: arxiv-2406.15154
Nick Haupka, Jack H. Culbert, Alexander Schniedermann, Najko Jahn, Philipp Mayr
This study compares and analyses publication and document types in the following bibliographic databases: OpenAlex, Scopus, Web of Science, Semantic Scholar and PubMed. The results demonstrate that typologies can differ considerably between individual database providers. Moreover, the distinction between research and non-research texts, which is required to identify relevant documents for bibliometric analysis, can vary depending on the data source because publications are classified differently in the respective databases. The focus of this study, in addition to the cross-database comparison, is primarily on the coverage and analysis of the publication and document types contained in OpenAlex, as OpenAlex is becoming increasingly important as a free alternative to established proprietary providers for bibliometric analyses at libraries and universities.
Citations: 0
The disruption index in the multiverse: The calculation of scores comes with numerous (hidden) degrees of freedom
Pub Date : 2024-06-19 DOI: arxiv-2406.13367
Christian Leibel, Lutz Bornmann
Following Funk and Owen-Smith (2017), Wu et al. (2019) proposed the disruption index (DI1) as a bibliometric indicator that measures disruptive and consolidating research. When we summarized the literature on the disruption index for our recently published review article (Leibel & Bornmann, 2024), we noticed that the calculation of disruption scores comes with numerous (hidden) degrees of freedom. In this Letter to the Editor, we explain why this analytical flexibility endangers the credibility of bibliometric research based on the DI1 (and its variants) and advertise the application of multiverse-style methods to increase the transparency of the research.
Citations: 0
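For readers unfamiliar with the indicator discussed in the Letter above, here is a minimal sketch of how a DI1-style score is commonly computed. Citing papers are split into those that cite only the focal paper (n_i), those that cite the focal paper and at least one of its references (n_j), and those that cite only its references (n_k); the score is (n_i - n_j) / (n_i + n_j + n_k). The function name, argument names, and toy data are hypothetical and are not taken from the Letter.

```python
def disruption_index(focal, references, citations):
    """Compute a DI1-style score for one focal paper.

    focal      -- identifier of the focal paper (hypothetical toy id)
    references -- iterable of ids cited by the focal paper
    citations  -- dict mapping each citing paper id to the set of ids it cites
    Returns None when no paper cites the focal paper or its references.
    """
    refs = set(references)
    n_i = n_j = n_k = 0
    for paper, cited in citations.items():
        if paper == focal:
            continue  # the focal paper is not counted as its own citer
        cites_focal = focal in cited
        cites_refs = bool(refs & set(cited))
        if cites_focal and not cites_refs:
            n_i += 1  # cites only the focal paper
        elif cites_focal and cites_refs:
            n_j += 1  # cites the focal paper and its references
        elif cites_refs:
            n_k += 1  # cites only the references
    total = n_i + n_j + n_k
    return (n_i - n_j) / total if total else None


# Toy citation network, invented purely for illustration.
toy_citations = {
    "A": {"F"},
    "B": {"F"},
    "C": {"F", "R1"},
    "D": {"R2"},
    "E": {"X"},  # unrelated paper, ignored by the score
}
score = disruption_index("F", ["R1", "R2"], toy_citations)
print(score)
```

On the toy data, two papers cite only the focal paper, one cites both, and one cites only a reference, so the score is (2 - 1) / 4 = 0.25. Decisions made before calling such a function, for example which citing papers are included in `citations`, are the kind of choice that can shift the result; that example is an illustration here, not a list drawn from the Letter.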
Where there's a will there's a way: ChatGPT is used more for science in countries where it is prohibited
Pub Date : 2024-06-17 DOI: arxiv-2406.11583
Honglin Bao, Mengyi Sun, Misha Teplitskiy
Regulating AI has emerged as a key societal challenge, but which methods of regulation are effective is unclear. Here, we measure the effectiveness of restricting AI services geographically using the case of ChatGPT and science. OpenAI prohibits access to ChatGPT from several countries including China and Russia. If the restrictions are effective, there should be minimal use of ChatGPT in prohibited countries. Drawing on the finding that early versions of ChatGPT overrepresented distinctive words like "delve," we developed a simple ensemble classifier by training it on abstracts before and after ChatGPT "polishing". Testing on held-out abstracts and those where authors self-declared to have used AI for writing shows that our classifier substantially outperforms off-the-shelf LLM detectors like GPTZero and ZeroGPT. Applying the classifier to preprints from Arxiv, BioRxiv, and MedRxiv reveals that ChatGPT was used in approximately 12.6% of preprints by August 2023 and use was 7.7% higher in countries without legal access. Crucially, these patterns appeared before the first major legal LLM became widely available in China, the largest restricted-country preprint producer. ChatGPT use was associated with higher views and downloads, but not citations or journal placement. Overall, restricting ChatGPT geographically has proven ineffective in science and possibly other domains, likely due to widespread workarounds.
Citations: 0
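The abstract above builds on the observation that early ChatGPT output overrepresented distinctive words such as "delve". As a rough illustration of that signal only, and not of the authors' ensemble classifier, the sketch below measures what share of the words in a text begin with a marker stem. The stem list, function name, and sample sentences are assumptions made for this example; only the word "delve" comes from the abstract.

```python
import re

# Hypothetical stem list; only "delve" is mentioned in the abstract.
MARKER_STEMS = ("delv",)


def marker_rate(text, stems=MARKER_STEMS):
    """Return the fraction of word tokens that start with any marker stem."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(1 for token in tokens if token.startswith(stems))
    return hits / len(tokens)


# Invented sample sentences, used only to exercise the function.
before = "We study citation data"
after = "We delve into data"
print(marker_rate(before), marker_rate(after))
```

A single word rate like this would be a weak detector on its own; a classifier of the kind the abstract describes would presumably combine many such features, but how it does so is not stated in the abstract.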
Evaluating Open Access Advantages for Citations and Altmetrics (2011-21): A Dynamic and Evolving Relationship
Pub Date : 2024-06-15 DOI: arxiv-2406.10535
Michael Taylor
Differences between the impacts of Open Access (OA) and non-OA research have been observed over a wide range of citation and altmetric indicators, usually finding an Open Access Advantage (OAA) within specific fields. However, science-wide analyses covering multiple years, indicators and disciplines are lacking. Using citation counts and six altmetrics for 38.7M articles published 2011-21, we compare OA and non-OA papers. The results show that there is no universal OAA across all disciplines or impact indicators: the OAA for citations tends to be lower for more recent papers, whereas the OAAs for news, blogs and Twitter are consistent across years and unrelated to volume of OA publications, whereas the OAAs for Wikipedia, patents and policy citations are more complex. These results support different hypotheses for different subjects and indicators. The evidence is consistent with OA accelerating research impact in the Medical & Health Sciences, Life Sciences and the Humanities; that increased visibility or discoverability is a factor in promoting the translation of research into socio-economic impact; and that OA is a factor in growing online engagement with research in some disciplines. OAAs are therefore complex, dynamic, multi-factorial and require considerable analysis to understand.
Citations: 0