
arXiv - CS - Digital Libraries: Latest Papers

Automating the Identification of High-Value Datasets in Open Government Data Portals
Pub Date : 2024-06-15 DOI: arxiv-2406.10541
Alfonso Quarati, Anastasija Nikiforova
Recognized for fostering innovation and transparency, driving economic growth, enhancing public services, supporting research, empowering citizens, and promoting environmental sustainability, High-Value Datasets (HVD) play a crucial role in the broader Open Government Data (OGD) movement. However, identifying HVD presents a resource-intensive and complex challenge due to the nuanced nature of data value. Our proposal aims to automate the identification of HVDs on OGD portals using a quantitative approach based on a detailed analysis of user interest derived from data usage statistics, thereby minimizing the need for human intervention. The proposed method involves extracting download data, analyzing metrics to identify high-value categories, and comparing HVD datasets across different portals. This automated process provides valuable insights into trends in dataset usage, reflecting citizens' needs and preferences. The effectiveness of our approach is demonstrated through its application to a sample of US OGD city portals. The practical implications of this study include contributing to the understanding of HVD at both local and national levels. By providing a systematic and efficient means of identifying HVD, our approach aims to inform open governance initiatives and practices, aiding OGD portal managers and public authorities in their efforts to optimize data dissemination and utilization.
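The abstract above describes ranking categories by usage statistics without giving the exact metrics. A minimal sketch of the general idea, assuming each record is a hypothetical `(dataset_id, category, downloads)` tuple and that total downloads per category is the ranking metric (both are assumptions, not the authors' stated method):

```python
from collections import defaultdict


def rank_categories(records, top_n=3):
    """Return the top_n categories by total downloads.

    records: iterable of (dataset_id, category, downloads) tuples.
    This record layout is an illustrative assumption.
    """
    totals = defaultdict(int)
    for _dataset_id, category, downloads in records:
        totals[category] += downloads
    # Sort by downloads (descending), then by name so ties are deterministic.
    ranked = sorted(totals.items(), key=lambda kv: (-kv[1], kv[0]))
    return ranked[:top_n]


# Made-up sample data, for illustration only.
sample = [
    ("d1", "Public Safety", 1200),
    ("d2", "Transportation", 800),
    ("d3", "Public Safety", 300),
    ("d4", "Finance", 150),
]
top_categories = rank_categories(sample, top_n=2)
```

Running the same function on records from several portals and comparing the resulting lists would be one simple way to approximate the cross-portal comparison step.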
Citations: 0
An open dataset of article processing charges from six large scholarly publishers (2019-2023)
Pub Date : 2024-06-12 DOI: arxiv-2406.08356
Leigh-Ann Butler, Madelaine Hare, Nina Schönfelder, Eric Schares, Juan Pablo Alperin, Stefanie Haustein
This paper introduces a dataset of article processing charges (APCs) produced from the price lists of six large scholarly publishers - Elsevier, Frontiers, PLOS, MDPI, Springer Nature and Wiley - between 2019 and 2023. APC price lists were downloaded from publisher websites each year as well as via Wayback Machine snapshots to retrieve fees per journal per year. The dataset includes journal metadata, APC collection method, and annual APC price list information in several currencies (USD, EUR, GBP, CHF, JPY, CAD) for 8,712 unique journals and 36,618 journal-year combinations. The dataset was generated to allow for more precise analysis of APCs and can support library collection development and scientometric analysis estimating APCs paid in gold and hybrid OA journals.
Citations: 0
Which topics are best represented by science maps? An analysis of clustering effectiveness for citation and text similarity networks
Pub Date : 2024-06-10 DOI: arxiv-2406.06454
Juan Pablo Bascur, Suzan Verberne, Nees Jan van Eck, Ludo Waltman
A science map of topics is a visualization that shows topics identified algorithmically based on the bibliographic metadata of scientific publications. In practice not all topics are well represented in a science map. We analyzed how effectively different topics are represented in science maps created by clustering biomedical publications. To achieve this, we investigated which topic categories, obtained from MeSH terms, are better represented in science maps based on citation or text similarity networks. To evaluate the clustering effectiveness of topics, we determined the extent to which documents belonging to the same topic are grouped together in the same cluster. We found that the best and worst represented topic categories are the same for citation and text similarity networks. The best represented topic categories are diseases, psychology, anatomy, organisms and the techniques and equipment used for diagnostics and therapy, while the worst represented topic categories are natural science fields, geographical entities, information sciences and health care and occupations. Furthermore, for the diseases and organisms topic categories and for science maps with smaller clusters, we found that topics tend to be better represented in citation similarity networks than in text similarity networks.
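The abstract does not spell out how "grouped together in the same cluster" is quantified. One simple proxy, offered only as an assumption and not as the authors' measure, is the share of a topic's documents that fall into that topic's most common cluster. The names `doc_topics` and `doc_clusters` are hypothetical:

```python
from collections import Counter, defaultdict


def topic_concentration(doc_topics, doc_clusters):
    """For each topic, return the fraction of its documents that sit in
    the single cluster holding most of them (1.0 = fully grouped).

    doc_topics:   dict mapping document id -> list of topic labels
    doc_clusters: dict mapping document id -> cluster id
    """
    clusters_by_topic = defaultdict(list)
    for doc, topics in doc_topics.items():
        for topic in topics:
            clusters_by_topic[topic].append(doc_clusters[doc])
    return {
        topic: Counter(clusters).most_common(1)[0][1] / len(clusters)
        for topic, clusters in clusters_by_topic.items()
    }


# Made-up toy data: four publications, two topics, two clusters.
doc_topics = {
    "p1": ["Diseases"],
    "p2": ["Diseases"],
    "p3": ["Diseases", "Geography"],
    "p4": ["Geography"],
}
doc_clusters = {"p1": 0, "p2": 0, "p3": 0, "p4": 1}
scores = topic_concentration(doc_topics, doc_clusters)
```

In this toy data the "Diseases" documents all share one cluster while the "Geography" documents are split, which mirrors the kind of contrast the abstract reports between well and poorly represented categories.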
Citations: 0
Coconut Libtool: Bridging Textual Analysis Gaps for Non-Programmers
Pub Date : 2024-06-10 DOI: arxiv-2406.05949
Faizhal Arif Santosa, Manika Lamba, Crissandra George, J. Stephen Downie
In the era of big and ubiquitous data, professionals and students alike are finding themselves needing to perform a number of textual analysis tasks. Historically, the general lack of statistical expertise and programming skills has stopped many with humanities or social sciences backgrounds from performing and fully benefiting from such analyses. Thus, we introduce Coconut Libtool (www.coconut-libtool.com/), an open-source, web-based application that utilizes state-of-the-art natural language processing (NLP) technologies. Coconut Libtool analyzes text data from customized files and bibliographic databases such as Web of Science, Scopus, and Lens. Users can verify which functions can be performed with the data they have. Coconut Libtool deploys multiple algorithmic NLP techniques at the backend, including topic modeling (LDA, Biterm, and BERTopic algorithms), network graph visualization, keyword lemmatization, and sunburst visualization. Coconut Libtool is the people-first web application designed to be used by professionals, researchers, and students in the information sciences, digital humanities, and computational social sciences domains to promote transparency, reproducibility, accessibility, reciprocity, and responsibility in research practices.
Citations: 0
The Impact of AI on Academic Research and Publishing
Pub Date : 2024-06-10 DOI: arxiv-2406.06009
Brady Lund, Manika Lamba, Sang Hoo Oh
Generative artificial intelligence (AI) technologies like ChatGPT have significantly impacted academic writing and publishing through their ability to generate content at levels comparable to or surpassing human writers. Through a review of recent interdisciplinary literature, this paper examines ethical considerations surrounding the integration of AI into academia, focusing on the potential for this technology to be used for scholarly misconduct and necessary oversight when using it for writing, editing, and reviewing of scholarly papers. The findings highlight the need for collaborative approaches to AI usage among publishers, editors, reviewers, and authors to ensure that this technology is used ethically and productively.
Citations: 0
Text Analysis of ETDs in ProQuest Dissertations and Theses (PQDT) Global (2016-2018)
Pub Date : 2024-06-10 DOI: arxiv-2406.06076
Manika Lamba
The information explosion in the form of ETDs poses the challenge of management and extraction of appropriate knowledge for decision-making. Thus, the present study forwards a solution to the above problem by applying topic mining and prediction modeling tools to 263 ETDs submitted to the PQDT Global database during 2016-18 in the field of library science. This study was divided into two phases. The first phase determined the core topics from the ETDs using Topic-Modeling-Tool (TMT), which was based on latent Dirichlet allocation (LDA), whereas the second phase employed prediction analysis using the RapidMiner platform to annotate the future research articles on the basis of the modeled topics. The core topics (tags) for the studied period were found to be book history, school librarian, public library, communicative ecology, and informatics, followed by text network and trend analysis on the high-probability co-occurring words. Lastly, a prediction model using a Support Vector Machine (SVM) classifier was created in order to accurately predict the placement of future ETDs to be submitted to PQDT Global under the five modeled topics (a to e). The tested dataset against the trained dataset for the predictive model performed perfectly.
Citations: 0
Automatically detecting scientific political science texts from a large general document index
Pub Date : 2024-06-05 DOI: arxiv-2406.03067
Nina Smirnova
This technical report outlines the filtering approach applied to the collection of the Bielefeld Academic Search Engine (BASE) data to extract articles from the political science domain. We combined hard and soft filters to address entries with different available metadata, e.g. title, abstract or keywords. The hard filter is a weighted keyword-based filter approach. The soft filter uses a multilingual BERT-based classification model, trained to detect scientific articles from the political science domain. We evaluated both approaches using an annotated dataset, consisting of scientific articles from different scientific domains. The weighted keyword-based approach achieved the highest total accuracy of 0.88. The multilingual BERT-based classification model was fine-tuned using a dataset of 14,178 abstracts from scientific articles and reached the highest total accuracy of 0.98.
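To make the "weighted keyword-based filter" concrete, here is a minimal sketch. The keyword list, the weights, and the threshold are all invented for illustration; the report's actual vocabulary and scoring rule are not given in the abstract. The `accuracy` helper shows how a total accuracy figure such as those quoted above is typically computed against annotated labels.

```python
import re

# Hypothetical keywords and weights, chosen only for illustration.
KEYWORD_WEIGHTS = {
    "election": 2.0,
    "parliament": 2.0,
    "democracy": 1.5,
    "policy": 1.0,
}


def keyword_score(text, weights=KEYWORD_WEIGHTS):
    """Sum the weights of all keyword occurrences in the text."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return sum(weights.get(token, 0.0) for token in tokens)


def passes_hard_filter(text, threshold=3.0):
    """Keep an entry when its weighted score reaches the (assumed) threshold."""
    return keyword_score(text) >= threshold


def accuracy(predictions, labels):
    """Fraction of predictions that match the annotated labels."""
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


score = keyword_score("Election policy in the parliament")
kept = passes_hard_filter("Election policy in the parliament")
dropped = passes_hard_filter("Protein folding policy")
acc = accuracy([True, False, True, True], [True, False, False, True])
```

A filter like this only needs a title or keyword field, which fits the abstract's point that hard and soft filters were combined to cover entries with different available metadata.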
Citations: 0
Promotional Language and the Adoption of Innovative Ideas in Science
Pub Date : 2024-06-04 DOI: arxiv-2406.02798
Hao Peng, Huilian Sophie Qiu, Henrik Barslund Fosse, Brian Uzzi
How are the merits of innovative ideas communicated in science? Here we conduct semantic analyses of grant application success with a focus on scientific promotional language, which has been growing in frequency in many contexts and purportedly may convey an innovative idea's originality and significance. Our analysis attempts to surmount limitations of prior studies by examining the full text of tens of thousands of both funded and unfunded grants from three leading public and private funding agencies: the NIH, the NSF, and the Novo Nordisk Foundation, one of the world's largest private science foundations. We find a robust association between promotional language and the support and adoption of innovative ideas by funders and other scientists. First, the percentage of promotional language in a grant proposal is associated with up to a doubling of the grant's probability of being funded. Second, a grant's promotional language reflects its intrinsic level of innovativeness. Third, the percentage of promotional language predicts the expected citation and productivity impact of publications that are supported by funded grants. Lastly, a computer-assisted experiment that manipulates the promotional language in our data demonstrates how promotional language can communicate the merit of ideas through cognitive activation. With the incidence of promotional language in science steeply rising, and the pivotal role of grants in converting promising and aspirational ideas into solutions, our analysis provides empirical evidence that promotional language is associated with effectively communicating the merits of innovative scientific ideas.
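The key quantity in this abstract is the "percentage of promotional language" in a proposal. A rough sketch of how such a percentage could be computed from a word list follows; the lexicon here is a made-up placeholder and the tokenization is deliberately naive, so this should not be read as the authors' measurement procedure.

```python
import re

# Placeholder lexicon, invented for illustration only.
PROMO_WORDS = frozenset({"novel", "groundbreaking", "unprecedented", "transformative"})


def promotional_percentage(text, lexicon=PROMO_WORDS):
    """Percentage of tokens in the text that appear in the lexicon."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    hits = sum(token in lexicon for token in tokens)
    return 100.0 * hits / len(tokens)


pct = promotional_percentage("A novel method works")
```

Here one of the four tokens is in the lexicon, so the share is a quarter of the text.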
Citations: 0
OpenDataLab: Empowering General Artificial Intelligence with Open Datasets
Pub Date : 2024-06-04 DOI: arxiv-2407.13773
Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, Dahua Lin
The advancement of artificial intelligence (AI) hinges on the quality and accessibility of data, yet the current fragmentation and variability of data sources hinder efficient data utilization. The dispersion of data sources and diversity of data formats often lead to inefficiencies in data retrieval and processing, significantly impeding the progress of AI research and applications. To address these challenges, this paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing. OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services. The platform employs a next-generation AI Data Set Description Language (DSDL), which standardizes the representation of multimodal and multi-format data, improving interoperability and reusability. Additionally, OpenDataLab optimizes data processing through tools that complement DSDL. By integrating data with unified data descriptions and smart data toolchains, OpenDataLab can improve data preparation efficiency by 30%. We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields. For more detailed information, please visit the platform's official website: https://opendatalab.com.
Citations: 0
Twitter should now be referred to as X: How academics, journals and publishers need to make the nomenclatural transition
Pub Date : 2024-05-31 DOI: arxiv-2405.20670
Jaime A. Teixeira da Silva, Serhii Nazarovets
Here, we note how academics, journals and publishers should no longer refer to the social media platform Twitter as such, rather as X. Relying on Google Scholar, we found 16 examples of papers published in the last months of 2023 - essentially during the transition period between Twitter and X - that used Twitter and X, but in different ways. Unlike that transition period in which the binary Twitter/X could have been used in academic papers, we suggest that papers should no longer refer to Twitter as Twitter, but only as X, except for historical studies about that social media platform, because such use would be factually incorrect.
Citations: 0