Recognized for fostering innovation and transparency, driving economic growth, enhancing public services, supporting research, empowering citizens, and promoting environmental sustainability, High-Value Datasets (HVD) play a crucial role in the broader Open Government Data (OGD) movement. However, identifying HVD presents a resource-intensive and complex challenge due to the nuanced nature of data value. Our proposal aims to automate the identification of HVDs on OGD portals using a quantitative approach based on a detailed analysis of user interest derived from data usage statistics, thereby minimizing the need for human intervention. The proposed method involves extracting download data, analyzing metrics to identify high-value categories, and comparing HVD datasets across different portals. This automated process provides valuable insights into trends in dataset usage, reflecting citizens' needs and preferences. The effectiveness of our approach is demonstrated through its application to a sample of US OGD city portals. The practical implications of this study include contributing to the understanding of HVD at both local and national levels. By providing a systematic and efficient means of identifying HVD, our approach aims to inform open governance initiatives and practices, aiding OGD portal managers and public authorities in their efforts to optimize data dissemination and utilization.
"Automating the Identification of High-Value Datasets in Open Government Data Portals" by Alfonso Quarati and Anastasija Nikiforova. arXiv:2406.10541, 2024-06-15.
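As a rough illustration of the download-based approach described above, the sketch below aggregates per-dataset download counts by category and ranks categories by total downloads as a proxy for user interest. The field names and figures are invented for illustration; the paper's actual metrics and portal data are more detailed.

```python
from collections import defaultdict

def rank_categories(datasets, top_n=3):
    """Aggregate download counts per category and return the top-N
    categories by total downloads (a simple proxy for user interest)."""
    totals = defaultdict(int)
    for d in datasets:
        totals[d["category"]] += d["downloads"]
    return sorted(totals, key=totals.get, reverse=True)[:top_n]

# Toy portal export: one record per dataset with its category and downloads.
sample = [
    {"category": "Public Safety", "downloads": 5200},
    {"category": "Transportation", "downloads": 3100},
    {"category": "Public Safety", "downloads": 2700},
    {"category": "Finance", "downloads": 800},
]

print(rank_categories(sample, top_n=2))  # → ['Public Safety', 'Transportation']
```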
Leigh-Ann Butler, Madelaine Hare, Nina Schönfelder, Eric Schares, Juan Pablo Alperin, Stefanie Haustein
This paper introduces a dataset of article processing charges (APCs) produced from the price lists of six large scholarly publishers - Elsevier, Frontiers, PLOS, MDPI, Springer Nature and Wiley - between 2019 and 2023. APC price lists were downloaded from publisher websites each year as well as via Wayback Machine snapshots to retrieve fees per journal per year. The dataset includes journal metadata, APC collection method, and annual APC price list information in several currencies (USD, EUR, GBP, CHF, JPY, CAD) for 8,712 unique journals and 36,618 journal-year combinations. The dataset was generated to allow for more precise analysis of APCs and can support library collection development and scientometric analysis estimating APCs paid in gold and hybrid OA journals.
"An open dataset of article processing charges from six large scholarly publishers (2019-2023)". arXiv:2406.08356, 2024-06-12.
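The Wayback Machine retrieval step can be sketched against the Internet Archive's public CDX API, which lists snapshots of a page within a date range. The price-list URL below is a placeholder, not necessarily a page the authors scraped; the query only builds the API URL and performs no network call.

```python
from urllib.parse import urlencode

CDX_ENDPOINT = "http://web.archive.org/cdx/search/cdx"

def cdx_query_url(page_url, year):
    """Build a Wayback Machine CDX API query listing snapshots of a
    publisher price-list page captured in a given calendar year."""
    params = {
        "url": page_url,           # page whose snapshots we want
        "from": f"{year}0101",     # first day of the year (YYYYMMDD)
        "to": f"{year}1231",       # last day of the year
        "output": "json",
        "fl": "timestamp,original",  # return only these CDX fields
    }
    return f"{CDX_ENDPOINT}?{urlencode(params)}"

# Hypothetical price-list page; fetching this URL would return one
# snapshot timestamp per line for 2021.
print(cdx_query_url("www.example-publisher.com/apc-prices", 2021))
```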
Juan Pablo Bascur, Suzan Verberne, Nees Jan van Eck, Ludo Waltman
A science map of topics is a visualization that shows topics identified algorithmically based on the bibliographic metadata of scientific publications. In practice not all topics are well represented in a science map. We analyzed how effectively different topics are represented in science maps created by clustering biomedical publications. To achieve this, we investigated which topic categories, obtained from MeSH terms, are better represented in science maps based on citation or text similarity networks. To evaluate the clustering effectiveness of topics, we determined the extent to which documents belonging to the same topic are grouped together in the same cluster. We found that the best and worst represented topic categories are the same for citation and text similarity networks. The best represented topic categories are diseases, psychology, anatomy, organisms and the techniques and equipment used for diagnostics and therapy, while the worst represented topic categories are natural science fields, geographical entities, information sciences and health care and occupations. Furthermore, for the diseases and organisms topic categories and for science maps with smaller clusters, we found that topics tend to be better represented in citation similarity networks than in text similarity networks.
"Which topics are best represented by science maps? An analysis of clustering effectiveness for citation and text similarity networks". arXiv:2406.06454, 2024-06-10.
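One simple way to formalize "the extent to which documents belonging to the same topic are grouped together in the same cluster" is the largest share of a topic's documents that land in a single cluster. This is an illustrative metric under that reading, not necessarily the exact measure used in the paper.

```python
from collections import Counter

def topic_representation(doc_topics, doc_clusters):
    """For each topic, the largest share of its documents that fall into
    one cluster: 1.0 means the topic maps cleanly onto a single cluster."""
    by_topic = {}
    for doc, topic in doc_topics.items():
        by_topic.setdefault(topic, []).append(doc_clusters[doc])
    return {t: max(Counter(cs).values()) / len(cs)
            for t, cs in by_topic.items()}

# Toy example: three "disease" documents split over two clusters,
# one "geo" document alone in its cluster.
topics = {"d1": "disease", "d2": "disease", "d3": "disease", "d4": "geo"}
clusters = {"d1": 0, "d2": 0, "d3": 1, "d4": 2}

print(topic_representation(topics, clusters))
```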
Faizhal Arif Santosa, Manika Lamba, Crissandra George, J. Stephen Downie
In the era of big and ubiquitous data, professionals and students alike are finding themselves needing to perform a number of textual analysis tasks. Historically, the general lack of statistical expertise and programming skills has stopped many with humanities or social sciences backgrounds from performing and fully benefiting from such analyses. Thus, we introduce Coconut Libtool (www.coconut-libtool.com/), an open-source, web-based application that utilizes state-of-the-art natural language processing (NLP) technologies. Coconut Libtool analyzes text data from customized files and bibliographic databases such as Web of Science, Scopus, and Lens. Users can verify which functions can be performed with the data they have. Coconut Libtool deploys multiple algorithmic NLP techniques at the backend, including topic modeling (LDA, Biterm, and BERTopic algorithms), network graph visualization, keyword lemmatization, and sunburst visualization. Coconut Libtool is a people-first web application designed to be used by professionals, researchers, and students in the information sciences, digital humanities, and computational social sciences domains to promote transparency, reproducibility, accessibility, reciprocity, and responsibility in research practices.
"Coconut Libtool: Bridging Textual Analysis Gaps for Non-Programmers". arXiv:2406.05949, 2024-06-10.
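The statement that "users can verify which functions can be performed with the data they have" suggests a capability check mapping the metadata columns a user uploads to the analyses they unlock. A minimal sketch of such a check, with a hypothetical requirements table (the real tool's column requirements are not documented here):

```python
# Hypothetical mapping from analysis to the metadata columns it requires.
REQUIREMENTS = {
    "topic modeling": {"abstract"},
    "network graph": {"keywords"},
    "sunburst": {"category", "year"},
}

def available_functions(columns):
    """Return the analyses whose required columns are all present."""
    cols = set(columns)
    return [name for name, required in REQUIREMENTS.items()
            if required <= cols]

print(available_functions(["abstract", "keywords"]))  # → ['topic modeling', 'network graph']
```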
Generative artificial intelligence (AI) technologies like ChatGPT have significantly impacted academic writing and publishing through their ability to generate content at levels comparable to or surpassing human writers. Through a review of recent interdisciplinary literature, this paper examines ethical considerations surrounding the integration of AI into academia, focusing on the potential for this technology to be used for scholarly misconduct and the oversight necessary when using it for writing, editing, and reviewing scholarly papers. The findings highlight the need for collaborative approaches to AI usage among publishers, editors, reviewers, and authors to ensure that this technology is used ethically and productively.
"The Impact of AI on Academic Research and Publishing" by Brady Lund, Manika Lamba, and Sang Hoo Oh. arXiv:2406.06009, 2024-06-10.
The information explosion in the form of ETDs poses the challenge of managing and extracting appropriate knowledge for decision-making. Thus, the present study puts forward a solution to this problem by applying topic mining and prediction modeling tools to 263 ETDs submitted to the PQDT Global database during 2016-18 in the field of library science. The study was divided into two phases. The first phase determined the core topics from the ETDs using the Topic-Modeling-Tool (TMT), which is based on latent Dirichlet allocation (LDA), whereas the second phase employed prediction analysis using the RapidMiner platform to annotate future research articles on the basis of the modeled topics. The core topics (tags) for the studied period were found to be book history, school librarian, public library, communicative ecology, and informatics, followed by text network and trend analysis on the high-probability co-occurring words. Lastly, a prediction model using a Support Vector Machine (SVM) classifier was created in order to accurately predict the placement of future ETDs submitted to PQDT Global under the five modeled topics (a to e). The model performed perfectly when the test dataset was evaluated against the training dataset.
"Text Analysis of ETDs in ProQuest Dissertations and Theses (PQDT) Global (2016-2018)" by Manika Lamba. arXiv:2406.06076, 2024-06-10.
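The study's second phase used an SVM classifier trained on LDA topics; as a much simpler stand-in for that pipeline, the sketch below assigns a new ETD abstract to the nearest of the five modeled topics by keyword overlap. The keyword lists are invented for illustration and are not the features the paper actually used.

```python
# Hypothetical keyword lists per modeled topic (illustrative only).
TOPIC_KEYWORDS = {
    "book history": {"book", "print", "publishing", "history"},
    "school librarian": {"school", "librarian", "teacher", "instruction"},
    "public library": {"public", "library", "community", "services"},
    "communicative ecology": {"communication", "ecology", "media"},
    "informatics": {"informatics", "data", "information", "system"},
}

def assign_topic(abstract):
    """Assign an abstract to the topic with the largest word overlap."""
    words = set(abstract.lower().split())
    return max(TOPIC_KEYWORDS, key=lambda t: len(words & TOPIC_KEYWORDS[t]))

print(assign_topic("survey of school librarian instruction programs"))
```

A trained SVM would weight many more features than this overlap count, but the interface (abstract in, topic label out) is the same.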
This technical report outlines the filtering approach applied to the collection of the Bielefeld Academic Search Engine (BASE) data to extract articles from the political science domain. We combined hard and soft filters to address entries with different available metadata, e.g. title, abstract or keywords. The hard filter is a weighted keyword-based filter approach. The soft filter uses a multilingual BERT-based classification model, trained to detect scientific articles from the political science domain. We evaluated both approaches using an annotated dataset, consisting of scientific articles from different scientific domains. The weighted keyword-based approach achieved the highest total accuracy of 0.88. The multilingual BERT-based classification model was fine-tuned using a dataset of 14,178 abstracts from scientific articles and reached the highest total accuracy of 0.98.
"Automatically detecting scientific political science texts from a large general document index" by Nina Smirnova. arXiv:2406.03067, 2024-06-05.
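A weighted keyword-based hard filter of the kind described can be sketched as follows. The keywords, weights, and acceptance threshold here are hypothetical, not those used in the technical report.

```python
# Hypothetical domain keywords with weights (illustrative only).
POLISCI_KEYWORDS = {
    "election": 3, "parliament": 3, "voting": 3,
    "policy": 2, "government": 2, "party": 1,
}

def keyword_score(text, weights=POLISCI_KEYWORDS, threshold=4):
    """Sum keyword weights over the tokens of a metadata field and
    decide whether the entry passes the hard filter."""
    tokens = text.lower().split()
    score = sum(weights.get(tok, 0) for tok in tokens)
    return score, score >= threshold

print(keyword_score("parliament voting on policy"))
```

In the report's setup, such a filter would run over whatever metadata is available (title, abstract, or keywords), with the BERT-based soft filter handling entries the keyword rules cannot decide.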
Hao Peng, Huilian Sophie Qiu, Henrik Barslund Fosse, Brian Uzzi
How are the merits of innovative ideas communicated in science? Here we conduct semantic analyses of grant application success with a focus on scientific promotional language, which has been growing in frequency in many contexts and purportedly may convey an innovative idea's originality and significance. Our analysis attempts to surmount limitations of prior studies by examining the full text of tens of thousands of both funded and unfunded grants from three leading public and private funding agencies: the NIH, the NSF, and the Novo Nordisk Foundation, one of the world's largest private science foundations. We find a robust association between promotional language and the support and adoption of innovative ideas by funders and other scientists. First, the percentage of promotional language in a grant proposal is associated with up to a doubling of the grant's probability of being funded. Second, a grant's promotional language reflects its intrinsic level of innovativeness. Third, the percentage of promotional language predicts the expected citation and productivity impact of publications that are supported by funded grants. Lastly, a computer-assisted experiment that manipulates the promotional language in our data demonstrates how promotional language can communicate the merit of ideas through cognitive activation. With the incidence of promotional language in science steeply rising, and the pivotal role of grants in converting promising and aspirational ideas into solutions, our analysis provides empirical evidence that promotional language is associated with effectively communicating the merits of innovative scientific ideas.
"Promotional Language and the Adoption of Innovative Ideas in Science". arXiv:2406.02798, 2024-06-04.
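Measuring "the percentage of promotional language in a grant proposal" reduces to counting tokens drawn from a promotional lexicon. The terms below are illustrative placeholders, not the lexicon the paper used.

```python
import re

# Hypothetical promotional lexicon (illustrative only).
PROMO_TERMS = {"novel", "unprecedented", "groundbreaking", "unique",
               "transformative", "critical"}

def promo_share(text):
    """Fraction of word tokens that come from the promotional lexicon."""
    tokens = re.findall(r"[a-z]+", text.lower())
    if not tokens:
        return 0.0
    return sum(t in PROMO_TERMS for t in tokens) / len(tokens)

print(promo_share("a novel and transformative method"))  # → 0.4
```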
Conghui He, Wei Li, Zhenjiang Jin, Chao Xu, Bin Wang, Dahua Lin
The advancement of artificial intelligence (AI) hinges on the quality and accessibility of data, yet the current fragmentation and variability of data sources hinder efficient data utilization. The dispersion of data sources and diversity of data formats often lead to inefficiencies in data retrieval and processing, significantly impeding the progress of AI research and applications. To address these challenges, this paper introduces OpenDataLab, a platform designed to bridge the gap between diverse data sources and the need for unified data processing. OpenDataLab integrates a wide range of open-source AI datasets and enhances data acquisition efficiency through intelligent querying and high-speed downloading services. The platform employs a next-generation AI Data Set Description Language (DSDL), which standardizes the representation of multimodal and multi-format data, improving interoperability and reusability. Additionally, OpenDataLab optimizes data processing through tools that complement DSDL. By integrating data with unified data descriptions and smart data toolchains, OpenDataLab can improve data preparation efficiency by 30%. We anticipate that OpenDataLab will significantly boost artificial general intelligence (AGI) research and facilitate advancements in related AI fields. For more detailed information, please visit the platform's official website: https://opendatalab.com.
"OpenDataLab: Empowering General Artificial Intelligence with Open Datasets". arXiv:2407.13773, 2024-06-04.
Here, we argue that academics, journals, and publishers should no longer refer to the social media platform Twitter as such, but rather as X. Relying on Google Scholar, we found 16 examples of papers published in the last months of 2023 - essentially during the transition period between Twitter and X - that used Twitter and X, but in different ways. Unlike that transition period, in which the binary Twitter/X could reasonably appear in academic papers, we suggest that papers should no longer refer to Twitter as Twitter, but only as X, except in historical studies about the platform, because continued use of the old name would be factually incorrect.
"Twitter should now be referred to as X: How academics, journals and publishers need to make the nomenclatural transition" by Jaime A. Teixeira da Silva and Serhii Nazarovets. arXiv:2405.20670, 2024-05-31.