The Intelligence Studies Network is a comprehensive resource database for publications, events, conferences, and calls for papers in the field of intelligence studies. It offers a novel solution for monitoring, indexing, and visualising resources. Sources are automatically monitored and added to a manually curated database, ensuring the relevance of items to intelligence studies. Curated outputs are stored in a group library on Zotero, an open-source reference management tool. The metadata of items in Zotero is enriched with OpenAlex, an open access bibliographic database. Finally, outputs are listed and visualised in an app built with Streamlit, an open-source Python framework for building apps. This paper aims to explain the Intelligence Studies Network database and provide a detailed guide on data sources and the workflow. This study demonstrates that it is possible to create a specialised academic database by using open-source tools.
'Intelligence Studies Network': A human-curated database for indexing resources with open-source tools. Yusuf A. Ozkan. arXiv - CS - Digital Libraries, 2024-08-07. https://doi.org/arxiv-2408.03868
Standing at the forefront of knowledge dissemination, digital libraries curate vast collections of scientific literature. However, these scholarly writings are often laden with jargon and tailored for domain experts rather than the general public. As librarians, we strive to offer services to a diverse audience, including those with lower reading levels. To extend our services beyond mere access, we propose fine-tuning a language model to rewrite scholarly abstracts into more comprehensible versions, thereby making scholarly literature more accessible when requested. We began by introducing a corpus specifically designed for training models to simplify scholarly abstracts. This corpus consists of over three thousand pairs of abstracts and significance statements from diverse disciplines. We then fine-tuned four language models using this corpus. The outputs from the models were subsequently examined both quantitatively for accessibility and semantic coherence, and qualitatively for language quality, faithfulness, and completeness. Our findings show that the resulting models can improve readability by over three grade levels, while maintaining fidelity to the original content. Although commercial state-of-the-art models still hold an edge, our models are much more compact, can be deployed locally in an affordable manner, and alleviate the privacy concerns associated with using commercial models. We envision this work as a step toward more inclusive and accessible libraries, improving our services for young readers and those without a college degree.
Simplifying Scholarly Abstracts for Accessible Digital Libraries. Haining Wang, Jason Clark. arXiv - CS - Digital Libraries, 2024-08-07. https://doi.org/arxiv-2408.03899
Ivan Heibi, Arianna Moretti, Silvio Peroni, Marta Soricetti
This article presents the OpenCitations Index, a collection of open citation data maintained by OpenCitations, an independent, not-for-profit infrastructure organisation for open scholarship dedicated to publishing open bibliographic and citation data using Semantic Web and Linked Open Data technologies. The collection contains citation data harvested from multiple sources. Because different sources may provide citation data for bibliographic entities represented with different identifiers, and may therefore describe the same citation more than once, a deduplication mechanism has been implemented. This ensures that citations integrated into the OpenCitations Index are uniquely identified, even when different identifiers are used. The mechanism follows a specific workflow, which encompasses preprocessing of the original source data, management of the provided bibliographic metadata, and generation of new citation data to be integrated into the OpenCitations Index. The process relies on another data collection, OpenCitations Meta, and on a new globally persistent identifier, the OMID (OpenCitations Meta Identifier). As of July 2024, the OpenCitations Index stores over 2 billion unique citation links, harvested from Crossref, the National Institutes of Health Open Citation Collection (NIH-OCC), DataCite, OpenAIRE, and the Japan Link Center (JaLC). The OpenCitations Index can be systematically accessed and queried through several services, including a SPARQL endpoint, REST APIs, and web interfaces. Additionally, dataset dumps are available for free download and reuse (under a CC0 waiver) in various formats (CSV, N-Triples, and Scholix), including provenance and change-tracking information.
The OpenCitations Index. Ivan Heibi, Arianna Moretti, Silvio Peroni, Marta Soricetti. arXiv - CS - Digital Libraries, 2024-08-05. https://doi.org/arxiv-2408.02321
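The deduplication idea described in the OpenCitations Index abstract can be illustrated with a minimal sketch. It assumes a lookup table from external identifiers to OMIDs; in the real system that role is played by OpenCitations Meta. The table `ID_TO_OMID`, the function `deduplicate`, and all identifier values are hypothetical placeholders, not the project's actual code or data.

```python
from typing import Dict, List, Set, Tuple

# Hypothetical lookup table: external identifier -> OMID.
# The values are made-up placeholders for illustration only.
ID_TO_OMID: Dict[str, str] = {
    "doi:10.1000/a": "omid:br/1",
    "pmid:111": "omid:br/1",  # same work as doi:10.1000/a
    "doi:10.1000/b": "omid:br/2",
    "pmid:222": "omid:br/2",  # same work as doi:10.1000/b
}


def deduplicate(citations: List[Tuple[str, str]]) -> Set[Tuple[str, str]]:
    """Map citing/cited identifiers to OMIDs and keep each OMID pair once."""
    unique: Set[Tuple[str, str]] = set()
    for citing, cited in citations:
        unique.add((ID_TO_OMID[citing], ID_TO_OMID[cited]))
    return unique


raw = [
    ("doi:10.1000/a", "doi:10.1000/b"),  # citation as reported by one source
    ("pmid:111", "pmid:222"),            # same citation from another source
]
unique_links = deduplicate(raw)
print(unique_links)
```

Both raw records resolve to the same pair of OMIDs, so only one citation link survives.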
Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz
Systematic literature reviews are the highest quality of evidence in research. However, the review process is hindered by significant resource and data constraints. The Literature Review Network (LRN) is a first-of-its-kind explainable AI platform adhering to PRISMA 2020 standards, designed to automate the entire literature review process. LRN was evaluated in the domain of surgical glove practices using 3 search strings developed by experts to query PubMed. A non-expert trained all LRN models. Performance was benchmarked against an expert manual review. Explainability and performance metrics assessed LRN's ability to replicate the experts' review. Concordance was measured with the Jaccard index and confusion matrices. Researchers were blinded to each other's results until study completion. Overlapping studies were integrated into an LRN-generated systematic review. LRN models demonstrated superior classification accuracy without expert training, achieving 84.78% and 85.71% accuracy. The highest-performing model achieved high interrater reliability (k = 0.4953) and explainability metrics, linking 'reduce', 'accident', and 'sharp' with 'double-gloving'. Another LRN model covered 91.51% of the relevant literature despite diverging from the non-expert's judgments (k = 0.2174), with the terms 'latex', 'double' (gloves), and 'indication'. LRN outperformed the manual review (19,920 minutes over 11 months), reducing the entire process to 288.6 minutes over 5 days. This study demonstrates that explainable AI does not require expert training to successfully conduct PRISMA-compliant systematic literature reviews like an expert. LRN summarized the results of surgical glove studies and identified themes that were nearly identical to the clinical researchers' findings. Explainable AI can accurately expedite our understanding of clinical practices, potentially revolutionizing healthcare research.
The Literature Review Network: An Explainable Artificial Intelligence for Systematic Literature Reviews, Meta-analyses, and Method Development. Joshua Morriss, Tod Brindle, Jessica Bah Rösman, Daniel Reibsamen, Andreas Enz. arXiv - CS - Digital Libraries, 2024-08-05. https://doi.org/arxiv-2408.05239
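The Literature Review Network abstract reports concordance with the Jaccard index and interrater reliability with a kappa statistic. The sketch below shows the textbook definitions of both on toy data. It assumes kappa means Cohen's kappa for two raters. The study identifiers and decisions are invented for illustration and are not taken from the paper.

```python
from typing import List, Set


def jaccard(a: Set[str], b: Set[str]) -> float:
    """Size of the intersection divided by size of the union."""
    union = a | b
    return len(a & b) / len(union) if union else 1.0


def cohen_kappa(r1: List[int], r2: List[int]) -> float:
    """Chance-corrected agreement between two raters on the same items."""
    n = len(r1)
    observed = sum(x == y for x, y in zip(r1, r2)) / n
    labels = set(r1) | set(r2)
    # Agreement expected if both raters labelled independently at their own rates.
    expected = sum((r1.count(l) / n) * (r2.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)


# Toy data: which studies each reviewer included (hypothetical identifiers).
model_included = {"s1", "s2", "s3", "s4"}
expert_included = {"s2", "s3", "s4", "s5"}
overlap = jaccard(model_included, expert_included)

# Include (1) / exclude (0) decisions for studies s1..s6, in order.
model_votes = [1, 1, 1, 1, 0, 0]
expert_votes = [0, 1, 1, 1, 1, 0]
kappa = cohen_kappa(model_votes, expert_votes)
print(overlap, kappa)
```

Here the two reviewers share three of five distinct included studies, and they agree on four of six decisions while chance alone would predict agreement on five in nine.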
This paper introduces the Unique Citing Documents Journal Impact Factor (Uniq-JIF) as a supplement to the traditional Journal Impact Factor (JIF). The Uniq-JIF counts each citing document only once, aiming to reduce the effects of citation manipulation. Analysis of 2023 Journal Citation Reports data shows that for most journals, the Uniq-JIF is less than 20% lower than the JIF, though some journals show a drop of over 75%. The Uniq-JIF also highlights significant reductions for journals suppressed due to citation issues, indicating its effectiveness in identifying problematic journals. The Uniq-JIF offers a more nuanced view of a journal's influence and can help reveal journals needing further scrutiny.
The Unique Citing Documents Journal Impact Factor (Uniq-JIF) as a Supplement for the standard Journal Impact Factor. Zhesi Shen, Li Li, Yu Liao. arXiv - CS - Digital Libraries, 2024-08-05. https://doi.org/arxiv-2408.08884
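The difference between the JIF and the Uniq-JIF described in the abstract can be sketched on toy data. The abstract states only that the Uniq-JIF counts each citing document once. Keeping the same denominator (number of citable items) for both metrics is an assumption made here, and the function names and data are hypothetical.

```python
from typing import List, Tuple

# Each pair is (citing document, cited item in the journal); toy data.
Citation = Tuple[str, str]


def jif(citations: List[Citation], citable_items: int) -> float:
    """Every citation link counts, even several from one citing document."""
    return len(citations) / citable_items


def uniq_jif(citations: List[Citation], citable_items: int) -> float:
    """Each citing document counts once, however many items it cites."""
    unique_citing = {citing for citing, _ in citations}
    return len(unique_citing) / citable_items


citations = [
    ("d1", "p1"), ("d1", "p2"), ("d1", "p3"),  # one document citing three items
    ("d2", "p1"),
    ("d3", "p2"),
]
standard = jif(citations, citable_items=4)
unique = uniq_jif(citations, citable_items=4)
relative_drop = 1 - unique / standard
print(standard, unique, relative_drop)
```

A journal whose citations come from a few documents that each cite many of its items shows a large relative drop, which is the pattern the abstract associates with journals needing further scrutiny.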
As the use of Generative Artificial Intelligence tools has grown in higher education and research, there have been increasing calls for transparency and granularity around the use and attribution of these tools. Thus far, this need has been met via the recommended inclusion of a note, with little to no guidance on what the note itself should include. This has been identified as an obstacle to the use of AI in academic and research contexts. This article introduces The Artificial Intelligence Disclosure (AID) Framework, a standard, comprehensive, and detailed framework meant to inform the development and writing of GenAI disclosures for education and research.
The Artificial Intelligence Disclosure (AID) Framework: An Introduction. Kari D. Weaver. arXiv - CS - Digital Libraries, 2024-08-04. https://doi.org/arxiv-2408.01904
This study aimed to use bibliometric methods to analyze trends in international Magnetoencephalography (MEG) research from 2013 to 2022. Due to the limited volume of domestic literature on MEG, this analysis focuses solely on the global research landscape, providing insights from the past decade as a representative sample. This study utilized bibliometric methods to explore and analyze the progress, hotspots and developmental trends in international MEG research spanning from 1995 to 2022. The results indicated a dynamic and steady growth trend in the overall number of publications in MEG. Ryusuke Kakigi emerged as the most prolific author, while NeuroImage led as the most prolific journal. Current hotspots in MEG research encompass resting state, networks, functional connectivity, phase dynamics, oscillation, and more. Future trends in MEG research are poised to advance across three key aspects: disease treatment and practical applications, experimental foundations and technical advancements, and fundamental and advanced human cognition. In the future, there should be a focus on enhancing cross-integration and utilization of MEG with other instruments to diversify research methodologies in this field.
Hotspots and Trends in Magnetoencephalography Research (2013-2022): A Bibliometric Analysis. Shen Liu, Jingwen Zhao. arXiv - CS - Digital Libraries, 2024-08-02. https://doi.org/arxiv-2408.08877
Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary
HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of papers submitted on HAL. We craft our dataset by filtering HAL for scholarly publications, resulting in approximately 700,000 documents, spanning 34 languages across 13 identified domains, suitable for language model training, and yielding approximately 16.5 billion tokens (with 8 billion in French and 7 billion in English, the most represented languages). We transform the metadata of each paper into a citation network, producing a directed heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open submitted papers, and their citations. We provide a baseline for authorship attribution using the dataset, implement a range of state-of-the-art models in graph representation learning for link prediction, and discuss the usefulness of our generated knowledge graph structure.
Harvesting Textual and Structured Data from the HAL Publication Repository. Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary. arXiv - CS - Digital Libraries, 2024-07-30. https://doi.org/arxiv-2407.20595
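The HALvest abstract describes turning paper metadata into a directed heterogeneous graph of authors, papers, and citations. A minimal sketch of that data structure follows. The record fields (`id`, `authors`, `cites`), the relation names, and the `HeteroGraph` class are hypothetical and chosen for illustration. They do not reflect the dataset's actual schema.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Node = Tuple[str, str]  # (node type, identifier), e.g. ("paper", "p1")


class HeteroGraph:
    """Directed graph whose nodes and edges each carry a type."""

    def __init__(self) -> None:
        # relation name -> source node -> list of target nodes
        self.edges: Dict[str, Dict[Node, List[Node]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def add_edge(self, src: Node, relation: str, dst: Node) -> None:
        self.edges[relation][src].append(dst)

    def neighbours(self, src: Node, relation: str) -> List[Node]:
        return list(self.edges[relation][src])


def build_graph(records: List[dict]) -> HeteroGraph:
    """Create author->paper and paper->paper edges from metadata records."""
    graph = HeteroGraph()
    for rec in records:
        paper = ("paper", rec["id"])
        for author in rec["authors"]:
            graph.add_edge(("author", author), "writes", paper)
        for cited in rec["cites"]:
            graph.add_edge(paper, "cites", ("paper", cited))
    return graph


# Toy metadata records (hypothetical field names and values).
records = [
    {"id": "p1", "authors": ["a1", "a2"], "cites": ["p2"]},
    {"id": "p2", "authors": ["a2"], "cites": []},
]
graph = build_graph(records)
print(graph.neighbours(("author", "a2"), "writes"))
```

Keeping edges grouped by relation makes it straightforward to sample positive and negative pairs per relation, which is the usual setup for link prediction.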
I propose the R-Index, defined as the difference between the sum of review responsibilities for a researcher's publications and the number of reviews they have completed, as a novel metric to effectively characterize a researcher's contribution to the peer review process. This index aims to balance the demands placed on the peer review system by a researcher's publication output with their engagement in reviewing others' work, providing a measure of whether they are giving back to the academic community commensurately with their own publication demands. The R-Index offers a straightforward and fair approach to encourage equitable participation in peer review, thereby supporting the sustainability and efficiency of the scholarly publishing process.
R-Index: A Metric for Assessing Researcher Contributions to Peer Review. Milad Malekzadeh. arXiv - CS - Digital Libraries, 2024-07-29. https://doi.org/arxiv-2407.19949
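The R-Index definition in the abstract (sum of review responsibilities minus reviews completed) can be made concrete with a small sketch. The abstract does not say how the review responsibility of a single publication is computed. The rule used here, reviewers per paper divided by number of authors, is an assumption for illustration only, and the function names are hypothetical.

```python
from typing import List, Tuple


def review_responsibility(n_reviewers: int, n_authors: int) -> float:
    """Assumed rule: each author carries an equal share of the reviews
    that a paper consumed. This is not stated in the abstract."""
    return n_reviewers / n_authors


def r_index(publications: List[Tuple[int, int]], reviews_completed: int) -> float:
    """Sum of review responsibilities minus the number of reviews completed.

    `publications` holds (n_reviewers, n_authors) for each paper.
    """
    owed = sum(review_responsibility(r, a) for r, a in publications)
    return owed - reviews_completed


# Toy researcher: three papers and two completed reviews.
papers = [(3, 1), (2, 4), (3, 3)]
score = r_index(papers, reviews_completed=2)
print(score)
```

Under this definition a positive value means the researcher's publications have drawn more reviewing effort than the researcher has given back, and a value at or below zero means the opposite.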
Adilson Vital Jr., Filipi N. Silva, Osvaldo N. Oliveira Jr., Diego R. Amancio
The impact of research papers, typically measured in terms of citation counts, depends on several factors, including the reputation of the authors, journals, and institutions, in addition to the quality of the scientific work. In this paper, we present an approach that combines natural language processing and machine learning to predict the impact of papers in a specific journal. Our focus is on the text, which should correlate with impact and the topics covered in the research. We employed a dataset of over 40,000 articles from ACS Applied Materials and Interfaces spanning from 2012 to 2022. The data was processed using various text embedding techniques and classified with supervised machine learning algorithms. Papers were classified according to whether they were among the top 20% most cited within the journal, using both yearly and cumulative citation counts as metrics. Our analysis reveals that the method employing generative pre-trained transformers (GPT) was the most efficient for embedding, while the random forest algorithm exhibited the best predictive power among the machine learning algorithms. An optimized accuracy of 80% in predicting whether a paper was among the top 20% most cited was achieved for the cumulative citation count when abstracts were processed. This accuracy is noteworthy, considering that author, institution, and early citation pattern information were not taken into account. The accuracy increased only slightly when the full texts of the papers were processed. Also significant is the finding that a simpler embedding technique, term frequency-inverse document frequency (TFIDF), yielded performance close to that of GPT. Since TFIDF captures the topics of the paper, we infer that, apart from considering author and institution biases, citation counts for the considered journal may be predicted by identifying topics and "reading" the abstract of a paper.
{"title":"Predicting citation impact of research papers using GPT and other text embeddings","authors":"Adilson Vital Jr., Filipi N. Silva, Osvaldo N. Oliveira Jr., Diego R. Amancio","doi":"arxiv-2407.19942","DOIUrl":"https://doi.org/arxiv-2407.19942","url":null,"abstract":"The impact of research papers, typically measured in terms of citation counts, depends on several factors, including the reputation of the authors, journals, and institutions, in addition to the quality of the scientific work. In this paper, we present an approach that combines natural language processing and machine learning to predict the impact of papers in a specific journal. Our focus is on the text, which should correlate with impact and the topics covered in the research. We employed a dataset of over 40,000 articles from ACS Applied Materials and Interfaces spanning from 2012 to 2022. The data was processed using various text embedding techniques and classified with supervised machine learning algorithms. Papers were categorized into the top 20% most cited within the journal, using both yearly and cumulative citation counts as metrics. Our analysis reveals that the method employing generative pre-trained transformers (GPT) was the most efficient for embedding, while the random forest algorithm exhibited the best predictive power among the machine learning algorithms. An optimized accuracy of 80% in predicting whether a paper was among the top 20% most cited was achieved for the cumulative citation count when abstracts were processed. This accuracy is noteworthy, considering that author, institution, and early citation pattern information were not taken into account. The accuracy increased only slightly when the full texts of the papers were processed. Also significant is the finding that a simpler embedding technique, term frequency-inverse document frequency (TFIDF), yielded performance close to that of GPT. Since TFIDF captures the topics of the paper we infer that, apart from considering author and institution biases, citation counts for the considered journal may be predicted by identifying topics and \"reading\" the abstract of a paper.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"50 1","pages":""},"PeriodicalIF":0.0,"publicationDate":"2024-07-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141868190","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
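Two steps named in the abstract above, labelling the top 20% most cited papers and building TFIDF features from abstracts, can be illustrated with a minimal standard-library sketch. This is not the authors' code: the whitespace tokenizer, the unsmoothed `tf * log(N / df)` weighting, the function names, and the five toy abstracts with their citation counts are all assumptions made for illustration. The GPT embeddings and the random forest classifier are omitted.

```python
import math
from collections import Counter


def top_fraction_labels(citations, fraction=0.2):
    """Return 1 for papers in the top `fraction` by citation count, else 0."""
    n_top = max(1, round(len(citations) * fraction))
    # Rank paper indices from most to least cited; ties keep input order.
    ranked = sorted(range(len(citations)), key=lambda i: citations[i], reverse=True)
    top = set(ranked[:n_top])
    return [1 if i in top else 0 for i in range(len(citations))]


def tfidf(documents):
    """Return one {term: weight} dict per document using tf * log(N / df)."""
    tokenized = [doc.lower().split() for doc in documents]
    n_docs = len(tokenized)
    # Document frequency: number of documents in which each term appears.
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        vectors.append({
            term: (count / len(tokens)) * math.log(n_docs / doc_freq[term])
            for term, count in counts.items()
        })
    return vectors


# Toy data, invented for illustration only.
abstracts = [
    "perovskite solar cells",
    "graphene oxide membranes",
    "perovskite films",
    "polymer coatings",
    "graphene sensors",
]
citations = [120, 15, 40, 8, 33]

labels = top_fraction_labels(citations)
features = tfidf(abstracts)
print(labels)
print(features[0])
```

In a fuller pipeline, the `features` vectors and `labels` would feed a supervised classifier; terms that appear in every document receive zero weight under this weighting, so only topic-distinguishing words contribute.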