Harvesting Textual and Structured Data from the HAL Publication Repository

Francis Kulumba, Wissam Antoun, Guillaume Vimont, Laurent Romary
arXiv - CS - Digital Libraries, published 2024-07-30. DOI: arxiv-2407.20595 (https://doi.org/arxiv-2407.20595)

Abstract

HAL (Hyper Articles en Ligne) is the French national publication repository, used by most higher education and research organizations for their open science policy. As a digital library, it is a rich repository of scholarly documents, but its potential for advanced research has been underutilized. We present HALvest, a unique dataset that bridges the gap between citation networks and the full text of papers submitted on HAL. We craft our dataset by filtering HAL for scholarly publications, resulting in approximately 700,000 documents spanning 34 languages across 13 identified domains. The corpus is suitable for language model training and yields approximately 16.5 billion tokens, with 8 billion in French and 7 billion in English, the two most represented languages. We transform the metadata of each paper into a citation network, producing a directed heterogeneous graph. This graph includes uniquely identified authors on HAL, as well as all open submitted papers and their citations. We provide a baseline for authorship attribution using the dataset, implement a range of state-of-the-art models in graph representation learning for link prediction, and discuss the usefulness of our generated knowledge graph structure.
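
As a rough illustration of what a directed heterogeneous graph of authors, papers, and citations might look like, here is a minimal Python sketch. It is not the HALvest schema or code: the abstract does not specify either, so the class name `HeteroGraph`, the node identifiers, and the edge type names `authored` and `cites` are all hypothetical.

```python
from collections import defaultdict


class HeteroGraph:
    """Directed graph whose nodes and edges each carry a type label.

    Hypothetical structure for illustration only; not taken from HALvest.
    """

    def __init__(self):
        self.node_type = {}                # node id -> "author" or "paper"
        self.out_edges = defaultdict(set)  # (source id, edge type) -> target ids

    def add_node(self, node_id, node_type):
        self.node_type[node_id] = node_type

    def add_edge(self, src, edge_type, dst):
        # Edges are directed: src -> dst. Both endpoints must already exist.
        if src not in self.node_type or dst not in self.node_type:
            raise KeyError("both endpoints must be added as nodes first")
        self.out_edges[(src, edge_type)].add(dst)

    def neighbors(self, src, edge_type):
        # Targets reachable from src along one edge of the given type.
        return sorted(self.out_edges.get((src, edge_type), set()))


g = HeteroGraph()
g.add_node("author:a1", "author")  # assumed identifiers, not real HAL ids
g.add_node("paper:p1", "paper")
g.add_node("paper:p2", "paper")
g.add_edge("author:a1", "authored", "paper:p1")
g.add_edge("paper:p1", "cites", "paper:p2")

print(g.neighbors("paper:p1", "cites"))
```

In such a structure, a link prediction task would amount to scoring candidate (source, edge type, target) triples that are absent from `out_edges`, for example whether one paper is likely to cite another.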