Towards understanding evolution of science through language model series

Junjie Dong, Zhuoqi Lyu, Qing Ke
{"title":"通过语言模型系列了解科学的演变","authors":"Junjie Dong, Zhuoqi Lyu, Qing Ke","doi":"arxiv-2409.09636","DOIUrl":null,"url":null,"abstract":"We introduce AnnualBERT, a series of language models designed specifically to\ncapture the temporal evolution of scientific text. Deviating from the\nprevailing paradigms of subword tokenizations and \"one model to rule them all\",\nAnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model\npretrained from scratch on the full-text of 1.7 million arXiv papers published\nuntil 2008 and a collection of progressively trained models on arXiv papers at\nan annual basis. We demonstrate the effectiveness of AnnualBERT models by\nshowing that they not only have comparable performances in standard tasks but\nalso achieve state-of-the-art performances on domain-specific NLP tasks as well\nas link prediction tasks in the arXiv citation network. We then utilize probing\ntasks to quantify the models' behavior in terms of representation learning and\nforgetting as time progresses. Our approach enables the pretrained models to\nnot only improve performances on scientific text processing tasks but also to\nprovide insights into the development of scientific discourse over time. The\nseries of the models is available at https://huggingface.co/jd445/AnnualBERTs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Towards understanding evolution of science through language model series\",\"authors\":\"Junjie Dong, Zhuoqi Lyu, Qing Ke\",\"doi\":\"arxiv-2409.09636\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"We introduce AnnualBERT, a series of language models designed specifically to\\ncapture the temporal evolution of scientific text. Deviating from the\\nprevailing paradigms of subword tokenizations and \\\"one model to rule them all\\\",\\nAnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model\\npretrained from scratch on the full-text of 1.7 million arXiv papers published\\nuntil 2008 and a collection of progressively trained models on arXiv papers at\\nan annual basis. We demonstrate the effectiveness of AnnualBERT models by\\nshowing that they not only have comparable performances in standard tasks but\\nalso achieve state-of-the-art performances on domain-specific NLP tasks as well\\nas link prediction tasks in the arXiv citation network. We then utilize probing\\ntasks to quantify the models' behavior in terms of representation learning and\\nforgetting as time progresses. Our approach enables the pretrained models to\\nnot only improve performances on scientific text processing tasks but also to\\nprovide insights into the development of scientific discourse over time. 
The\\nseries of the models is available at https://huggingface.co/jd445/AnnualBERTs.\",\"PeriodicalId\":501285,\"journal\":{\"name\":\"arXiv - CS - Digital Libraries\",\"volume\":\"4 1\",\"pages\":\"\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-09-15\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Digital Libraries\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2409.09636\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09636","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0

Abstract

We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenizations and "one model to rule them all", AnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published until 2008 and a collection of progressively trained models on arXiv papers on an annual basis. We demonstrate the effectiveness of AnnualBERT models by showing that they not only have comparable performances in standard tasks but also achieve state-of-the-art performances on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then utilize probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models to not only improve performances on scientific text processing tasks but also to provide insights into the development of scientific discourse over time. The series of the models is available at https://huggingface.co/jd445/AnnualBERTs.
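Since the abstract points readers to the released checkpoints, below is a minimal sketch of how one might load a single AnnualBERT model with the Hugging Face transformers library and extract sentence embeddings for downstream probing or link-prediction experiments. The collection URL comes from the abstract; the concrete repository id used here, the tokenizer class resolved by AutoTokenizer, and the mean-pooling step are illustrative assumptions, not the authors' documented usage.

```python
# Minimal sketch: load one AnnualBERT checkpoint from the Hugging Face Hub.
# The collection lives at https://huggingface.co/jd445/AnnualBERTs; the
# repository id below is a hypothetical placeholder -- check the collection
# page for the actual model names before running.
import torch
from transformers import AutoTokenizer, AutoModel

model_id = "jd445/AnnualBERT_2008"  # hypothetical id for the base (<=2008) model

# AnnualBERT uses whole words rather than subwords as tokens, so the tokenizer
# resolved here may behave differently from a standard RoBERTa tokenizer.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModel.from_pretrained(model_id)
model.eval()

sentence = "We study link prediction in the arXiv citation network."
inputs = tokenizer(sentence, return_tensors="pt")

with torch.no_grad():
    outputs = model(**inputs)

# Mean-pool the last hidden states into one sentence embedding, which could
# then feed a probing classifier or a link-prediction head.
embedding = outputs.last_hidden_state.mean(dim=1)
print(embedding.shape)
```

Repeating the same pooling step with checkpoints from different years would give year-specific embeddings of the same text, which is the kind of comparison the paper's probing analysis of representation learning and forgetting relies on.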