{"title":"Towards understanding evolution of science through language model series","authors":"Junjie Dong, Zhuoqi Lyu, Qing Ke","doi":"arxiv-2409.09636","DOIUrl":null,"url":null,"abstract":"We introduce AnnualBERT, a series of language models designed specifically to\ncapture the temporal evolution of scientific text. Deviating from the\nprevailing paradigms of subword tokenizations and \"one model to rule them all\",\nAnnualBERT adopts whole words as tokens and is composed of a base RoBERTa model\npretrained from scratch on the full-text of 1.7 million arXiv papers published\nuntil 2008 and a collection of progressively trained models on arXiv papers at\nan annual basis. We demonstrate the effectiveness of AnnualBERT models by\nshowing that they not only have comparable performances in standard tasks but\nalso achieve state-of-the-art performances on domain-specific NLP tasks as well\nas link prediction tasks in the arXiv citation network. We then utilize probing\ntasks to quantify the models' behavior in terms of representation learning and\nforgetting as time progresses. Our approach enables the pretrained models to\nnot only improve performances on scientific text processing tasks but also to\nprovide insights into the development of scientific discourse over time. The\nseries of the models is available at https://huggingface.co/jd445/AnnualBERTs.","PeriodicalId":501285,"journal":{"name":"arXiv - CS - Digital Libraries","volume":"4 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Digital Libraries","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09636","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
We introduce AnnualBERT, a series of language models designed specifically to capture the temporal evolution of scientific text. Deviating from the prevailing paradigms of subword tokenization and "one model to rule them all", AnnualBERT adopts whole words as tokens and comprises a base RoBERTa model pretrained from scratch on the full text of 1.7 million arXiv papers published up to 2008, together with a collection of models progressively trained on subsequent arXiv papers on an annual basis. We demonstrate the effectiveness of the AnnualBERT models by showing that they not only achieve comparable performance on standard tasks but also reach state-of-the-art performance on domain-specific NLP tasks as well as link prediction tasks in the arXiv citation network. We then use probing tasks to quantify the models' behavior in terms of representation learning and forgetting as time progresses. Our approach enables the pretrained models not only to improve performance on scientific text-processing tasks but also to provide insights into the development of scientific discourse over time. The model series is available at https://huggingface.co/jd445/AnnualBERTs.
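Below is a minimal sketch, not the authors' released code, of how one of the AnnualBERT checkpoints could be loaded from the Hugging Face Hub and then continued on a further year of arXiv text with standard masked-language-model pretraining, mirroring the progressive annual training described in the abstract. The repository id `jd445/AnnualBERTs` comes from the paper; everything else (that the checkpoints load with the standard `transformers` auto classes, the per-year output directory name, and the toy in-memory corpus) is an assumption for illustration.

```python
# Hedged sketch: continue MLM pretraining of an AnnualBERT-style checkpoint on
# "next year's" papers. Assumes the Hub repo loads with standard transformers
# auto classes; the toy corpus below stands in for one year of arXiv full texts.
from transformers import (
    AutoTokenizer,
    AutoModelForMaskedLM,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)
from datasets import Dataset

MODEL_ID = "jd445/AnnualBERTs"  # from the paper; per-year checkpoint layout not assumed here

tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForMaskedLM.from_pretrained(MODEL_ID)

# Toy stand-in for one year's arXiv full texts (hypothetical data).
corpus = Dataset.from_dict({"text": [
    "We study the spectral properties of sparse random graphs.",
    "A transformer model is pretrained on scientific abstracts.",
]})

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=128)

tokenized = corpus.map(tokenize, batched=True, remove_columns=["text"])

# Standard masked-language-model objective (default 15% token masking).
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=True)

args = TrainingArguments(
    output_dir="annualbert-next-year",  # hypothetical name
    per_device_train_batch_size=8,
    num_train_epochs=1,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=tokenized,
    data_collator=collator,
)
trainer.train()
```

The same loading pattern (tokenizer plus masked-LM head) would also serve the downstream uses mentioned in the abstract, e.g. extracting year-specific representations for probing or link prediction, by taking hidden states from the loaded model instead of continuing training.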