Synthetic continued pretraining

Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto
{"title":"Synthetic continued pretraining","authors":"Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto","doi":"arxiv-2409.07431","DOIUrl":null,"url":null,"abstract":"Pretraining on large-scale, unstructured internet text has enabled language\nmodels to acquire a significant amount of world knowledge. However, this\nknowledge acquisition is data-inefficient -- to learn a given fact, models must\nbe trained on hundreds to thousands of diverse representations of it. This\nposes a challenge when adapting a pretrained model to a small corpus of\ndomain-specific documents, where each fact may appear rarely or only once. We\npropose to bridge this gap with synthetic continued pretraining: using the\nsmall domain-specific corpus to synthesize a large corpus more amenable to\nlearning, and then performing continued pretraining on the synthesized corpus.\nWe instantiate this proposal with EntiGraph, a synthetic data augmentation\nalgorithm that extracts salient entities from the source documents and then\ngenerates diverse text by drawing connections between the sampled entities.\nSynthetic continued pretraining using EntiGraph enables a language model to\nanswer questions and follow generic instructions related to the source\ndocuments without access to them. If instead, the source documents are\navailable at inference time, we show that the knowledge acquired through our\napproach compounds with retrieval-augmented generation. To better understand\nthese results, we build a simple mathematical model of EntiGraph, and show how\nsynthetic data augmentation can \"rearrange\" knowledge to enable more\ndata-efficient learning.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07431","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient -- to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If, instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
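
The abstract describes EntiGraph only at a high level: extract salient entities from the source documents, then synthesize diverse text that draws connections between sampled entities. The sketch below is a minimal illustration of that idea, not the authors' implementation; the generic `llm` callable, the prompt wording, and the choice to sample entity pairs are all assumptions made for illustration.

```python
"""Illustrative EntiGraph-style augmentation sketch (assumptions noted inline)."""
import itertools
import random
from typing import Callable, List

# `llm` is a placeholder for any text-completion backend (an assumption,
# not part of the paper): it takes a prompt string and returns generated text.
LLM = Callable[[str], str]


def extract_entities(document: str, llm: LLM) -> List[str]:
    """Step 1: ask the backing model to list salient entities, one per line."""
    prompt = (
        "List the salient entities (people, places, objects, concepts) "
        "mentioned in the following document, one per line:\n\n" + document
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def synthesize_relation_text(document: str, entities: List[str], llm: LLM) -> str:
    """Step 2: generate a passage discussing how the sampled entities relate."""
    prompt = (
        "Based on the document below, write a detailed passage discussing the "
        f"relationship between {', '.join(entities)}.\n\nDocument:\n" + document
    )
    return llm(prompt)


def entigraph_style_augmentation(document: str, llm: LLM,
                                 max_samples: int = 100, seed: int = 0) -> List[str]:
    """Produce a synthetic corpus by iterating over sampled entity subsets.

    Pairs are used here for simplicity; the actual sampling scheme in the
    paper may differ.
    """
    rng = random.Random(seed)
    entities = extract_entities(document, llm)
    pairs = list(itertools.combinations(entities, 2))
    rng.shuffle(pairs)
    return [synthesize_relation_text(document, list(pair), llm)
            for pair in pairs[:max_samples]]
```

In this sketch, the returned passages would form the synthesized corpus on which continued pretraining is then performed, per the procedure the abstract calls synthetic continued pretraining.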