Synthetic continued pretraining

Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto
{"title":"Synthetic continued pretraining","authors":"Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto","doi":"arxiv-2409.07431","DOIUrl":null,"url":null,"abstract":"Pretraining on large-scale, unstructured internet text has enabled language\nmodels to acquire a significant amount of world knowledge. However, this\nknowledge acquisition is data-inefficient -- to learn a given fact, models must\nbe trained on hundreds to thousands of diverse representations of it. This\nposes a challenge when adapting a pretrained model to a small corpus of\ndomain-specific documents, where each fact may appear rarely or only once. We\npropose to bridge this gap with synthetic continued pretraining: using the\nsmall domain-specific corpus to synthesize a large corpus more amenable to\nlearning, and then performing continued pretraining on the synthesized corpus.\nWe instantiate this proposal with EntiGraph, a synthetic data augmentation\nalgorithm that extracts salient entities from the source documents and then\ngenerates diverse text by drawing connections between the sampled entities.\nSynthetic continued pretraining using EntiGraph enables a language model to\nanswer questions and follow generic instructions related to the source\ndocuments without access to them. If instead, the source documents are\navailable at inference time, we show that the knowledge acquired through our\napproach compounds with retrieval-augmented generation. To better understand\nthese results, we build a simple mathematical model of EntiGraph, and show how\nsynthetic data augmentation can \"rearrange\" knowledge to enable more\ndata-efficient learning.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.07431","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Pretraining on large-scale, unstructured internet text has enabled language models to acquire a significant amount of world knowledge. However, this knowledge acquisition is data-inefficient -- to learn a given fact, models must be trained on hundreds to thousands of diverse representations of it. This poses a challenge when adapting a pretrained model to a small corpus of domain-specific documents, where each fact may appear rarely or only once. We propose to bridge this gap with synthetic continued pretraining: using the small domain-specific corpus to synthesize a large corpus more amenable to learning, and then performing continued pretraining on the synthesized corpus. We instantiate this proposal with EntiGraph, a synthetic data augmentation algorithm that extracts salient entities from the source documents and then generates diverse text by drawing connections between the sampled entities. Synthetic continued pretraining using EntiGraph enables a language model to answer questions and follow generic instructions related to the source documents without access to them. If, instead, the source documents are available at inference time, we show that the knowledge acquired through our approach compounds with retrieval-augmented generation. To better understand these results, we build a simple mathematical model of EntiGraph, and show how synthetic data augmentation can "rearrange" knowledge to enable more data-efficient learning.
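
The abstract describes EntiGraph only at a high level: extract salient entities from the source documents, then synthesize diverse text that draws connections between sampled entities. The sketch below is a minimal illustration of that idea, not the authors' implementation; the generic `llm` callable, the prompt wording, and the choice to sample entity pairs are all assumptions made for illustration.

```python
"""Illustrative EntiGraph-style augmentation sketch (assumptions noted inline)."""
import itertools
import random
from typing import Callable, List

# `llm` is a placeholder for any text-completion backend (an assumption,
# not part of the paper): it takes a prompt string and returns generated text.
LLM = Callable[[str], str]


def extract_entities(document: str, llm: LLM) -> List[str]:
    """Step 1: ask the backing model to list salient entities, one per line."""
    prompt = (
        "List the salient entities (people, places, objects, concepts) "
        "mentioned in the following document, one per line:\n\n" + document
    )
    return [line.strip() for line in llm(prompt).splitlines() if line.strip()]


def synthesize_relation_text(document: str, entities: List[str], llm: LLM) -> str:
    """Step 2: generate a passage discussing how the sampled entities relate."""
    prompt = (
        "Based on the document below, write a detailed passage discussing the "
        f"relationship between {', '.join(entities)}.\n\nDocument:\n" + document
    )
    return llm(prompt)


def entigraph_style_augmentation(document: str, llm: LLM,
                                 max_samples: int = 100, seed: int = 0) -> List[str]:
    """Produce a synthetic corpus by iterating over sampled entity subsets.

    Pairs are used here for simplicity; the actual sampling scheme in the
    paper may differ.
    """
    rng = random.Random(seed)
    entities = extract_entities(document, llm)
    pairs = list(itertools.combinations(entities, 2))
    rng.shuffle(pairs)
    return [synthesize_relation_text(document, list(pair), llm)
            for pair in pairs[:max_samples]]
```

In this sketch, the returned passages would form the synthesized corpus on which continued pretraining is then performed, per the procedure the abstract calls synthetic continued pretraining.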