Synthetic continued pretraining
Zitong Yang, Neil Band, Shuangping Li, Emmanuel Candès, Tatsunori Hashimoto
arXiv - STAT - Machine Learning, 2024-09-11. DOI: arxiv-2409.07431 (https://doi.org/arxiv-2409.07431)
Pretraining on large-scale, unstructured internet text has enabled language
models to acquire a significant amount of world knowledge. However, this
knowledge acquisition is data-inefficient -- to learn a given fact, models must
be trained on hundreds to thousands of diverse representations of it. This
poses a challenge when adapting a pretrained model to a small corpus of
domain-specific documents, where each fact may appear rarely or only once. We
propose to bridge this gap with synthetic continued pretraining: using the
small domain-specific corpus to synthesize a large corpus more amenable to
learning, and then performing continued pretraining on the synthesized corpus.
We instantiate this proposal with EntiGraph, a synthetic data augmentation
algorithm that extracts salient entities from the source documents and then
generates diverse text by drawing connections between the sampled entities.
Synthetic continued pretraining using EntiGraph enables a language model to
answer questions and follow generic instructions related to the source
documents without access to them. If, instead, the source documents are
available at inference time, we show that the knowledge acquired through our
approach compounds with retrieval-augmented generation. To better understand
these results, we build a simple mathematical model of EntiGraph, and show how
synthetic data augmentation can "rearrange" knowledge to enable more
data-efficient learning.
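
A minimal sketch of the entity-centric augmentation the abstract describes: extract salient entities from a source document, sample combinations of them, and prompt a model to write text connecting each sample. The `llm` helper, the prompt wording, and the pairwise sampling below are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of an EntiGraph-style augmentation loop.
# `llm(prompt)` stands in for any chat/completion API call; the prompts
# are placeholders, not the paper's prompts.
import itertools
import random


def llm(prompt: str) -> str:
    """Placeholder for a call to a language model API."""
    raise NotImplementedError


def extract_entities(document: str) -> list[str]:
    """Ask the model to list salient entities (people, places, concepts)."""
    response = llm(
        "List the salient entities (people, places, objects, concepts) "
        f"in the following document, one per line:\n\n{document}"
    )
    return [line.strip() for line in response.splitlines() if line.strip()]


def synthesize_corpus(document: str, num_passages: int, seed: int = 0) -> list[str]:
    """Generate diverse passages by drawing connections between sampled entities."""
    rng = random.Random(seed)
    entities = extract_entities(document)
    pairs = list(itertools.combinations(entities, 2))
    rng.shuffle(pairs)
    passages = []
    for a, b in pairs[:num_passages]:
        passages.append(llm(
            "Using only information from the document below, write a passage "
            f'analyzing the relationship between "{a}" and "{b}".\n\n{document}'
        ))
    return passages
```

Because a document with k salient entities yields on the order of k(k-1)/2 pairs, even a small corpus can seed many distinct synthetic passages grounded in the same underlying facts, consistent with the "rearranging" of knowledge the abstract describes.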