Improving Pretraining Data Using Perplexity Correlations

Tristan Thrush, Christopher Potts, Tatsunori Hashimoto
{"title":"Improving Pretraining Data Using Perplexity Correlations","authors":"Tristan Thrush, Christopher Potts, Tatsunori Hashimoto","doi":"arxiv-2409.05816","DOIUrl":null,"url":null,"abstract":"Quality pretraining data is often seen as the key to high-performance\nlanguage models. However, progress in understanding pretraining data has been\nslow due to the costly pretraining runs required for data selection\nexperiments. We present a framework that avoids these costs and selects\nhigh-quality pretraining data without any LLM training of our own. Our work is\nbased on a simple observation: LLM losses on many pretraining texts are\ncorrelated with downstream benchmark performance, and selecting\nhigh-correlation documents is an effective pretraining data selection method.\nWe build a new statistical framework for data selection centered around\nestimates of perplexity-benchmark correlations and perform data selection using\na sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of\nthousands of web domains. In controlled pretraining experiments at the 160M\nparameter scale on 8 benchmarks, our approach outperforms DSIR on every\nbenchmark, while matching the best data selector found in DataComp-LM, a\nhand-engineered bigram classifier.","PeriodicalId":501340,"journal":{"name":"arXiv - STAT - Machine Learning","volume":"9 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - STAT - Machine Learning","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.05816","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier.
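
To make the abstract's core recipe concrete, the sketch below illustrates the idea of correlating per-domain LLM losses with benchmark scores and keeping the most predictive domains. It is an illustrative approximation rather than the paper's released estimator: the arrays `losses` and `bench` and the function `rank_domains_by_correlation` are hypothetical names, and a plain Spearman correlation stands in for the authors' more careful rank-based statistical framework.

```python
# Minimal sketch of perplexity-correlation-based domain ranking.
# Hypothetical inputs (not from the paper's released artifacts):
#   losses: (n_models, n_domains) array of per-domain log-losses for each LLM
#   bench:  (n_models,) array of downstream benchmark scores for the same LLMs
import numpy as np
from scipy.stats import spearmanr

def rank_domains_by_correlation(losses: np.ndarray, bench: np.ndarray) -> np.ndarray:
    """Return domain indices ordered from most to least promising.

    A domain is 'promising' when models that achieve lower loss on it
    tend to score higher on the benchmark, i.e. the (negative loss,
    benchmark) rank correlation is large and positive.
    """
    n_domains = losses.shape[1]
    corrs = np.empty(n_domains)
    for j in range(n_domains):
        corrs[j], _ = spearmanr(-losses[:, j], bench)
    return np.argsort(-corrs)  # highest correlation first

# Toy example: 90 models, 1000 web domains, random placeholder data.
rng = np.random.default_rng(0)
losses = rng.normal(size=(90, 1000))
bench = rng.normal(size=90)
top_domains = rank_domains_by_correlation(losses, bench)[:100]
```

In practice, documents from the top-ranked domains (subject to a token budget) would then form the selected pretraining pool.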