{"title":"Efficacy of Synthetic Data as a Benchmark","authors":"Gaurav Maheshwari, Dmitry Ivanov, Kevin El Haddad","doi":"arxiv-2409.11968","DOIUrl":null,"url":null,"abstract":"Large language models (LLMs) have enabled a range of applications in\nzero-shot and few-shot learning settings, including the generation of synthetic\ndatasets for training and testing. However, to reliably use these synthetic\ndatasets, it is essential to understand how representative they are of\nreal-world data. We investigate this by assessing the effectiveness of\ngenerating synthetic data through LLM and using it as a benchmark for various\nNLP tasks. Our experiments across six datasets, and three different tasks, show\nthat while synthetic data can effectively capture performance of various\nmethods for simpler tasks, such as intent classification, it falls short for\nmore complex tasks like named entity recognition. Additionally, we propose a\nnew metric called the bias factor, which evaluates the biases introduced when\nthe same LLM is used to both generate benchmarking data and to perform the\ntasks. We find that smaller LLMs exhibit biases towards their own generated\ndata, whereas larger models do not. Overall, our findings suggest that the\neffectiveness of synthetic data as a benchmark varies depending on the task,\nand that practitioners should rely on data generated from multiple larger\nmodels whenever possible.","PeriodicalId":501030,"journal":{"name":"arXiv - CS - Computation and Language","volume":"18 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-18","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Computation and Language","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.11968","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Large language models (LLMs) have enabled a range of applications in zero-shot and few-shot learning settings, including the generation of synthetic datasets for training and testing. However, to reliably use these synthetic datasets, it is essential to understand how representative they are of real-world data. We investigate this by assessing the effectiveness of generating synthetic data through an LLM and using it as a benchmark for various NLP tasks. Our experiments across six datasets and three different tasks show that while synthetic data can effectively capture the performance of various methods on simpler tasks, such as intent classification, it falls short for more complex tasks like named entity recognition. Additionally, we propose a new metric called the bias factor, which evaluates the biases introduced when the same LLM is used both to generate benchmarking data and to perform the tasks. We find that smaller LLMs exhibit biases towards their own generated data, whereas larger models do not. Overall, our findings suggest that the effectiveness of synthetic data as a benchmark varies depending on the task, and that practitioners should rely on data generated by multiple larger models whenever possible.
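The abstract does not give the paper's exact definition of the bias factor. The minimal sketch below is one hypothetical way such a metric could be computed: an evaluated model's score on the benchmark generated by the same LLM minus its mean score on benchmarks generated by other LLMs, so a positive value would indicate a bias toward self-generated data. The function name, the `scores` mapping, and the example numbers are all illustrative assumptions, not the paper's method.

```python
# Hypothetical sketch of a "bias factor"-style computation (not the paper's formula).
# Assumption: bias is measured as the gap between a model's score on its own
# generator's synthetic benchmark and its average score on other generators' benchmarks.

from statistics import mean


def bias_factor(scores: dict[str, float], self_generator: str) -> float:
    """scores maps generator-LLM name -> the evaluated model's score on that
    generator's synthetic benchmark; self_generator is the generator that is
    also the evaluated model."""
    own = scores[self_generator]
    others = [s for gen, s in scores.items() if gen != self_generator]
    return own - mean(others)


# Example with made-up numbers: a smaller model scoring higher on its own data.
scores = {"small-llm": 0.91, "llm-b": 0.84, "llm-c": 0.83}
print(bias_factor(scores, "small-llm"))  # 0.075 -> suggests self-bias
```

Under this reading, a bias factor near zero for larger models and clearly positive for smaller ones would match the abstract's finding that smaller LLMs favor their own generated data.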