Simon Yu, Liangyu Chen, Sara Ahmadian, Marzieh Fadaee
arXiv - CS - Computation and Language, published 2024-09-17, doi: arxiv-2409.11378
Citations: 0
Abstract
Diversify and Conquer: Diversity-Centric Data Selection with Iterative Refinement
Fine-tuning large language models on instruction data is crucial for enhancing
pre-trained knowledge and improving instruction-following capabilities. As
instruction datasets proliferate, selecting optimal data for effective training
becomes increasingly important. This work addresses the question: How can we
determine the optimal subset of data for effective training? While existing
research often emphasizes local criteria like instance quality for subset
selection, we argue that a global approach focused on data diversity is more
critical. Our method employs k-means clustering to ensure the selected subset
effectively represents the full dataset. We propose an iterative refinement
method inspired by active learning techniques to resample instances from
clusters, reassessing each cluster's importance and sampling weight in every
training iteration. This approach reduces the effect of outliers and
automatically filters out clusters containing low-quality data. Through
extensive evaluation across natural language reasoning, general world
knowledge, code and math reasoning tasks, and by fine-tuning models from
various families, we observe consistent improvements, achieving a 7% increase
over random selection and a 3.8% improvement over state-of-the-art sampling
methods. Our work highlights the significance of diversity-first sampling when
fine-tuning LLMs to enhance performance across a broad array of evaluation
tasks. Our code is available at
https://github.com/for-ai/iterative-data-selection.
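
The two ideas described above, cluster-based selection and iterative reweighting of clusters, can be sketched roughly in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation (the linked repository is the reference for that): the toy 2-D points stand in for instruction embeddings, `score` is a hypothetical per-instance quality signal, and the multiplicative weight update in `reweight` is an assumed rule chosen for simplicity.

```python
import random
from collections import defaultdict


def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means on lists of floats; returns one cluster id per point."""
    rng = random.Random(seed)
    centroids = [list(p) for p in rng.sample(points, k)]
    assign = [0] * len(points)
    for _ in range(iters):
        # Assignment step: nearest centroid by squared Euclidean distance.
        for i, p in enumerate(points):
            assign[i] = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [points[i] for i in range(len(points)) if assign[i] == c]
            if members:
                centroids[c] = [sum(col) / len(members) for col in zip(*members)]
    return assign


def sample_by_cluster(assign, weights, budget, rng, exclude=()):
    """Draw up to `budget` unused indices: pick a cluster by weight, then a random member."""
    pools = defaultdict(list)
    for i, c in enumerate(assign):
        if i not in exclude:
            pools[c].append(i)
    for pool in pools.values():
        rng.shuffle(pool)
    chosen = []
    while len(chosen) < budget:
        live = [c for c in pools if pools[c] and weights[c] > 0]
        if not live:
            break
        c = rng.choices(live, weights=[weights[c] for c in live])[0]
        chosen.append(pools[c].pop())
    return chosen


def reweight(weights, assign, chosen, score):
    """Assumed update rule: scale each sampled cluster's weight by the mean
    score of its newly drawn members, then renormalise to sum to 1."""
    per_cluster = defaultdict(list)
    for i in chosen:
        per_cluster[assign[i]].append(score(i))
    new = dict(weights)
    for c, s in per_cluster.items():
        new[c] = weights[c] * (sum(s) / len(s))
    total = sum(new.values())
    if total <= 0:
        return dict(weights)  # keep the old weights if every score was zero
    return {c: w / total for c, w in new.items()}


rng = random.Random(0)
# Toy "embeddings": three groups of 30 points; the last group (indices 60-89)
# is treated as low quality by the hypothetical score function below.
points = [
    [rng.gauss(cx, 0.3), rng.gauss(cy, 0.3)]
    for cx, cy in [(0, 0), (5, 5), (10, 0)]
    for _ in range(30)
]
k = 3
assign = kmeans(points, k)
weights = {c: 1 / k for c in range(k)}


def score(i):
    # Hypothetical quality signal; a real setup would score model outputs.
    return 0.1 if i >= 60 else 1.0


selected = set()
for _ in range(3):  # three refinement rounds, 10 new instances per round
    batch = sample_by_cluster(assign, weights, budget=10, rng=rng, exclude=selected)
    selected.update(batch)
    weights = reweight(weights, assign, batch, score)

print(sorted(selected))
print(weights)
```

In this sketch, clusters whose sampled members score poorly lose sampling weight in later rounds, which loosely mirrors the idea of down-weighting clusters that contain low-quality data. An actual pipeline would fine-tune the model between rounds and derive the scores from its outputs.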