A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot

IF 1.3 3区计算机科学 Q4 COMPUTER SCIENCE, HARDWARE & ARCHITECTURE Journal of Computer Science and Technology Pub Date : 2024-07-22 DOI:10.1007/s11390-024-3767-3

Fei Du, Xin-Jian Ma, Jing-Ru Yang, Yi Liu, Chao-Ran Luo, Xue-Bin Wang, Hai-Ou Jiang, Xiang Jing

{"title":"A Survey of LLM Datasets: From Autoregressive Model to AI Chatbot","authors":"Fei Du, Xin-Jian Ma, Jing-Ru Yang, Yi Liu, Chao-Ran Luo, Xue-Bin Wang, Hai-Ou Jiang, Xiang Jing","doi":"10.1007/s11390-024-3767-3","DOIUrl":null,"url":null,"abstract":"<p>Since OpenAI opened access to ChatGPT, large language models (LLMs) become an increasingly popular topic attracting researchers’ attention from abundant domains. However, public researchers meet some problems when developing LLMs given that most of the LLMs are produced by industries and the training details are typically unrevealed. Since datasets are an important setup of LLMs, this paper does a holistic survey on the training datasets used in both the pre-train and fine-tune processes. The paper first summarizes 16 pre-train datasets and 16 fine-tune datasets used in the state-of-the-art LLMs. Secondly, based on the properties of the pre-train and fine-tune processes, it comments on pre-train datasets from quality, quantity, and relation with models, and comments on fine-tune datasets from quality, quantity, and concerns. This study then critically figures out the problems and research trends that exist in current LLM datasets. The study helps public researchers train and investigate LLMs by visual cases and provides useful comments to the research community regarding data development. To the best of our knowledge, this paper is the first to summarize and discuss datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to the LLM study by pointing out the existing problems of LLM studies from the perspective of data.</p>","PeriodicalId":50222,"journal":{"name":"Journal of Computer Science and Technology","volume":"16 1","pages":""},"PeriodicalIF":1.3000,"publicationDate":"2024-07-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"Journal of Computer Science and Technology","FirstCategoryId":"94","ListUrlMain":"https://doi.org/10.1007/s11390-024-3767-3","RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q4","JCRName":"COMPUTER SCIENCE, HARDWARE & ARCHITECTURE","Score":null,"Total":0}

引用次数: 0

Abstract

Since OpenAI opened access to ChatGPT, large language models (LLMs) become an increasingly popular topic attracting researchers’ attention from abundant domains. However, public researchers meet some problems when developing LLMs given that most of the LLMs are produced by industries and the training details are typically unrevealed. Since datasets are an important setup of LLMs, this paper does a holistic survey on the training datasets used in both the pre-train and fine-tune processes. The paper first summarizes 16 pre-train datasets and 16 fine-tune datasets used in the state-of-the-art LLMs. Secondly, based on the properties of the pre-train and fine-tune processes, it comments on pre-train datasets from quality, quantity, and relation with models, and comments on fine-tune datasets from quality, quantity, and concerns. This study then critically figures out the problems and research trends that exist in current LLM datasets. The study helps public researchers train and investigate LLMs by visual cases and provides useful comments to the research community regarding data development. To the best of our knowledge, this paper is the first to summarize and discuss datasets used in both autoregressive and chat LLMs. The survey offers insights and suggestions to researchers and LLM developers as they build their models, and contributes to the LLM study by pointing out the existing problems of LLM studies from the perspective of data.

查看原文

微信好友朋友圈 QQ好友复制链接

本刊更多论文

LLM 数据集调查：从自回归模型到人工智能聊天机器人

自 OpenAI 开放 ChatGPT 访问权限以来，大型语言模型（LLM）日益成为一个热门话题，吸引了众多领域研究人员的关注。然而，由于大多数 LLM 都是由企业生产的，而且训练细节通常不公开，因此公共研究人员在开发 LLM 时遇到了一些问题。由于数据集是 LLM 的重要设置，本文对预训练和微调过程中使用的训练数据集进行了全面调查。本文首先总结了最先进的 LLM 所使用的 16 个预训练数据集和 16 个微调数据集。其次，根据预训练和微调过程的特性，从质量、数量和与模型的关系等方面对预训练数据集进行了评述，并从质量、数量和关注点等方面对微调数据集进行了评述。然后，本研究批判性地指出了当前 LLM 数据集存在的问题和研究趋势。本研究通过可视化案例帮助公共研究人员训练和研究 LLM，并为研究界提供有关数据开发的有用意见。据我们所知，本文是第一篇总结和讨论自回归和聊天 LLM 数据集的文章。该调查为研究人员和 LLM 开发人员建立模型提供了见解和建议，并从数据的角度指出了 LLM 研究中存在的问题，为 LLM 研究做出了贡献。

本文章由计算机程序翻译，如有差异，请以英文原文为准。

求助全文

约1分钟内获得全文去求助

来源期刊

Journal of Computer Science and Technology 工程技术-计算机：软件工程

CiteScore

4.00

自引率

0.00%

发文量

2255

审稿时长

9.8 months

期刊介绍： Journal of Computer Science and Technology (JCST), the first English language journal in the computer field published in China, is an international forum for scientists and engineers involved in all aspects of computer science and technology to publish high quality and refereed papers. Papers reporting original research and innovative applications from all parts of the world are welcome. Papers for publication in the journal are selected through rigorous peer review, to ensure originality, timeliness, relevance, and readability. While the journal emphasizes the publication of previously unpublished materials, selected conference papers with exceptional merit that require wider exposure are, at the discretion of the editors, also published, provided they meet the journal''s peer review standards. The journal also seeks clearly written survey and review articles from experts in the field, to promote insightful understanding of the state-of-the-art and technology trends. Topics covered by Journal of Computer Science and Technology include but are not limited to: -Computer Architecture and Systems -Artificial Intelligence and Pattern Recognition -Computer Networks and Distributed Computing -Computer Graphics and Multimedia -Software Systems -Data Management and Data Mining -Theory and Algorithms -Emerging Areas