Dataset Distillation: A Comprehensive Review

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 20.8; Zone 1, Computer Science; JCR Q1, Computer Science, Artificial Intelligence). Pub Date: 2023-01-17. DOI: 10.48550/arXiv.2301.07014

Ruonan Yu, Songhua Liu, Xinchao Wang


Citations: 26

Abstract


Recent success of deep learning is largely attributed to the sheer amount of data used for training deep neural networks. Despite the unprecedented success, the massive data, unfortunately, significantly increases the burden on storage and transmission and further gives rise to a cumbersome model training process. Besides, relying on the raw data for training per se yields concerns about privacy and copyright. To alleviate these shortcomings, dataset distillation (DD), also known as dataset condensation (DC), was introduced and has recently attracted much research attention in the community. Given an original dataset, DD aims to derive a much smaller dataset containing synthetic samples, based on which the trained models yield performance comparable with those trained on the original dataset. In this paper, we give a comprehensive review and summary of recent advances in DD and its application. We first introduce the task formally and propose an overall algorithmic framework followed by all existing DD methods. Next, we provide a systematic taxonomy of current methodologies in this area, and discuss their theoretical interconnections. We also present current challenges in DD through extensive empirical studies and envision possible directions for future work.
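
The goal stated in the abstract (a much smaller synthetic dataset on which trained models perform comparably to those trained on the original data) can be written down as an optimization problem. The block below is only a sketch of one common way to express it. All symbols (the datasets T and S, the model f, the loss, and the data distribution P) are notation assumed here for illustration and are not taken from the abstract, which defines none.

```latex
% Notation below is assumed for illustration; the abstract itself defines no symbols.
% T: original dataset, S: synthetic (distilled) dataset, with |S| much smaller than |T|.
\mathcal{T} = \{(x_i, y_i)\}_{i=1}^{|\mathcal{T}|}, \qquad
\mathcal{S} = \{(\tilde{x}_j, \tilde{y}_j)\}_{j=1}^{|\mathcal{S}|}, \qquad
|\mathcal{S}| \ll |\mathcal{T}|

% Parameters obtained by training the same model f on each dataset with loss \ell.
\theta^{\mathcal{D}} = \arg\min_{\theta} \sum_{(x, y) \in \mathcal{D}} \ell\big(f_{\theta}(x), y\big),
\qquad \mathcal{D} \in \{\mathcal{T}, \mathcal{S}\}

% Goal: comparable expected loss on unseen data drawn from the real distribution P.
\mathcal{S}^{*} = \arg\min_{\mathcal{S}} \;
\Big|\, \mathbb{E}_{(x, y) \sim P}\big[\ell\big(f_{\theta^{\mathcal{S}}}(x), y\big)\big]
      - \mathbb{E}_{(x, y) \sim P}\big[\ell\big(f_{\theta^{\mathcal{T}}}(x), y\big)\big] \Big|
```

Read this way, the synthetic samples in S are themselves the optimization variables, and "comparable performance" means that the gap between the two expected losses is small.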


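Source journal

IEEE Transactions on Pattern Analysis and Machine Intelligence (Engineering and Technology: Engineering, Electrical and Electronic)

CiteScore: 28.40

Self-citation rate: 3.00%

Articles published: 885

Review time: 8.5 months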

Journal description: The IEEE Transactions on Pattern Analysis and Machine Intelligence publishes articles on all traditional areas of computer vision and image understanding, all traditional areas of pattern analysis and recognition, and selected areas of machine intelligence, with a particular emphasis on machine learning for pattern analysis. Areas such as techniques for visual search, document and handwriting analysis, medical image analysis, video and image sequence analysis, content-based retrieval of image and video, face and gesture recognition, and relevant specialized hardware and/or software architectures are also covered.
