Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

IEEE Transactions on Pattern Analysis and Machine Intelligence (IF 18.6) | Pub Date: 2025-01-13 | DOI: 10.1109/TPAMI.2025.3528394

Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, Liqiang Nie

Vol. 47, No. 4, pp. 2912-2923. Journal Article. https://ieeexplore.ieee.org/document/10839067/


Abstract

Recently, video-language understanding has achieved great success through large-scale pre-training. However, data scarcity remains a prevailing challenge. This study quantitatively reveals an “impossible trinity” among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully refine the original annotations by leveraging useful information in multimodal video content (frames, tags, ASR transcripts, etc.). Nevertheless, they struggle to mitigate noise within synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control methods. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, resulting in a refined dataset. Then, we pre-train on it and fine-tune on human refinement examples for a stronger model. These processes are repeated for continuous improvement. For noise control, we present AdaTaiLr, a novel method that requires weaker assumptions on noise distribution. This method proves more effective in large datasets and offers theoretical guarantees. The combination of iterative refinement and AdaTaiLr can achieve better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
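The iterative refinement loop described in the abstract can be sketched roughly as follows. This is only an illustrative outline of the control flow, not the authors' implementation: the function name `run_flywheel`, the callables `annotate`, `pretrain`, and `finetune`, and the `rounds` parameter are all hypothetical placeholders. The AdaTaiLr noise-control method is not modeled here; in this sketch it would sit inside whatever `pretrain` does.

```python
from typing import Any, Callable, Dict, List, Tuple

# Hypothetical type: maps a video identifier to its current text annotation.
Annotations = Dict[str, str]


def run_flywheel(
    videos: List[str],
    annotations: Annotations,
    model: Any,
    annotate: Callable[[Any, str, str], str],
    pretrain: Callable[[Any, Annotations], Any],
    finetune: Callable[[Any, Any], Any],
    human_examples: Any,
    rounds: int = 3,
) -> Tuple[Any, Annotations]:
    """Repeat: re-annotate the dataset, pre-train on it, then fine-tune.

    All callables are assumptions supplied by the caller:
      annotate(model, video, previous_annotation) -> new annotation
      pretrain(model, annotations)                -> updated model
      finetune(model, human_examples)             -> updated model
    """
    for _ in range(rounds):
        # Step 1: the current model writes synthetic annotations,
        # producing a refined dataset.
        annotations = {
            video: annotate(model, video, annotations[video]) for video in videos
        }
        # Step 2: pre-train on the refined dataset (a noise-robust
        # objective would be applied here, but is not modeled).
        model = pretrain(model, annotations)
        # Step 3: fine-tune on human refinement examples for a stronger
        # annotator in the next round.
        model = finetune(model, human_examples)
    return model, annotations
```

Passing the steps in as callables keeps the sketch independent of any particular video-language model or training library; only the ordering of the three steps in each round is taken from the abstract.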


