Video DataFlywheel: Resolving the Impossible Data Trinity in Video-Language Understanding

Xiao Wang, Jianlong Wu, Zijia Lin, Fuzheng Zhang, Di Zhang, Liqiang Nie
IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 47, no. 4, pp. 2912-2923
DOI: 10.1109/TPAMI.2025.3528394
Published: 2025-01-13
URL: https://ieeexplore.ieee.org/document/10839067/

Abstract

Recently, video-language understanding has achieved great success through large-scale pre-training; however, data scarcity remains a prevailing challenge. This study quantitatively reveals an "impossible trinity" among data quantity, diversity, and quality in pre-training datasets. Recent efforts seek to refine large-scale, diverse ASR datasets compromised by low quality through synthetic annotations. These methods successfully refine the original annotations by leveraging useful information in multimodal video content (frames, tags, ASR transcripts, etc.). Nevertheless, they struggle to mitigate noise within the synthetic annotations and lack scalability as the dataset size expands. To address these issues, we introduce the Video DataFlywheel framework, which iteratively refines video annotations with improved noise control. For iterative refinement, we first leverage a video-language model to generate synthetic annotations, yielding a refined dataset; we then pre-train on it and fine-tune on human refinement examples to obtain a stronger model, repeating these steps for continuous improvement. For noise control, we present AdaTaiLr, a novel method that requires weaker assumptions on the noise distribution, proves more effective on large datasets, and offers theoretical guarantees. Combining iterative refinement with AdaTaiLr achieves better scalability in video-language understanding. Extensive experiments show that our framework outperforms existing data refinement baselines, delivering a 3% performance boost and improving dataset quality with minimal diversity loss. Furthermore, our refined dataset facilitates significant improvements in various video-language understanding tasks, including video question answering and text-video retrieval.
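The iterative refinement loop described above (generate synthetic annotations with the current model, pre-train on the refined dataset, fine-tune on human refinement examples, repeat) can be sketched as follows. This is a minimal illustration only: every function name and signature here is a hypothetical placeholder, not the authors' actual training pipeline.

```python
# Illustrative sketch of the data-flywheel iteration from the abstract.
# `annotate`, `pretrain`, and `finetune` are hypothetical callables standing
# in for the unspecified model components.

def data_flywheel(annotate, pretrain, finetune, model, videos, human_refined, rounds=3):
    """Iteratively refine annotations, then retrain on the refined data."""
    for _ in range(rounds):
        # 1. The current model generates synthetic annotations,
        #    yielding a refined dataset.
        refined = [(video, annotate(model, video)) for video in videos]
        # 2. Pre-train on the refined dataset ...
        model = pretrain(model, refined)
        # 3. ... then fine-tune on human refinement examples; the stronger
        #    model drives the next round of annotation.
        model = finetune(model, human_refined)
    return model
```

Each round's output model becomes the annotator for the next round, which is what makes the process a "flywheel" rather than a one-shot refinement.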
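The abstract characterizes AdaTaiLr only as a noise-control method with weaker assumptions on the noise distribution. As a rough illustration of the general idea behind TaiLr-style objectives (down-weighting tokens the model itself finds implausible, which are more likely to be annotation noise), here is a hypothetical sketch; the weighting form and the `gamma` parameter are illustrative assumptions, not the paper's definition of AdaTaiLr.

```python
import math

def tailr_style_loss(token_probs, gamma=0.1):
    """Probability-weighted NLL in the spirit of TaiLr (illustrative sketch).

    Each token's negative log-likelihood is scaled by
    w = p / (gamma + (1 - gamma) * p), so tokens to which the model assigns
    low probability (likely noisy labels) contribute less to the loss.
    """
    total = 0.0
    for p in token_probs:
        weight = p / (gamma + (1.0 - gamma) * p)
        total += -weight * math.log(p)
    return total / len(token_probs)
```

With `gamma = 0` the weight collapses to 1 and the loss reduces to plain NLL; larger `gamma` suppresses low-probability (noisy) tokens more aggressively.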