A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark

Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian
{"title":"从AV语音基准中提取纯度视频挑战的初步研究","authors":"Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian","doi":"10.1145/3517077.3517091","DOIUrl":null,"url":null,"abstract":"Recently reported deep audiovisual models have shown promising results on solving the cocktail party problem and are attracting new studies. Audiovisual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset[1], a popular dataset that was launched by the Google team, for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called purity video, from the dataset. A piece of purity video contains continuous image frames of the same person with a face within a time. A natural question is how we can extract purity videos, as many as possible, from the AVSpeech dataset. This paper presents the tools and methods we utilized, problems we encountered, and the purity video we obtained. Our main contributions are as follows: 1) We propose a solution to extract a derivation subset of the AVSpeech dataset that is of high quality and more than the existing training sets publicly available. 2) We implemented the above solution to perform experiments on the AVSpeech dataset and got insightful results; 3) We also evaluated our proposed solution on our manually labeled dataset called VTData. Experiments show that our solution is effective and robust. We hope this work can help the community in exploiting the AVSpeech dataset for other video understanding tasks.","PeriodicalId":233686,"journal":{"name":"2022 7th International Conference on Multimedia and Image Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark\",\"authors\":\"Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian\",\"doi\":\"10.1145/3517077.3517091\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently reported deep audiovisual models have shown promising results on solving the cocktail party problem and are attracting new studies. Audiovisual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset[1], a popular dataset that was launched by the Google team, for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called purity video, from the dataset. A piece of purity video contains continuous image frames of the same person with a face within a time. A natural question is how we can extract purity videos, as many as possible, from the AVSpeech dataset. This paper presents the tools and methods we utilized, problems we encountered, and the purity video we obtained. Our main contributions are as follows: 1) We propose a solution to extract a derivation subset of the AVSpeech dataset that is of high quality and more than the existing training sets publicly available. 2) We implemented the above solution to perform experiments on the AVSpeech dataset and got insightful results; 3) We also evaluated our proposed solution on our manually labeled dataset called VTData. Experiments show that our solution is effective and robust. 
We hope this work can help the community in exploiting the AVSpeech dataset for other video understanding tasks.\",\"PeriodicalId\":233686,\"journal\":{\"name\":\"2022 7th International Conference on Multimedia and Image Processing\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 7th International Conference on Multimedia and Image Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3517077.3517091\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Multimedia and Image Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3517077.3517091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Recently reported deep audio-visual models have shown promising results on the cocktail party problem and are attracting new studies. Audio-visual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset [1], a popular dataset released by the Google team for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called a purity video, from the dataset. A purity video is a clip whose consecutive frames all show the face of the same single person for its entire duration. A natural question is how to extract as many purity videos as possible from the AVSpeech dataset. This paper presents the tools and methods we used, the problems we encountered, and the purity videos we obtained. Our main contributions are as follows: 1) we propose a solution for extracting a derived subset of the AVSpeech dataset that is of high quality and larger than the training sets currently available publicly; 2) we implemented this solution, ran experiments on the AVSpeech dataset, and obtained insightful results; 3) we also evaluated the solution on our manually labeled dataset, VTData. Experiments show that our solution is effective and robust. We hope this work helps the community exploit the AVSpeech dataset for other video-understanding tasks.
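
The abstract does not spell out the extraction pipeline, but the definition of a purity video (consecutive frames showing the face of the same single person) suggests a per-frame face-detection pass followed by a continuity check. Below is a minimal, hypothetical Python sketch of that idea. The OpenCV Haar-cascade detector, the IoU-based same-person heuristic, and the thresholds MIN_SEGMENT_FRAMES and IOU_THRESHOLD are all illustrative assumptions, not the authors' method; a real pipeline would more likely verify identity with face embeddings rather than bounding-box overlap.

```python
# Minimal sketch of purity-segment extraction, assuming a per-frame
# face-detection pass plus a cheap continuity heuristic. The detector
# choice and the IoU proxy for "same person" are our assumptions,
# not the method described in the paper.
import cv2

MIN_SEGMENT_FRAMES = 25   # hypothetical threshold: about 1 s at 25 fps
IOU_THRESHOLD = 0.5       # hypothetical box-overlap continuity threshold

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def purity_segments(video_path):
    """Yield (start_frame, end_frame) spans in which exactly one face is
    detected and its box overlaps the previous frame's box -- a cheap
    proxy for "same person throughout", not a true identity check."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    start, prev_box, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        one_same_face = (
            len(faces) == 1
            and (prev_box is None or iou(faces[0], prev_box) >= IOU_THRESHOLD))
        if one_same_face:
            if start is None:
                start = frame_idx
            prev_box = faces[0]
        else:
            # Continuity broken: close any open segment that is long enough.
            if start is not None and frame_idx - start >= MIN_SEGMENT_FRAMES:
                yield (start, frame_idx - 1)
            start, prev_box = None, None
        frame_idx += 1
    if start is not None and frame_idx - start >= MIN_SEGMENT_FRAMES:
        yield (start, frame_idx - 1)
    cap.release()
```

Calling purity_segments("clip.mp4") would yield frame spans such as (0, 311), each of which could then be cut out as a candidate purity video; the box-overlap tracking stands in for the stricter face-identity verification a production pipeline would need.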