A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark

Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian
{"title":"从AV语音基准中提取纯度视频挑战的初步研究","authors":"Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian","doi":"10.1145/3517077.3517091","DOIUrl":null,"url":null,"abstract":"Recently reported deep audiovisual models have shown promising results on solving the cocktail party problem and are attracting new studies. Audiovisual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset[1], a popular dataset that was launched by the Google team, for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called purity video, from the dataset. A piece of purity video contains continuous image frames of the same person with a face within a time. A natural question is how we can extract purity videos, as many as possible, from the AVSpeech dataset. This paper presents the tools and methods we utilized, problems we encountered, and the purity video we obtained. Our main contributions are as follows: 1) We propose a solution to extract a derivation subset of the AVSpeech dataset that is of high quality and more than the existing training sets publicly available. 2) We implemented the above solution to perform experiments on the AVSpeech dataset and got insightful results; 3) We also evaluated our proposed solution on our manually labeled dataset called VTData. Experiments show that our solution is effective and robust. We hope this work can help the community in exploiting the AVSpeech dataset for other video understanding tasks.","PeriodicalId":233686,"journal":{"name":"2022 7th International Conference on Multimedia and Image Processing","volume":"23 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2022-01-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"A preliminary study of challenges in extracting purity videos from the AV Speech Benchmark\",\"authors\":\"Haoran Yan, Huijun Lu, Dunbo Cai, Tao Hang, Ling Qian\",\"doi\":\"10.1145/3517077.3517091\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently reported deep audiovisual models have shown promising results on solving the cocktail party problem and are attracting new studies. Audiovisual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset[1], a popular dataset that was launched by the Google team, for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called purity video, from the dataset. A piece of purity video contains continuous image frames of the same person with a face within a time. A natural question is how we can extract purity videos, as many as possible, from the AVSpeech dataset. This paper presents the tools and methods we utilized, problems we encountered, and the purity video we obtained. Our main contributions are as follows: 1) We propose a solution to extract a derivation subset of the AVSpeech dataset that is of high quality and more than the existing training sets publicly available. 2) We implemented the above solution to perform experiments on the AVSpeech dataset and got insightful results; 3) We also evaluated our proposed solution on our manually labeled dataset called VTData. Experiments show that our solution is effective and robust. 
We hope this work can help the community in exploiting the AVSpeech dataset for other video understanding tasks.\",\"PeriodicalId\":233686,\"journal\":{\"name\":\"2022 7th International Conference on Multimedia and Image Processing\",\"volume\":\"23 1\",\"pages\":\"0\"},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2022-01-14\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"2022 7th International Conference on Multimedia and Image Processing\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/10.1145/3517077.3517091\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"2022 7th International Conference on Multimedia and Image Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1145/3517077.3517091","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}

Abstract

Recently reported deep audio-visual models have shown promising results on the cocktail party problem and are attracting new studies. Audio-visual datasets are an important basis for these studies. Here we investigate the AVSpeech dataset [1], a popular dataset released by the Google team for training deep audio-visual models for multi-talker speech separation. Our goal is to derive a special kind of video, called a purity video, from the dataset. A purity video is a clip whose consecutive frames all show the face of the same single person for its entire duration. A natural question is how to extract as many purity videos as possible from the AVSpeech dataset. This paper presents the tools and methods we used, the problems we encountered, and the purity videos we obtained. Our main contributions are as follows: 1) we propose a solution for extracting a derived subset of the AVSpeech dataset that is of high quality and larger than the training sets currently available publicly; 2) we implemented this solution, ran experiments on the AVSpeech dataset, and obtained insightful results; 3) we also evaluated the solution on our manually labeled dataset, VTData. Experiments show that our solution is effective and robust. We hope this work helps the community exploit the AVSpeech dataset for other video-understanding tasks.
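
The abstract does not spell out the extraction pipeline, but the definition of a purity video (consecutive frames showing the face of the same single person) suggests a per-frame face-detection pass followed by a continuity check. Below is a minimal, hypothetical Python sketch of that idea. The OpenCV Haar-cascade detector, the IoU-based same-person heuristic, and the thresholds MIN_SEGMENT_FRAMES and IOU_THRESHOLD are all illustrative assumptions, not the authors' method; a real pipeline would more likely verify identity with face embeddings rather than bounding-box overlap.

```python
# Minimal sketch of purity-segment extraction, assuming a per-frame
# face-detection pass plus a cheap continuity heuristic. The detector
# choice and the IoU proxy for "same person" are our assumptions,
# not the method described in the paper.
import cv2

MIN_SEGMENT_FRAMES = 25   # hypothetical threshold: about 1 s at 25 fps
IOU_THRESHOLD = 0.5       # hypothetical box-overlap continuity threshold

def iou(a, b):
    """Intersection-over-union of two (x, y, w, h) boxes."""
    ax1, ay1, ax2, ay2 = a[0], a[1], a[0] + a[2], a[1] + a[3]
    bx1, by1, bx2, by2 = b[0], b[1], b[0] + b[2], b[1] + b[3]
    ix = max(0, min(ax2, bx2) - max(ax1, bx1))
    iy = max(0, min(ay2, by2) - max(ay1, by1))
    inter = ix * iy
    union = a[2] * a[3] + b[2] * b[3] - inter
    return inter / union if union > 0 else 0.0

def purity_segments(video_path):
    """Yield (start_frame, end_frame) spans in which exactly one face is
    detected and its box overlaps the previous frame's box -- a cheap
    proxy for "same person throughout", not a true identity check."""
    detector = cv2.CascadeClassifier(
        cv2.data.haarcascades + "haarcascade_frontalface_default.xml")
    cap = cv2.VideoCapture(video_path)
    start, prev_box, frame_idx = None, None, 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
        faces = detector.detectMultiScale(gray, 1.1, 5)
        one_same_face = (
            len(faces) == 1
            and (prev_box is None or iou(faces[0], prev_box) >= IOU_THRESHOLD))
        if one_same_face:
            if start is None:
                start = frame_idx
            prev_box = faces[0]
        else:
            # Continuity broken: close any open segment that is long enough.
            if start is not None and frame_idx - start >= MIN_SEGMENT_FRAMES:
                yield (start, frame_idx - 1)
            start, prev_box = None, None
        frame_idx += 1
    if start is not None and frame_idx - start >= MIN_SEGMENT_FRAMES:
        yield (start, frame_idx - 1)
    cap.release()
```

Calling purity_segments("clip.mp4") would yield frame spans such as (0, 311), each of which could then be cut out as a candidate purity video; the box-overlap tracking stands in for the stricter face-identity verification a production pipeline would need.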