{"title":"Mutual Learning for Acoustic Matching and Dereverberation via Visual Scene-driven Diffusion","authors":"Jian Ma, Wenguan Wang, Yi Yang, Feng Zheng","doi":"arxiv-2407.10373","DOIUrl":null,"url":null,"abstract":"Visual acoustic matching (VAM) is pivotal for enhancing the immersive\nexperience, and the task of dereverberation is effective in improving audio\nintelligibility. Existing methods treat each task independently, overlooking\nthe inherent reciprocity between them. Moreover, these methods depend on paired\ntraining data, which is challenging to acquire, impeding the utilization of\nextensive unpaired data. In this paper, we introduce MVSD, a mutual learning\nframework based on diffusion models. MVSD considers the two tasks\nsymmetrically, exploiting the reciprocal relationship to facilitate learning\nfrom inverse tasks and overcome data scarcity. Furthermore, we employ the\ndiffusion model as foundational conditional converters to circumvent the\ntraining instability and over-smoothing drawbacks of conventional GAN\narchitectures. Specifically, MVSD employs two converters: one for VAM called\nreverberator and one for dereverberation called dereverberator. The\ndereverberator judges whether the reverberation audio generated by reverberator\nsounds like being in the conditional visual scenario, and vice versa. By\nforming a closed loop, these two converters can generate informative feedback\nsignals to optimize the inverse tasks, even with easily acquired one-way\nunpaired data. Extensive experiments on two standard benchmarks, i.e.,\nSoundSpaces-Speech and Acoustic AVSpeech, exhibit that our framework can\nimprove the performance of the reverberator and dereverberator and better match\nspecified visual scenarios.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.10373","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Abstract
Visual acoustic matching (VAM) is pivotal for enhancing immersive experiences, while dereverberation is effective in improving audio intelligibility. Existing methods treat each task independently, overlooking the inherent reciprocity between them. Moreover, these methods depend on paired training data, which is challenging to acquire, preventing the use of extensive unpaired data. In this paper, we introduce MVSD, a mutual learning framework based on diffusion models. MVSD considers the two tasks symmetrically, exploiting their reciprocal relationship to facilitate learning from the inverse task and to overcome data scarcity. Furthermore, we employ diffusion models as the underlying conditional converters, circumventing the training instability and over-smoothing drawbacks of conventional GAN architectures. Specifically, MVSD employs two converters: one for VAM, called the reverberator, and one for dereverberation, called the dereverberator. The dereverberator judges whether the reverberant audio generated by the reverberator sounds as if it were recorded in the conditioning visual scene, and vice versa. By forming a closed loop, the two converters generate informative feedback signals to optimize the inverse tasks, even with easily acquired one-way unpaired data. Extensive experiments on two standard benchmarks, SoundSpaces-Speech and Acoustic AVSpeech, demonstrate that our framework improves the performance of both the reverberator and the dereverberator and better matches the specified visual scenes.
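
The closed loop described above pairs two converters so that each is supervised by how well the other can undo its output. The sketch below is a minimal, hypothetical illustration of that loop under several assumptions not taken from the paper: plain MLPs stand in for the conditional diffusion converters, audio clips are treated as fixed-length feature vectors, and the "feedback signal" from the inverse task is approximated by a simple cycle-reconstruction (L1) loss. All module and function names (`Reverberator`, `Dereverberator`, `mutual_learning_step`) are illustrative, not the authors' implementation.

```python
# Hedged sketch of the closed-loop mutual-learning idea (not MVSD itself).
# Assumptions: MLP stand-ins for the conditional diffusion converters,
# fixed-length audio feature vectors, and an L1 cycle loss as the feedback
# signal from the inverse task.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Reverberator(nn.Module):
    """VAM direction: anechoic audio + visual scene embedding -> reverberant audio."""
    def __init__(self, audio_dim=128, scene_dim=64):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim + scene_dim, 256), nn.ReLU(), nn.Linear(256, audio_dim)
        )

    def forward(self, anechoic, scene):
        return self.net(torch.cat([anechoic, scene], dim=-1))


class Dereverberator(nn.Module):
    """Inverse direction: reverberant audio -> anechoic audio."""
    def __init__(self, audio_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(audio_dim, 256), nn.ReLU(), nn.Linear(256, audio_dim)
        )

    def forward(self, reverberant):
        return self.net(reverberant)


def mutual_learning_step(reverberator, dereverberator, opt, anechoic, reverberant, scene):
    """One step on unpaired data: each converter is judged by how well the
    other converter can invert its output (the closed loop)."""
    # Forward cycle: anechoic -> reverberant (conditioned on scene) -> anechoic.
    fake_reverb = reverberator(anechoic, scene)
    loss_fwd = F.l1_loss(dereverberator(fake_reverb), anechoic)

    # Backward cycle: reverberant -> anechoic -> reverberant. In practice the
    # scene embedding would come from the video paired with the reverberant clip.
    fake_anechoic = dereverberator(reverberant)
    loss_bwd = F.l1_loss(reverberator(fake_anechoic, scene), reverberant)

    loss = loss_fwd + loss_bwd
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()


if __name__ == "__main__":
    rev, derev = Reverberator(), Dereverberator()
    opt = torch.optim.Adam(list(rev.parameters()) + list(derev.parameters()), lr=1e-4)
    # Toy unpaired batches: anechoic clips, reverberant clips, scene embeddings.
    anechoic, reverberant, scene = torch.randn(8, 128), torch.randn(8, 128), torch.randn(8, 64)
    print(mutual_learning_step(rev, derev, opt, anechoic, reverberant, scene))
```

This only shows why one-way unpaired data suffices for a training signal; the actual feedback mechanism, converter architecture, and diffusion-based conditioning in MVSD are described in the paper itself.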