Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction

Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman
{"title":"Sim2Real Transfer for Audio-Visual Navigation with Frequency-Adaptive Acoustic Field Prediction","authors":"Changan Chen, Jordi Ramos, Anshul Tomar, Kristen Grauman","doi":"arxiv-2405.02821","DOIUrl":null,"url":null,"abstract":"Sim2real transfer has received increasing attention lately due to the success\nof learning robotic tasks in simulation end-to-end. While there has been a lot\nof progress in transferring vision-based navigation policies, the existing\nsim2real strategy for audio-visual navigation performs data augmentation\nempirically without measuring the acoustic gap. The sound differs from light in\nthat it spans across much wider frequencies and thus requires a different\nsolution for sim2real. We propose the first treatment of sim2real for\naudio-visual navigation by disentangling it into acoustic field prediction\n(AFP) and waypoint navigation. We first validate our design choice in the\nSoundSpaces simulator and show improvement on the Continuous AudioGoal\nnavigation benchmark. We then collect real-world data to measure the spectral\ndifference between the simulation and the real world by training AFP models\nthat only take a specific frequency subband as input. We further propose a\nfrequency-adaptive strategy that intelligently selects the best frequency band\nfor prediction based on both the measured spectral difference and the energy\ndistribution of the received audio, which improves the performance on the real\ndata. Lastly, we build a real robot platform and show that the transferred\npolicy can successfully navigate to sounding objects. This work demonstrates\nthe potential of building intelligent agents that can see, hear, and act\nentirely from simulation, and transferring them to the real world.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-05-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2405.02821","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Sim2real transfer has received increasing attention lately due to the success of learning robotic tasks end-to-end in simulation. While there has been much progress in transferring vision-based navigation policies, the existing sim2real strategy for audio-visual navigation performs data augmentation empirically without measuring the acoustic gap. Sound differs from light in that it spans a much wider range of frequencies, and thus requires a different solution for sim2real. We propose the first treatment of sim2real for audio-visual navigation by disentangling it into acoustic field prediction (AFP) and waypoint navigation. We first validate our design choice in the SoundSpaces simulator and show improvement on the Continuous AudioGoal navigation benchmark. We then collect real-world data to measure the spectral difference between simulation and the real world by training AFP models that take only a specific frequency subband as input. We further propose a frequency-adaptive strategy that intelligently selects the best frequency band for prediction based on both the measured spectral difference and the energy distribution of the received audio, which improves performance on real data. Lastly, we build a real robot platform and show that the transferred policy can successfully navigate to sounding objects. This work demonstrates the potential of building intelligent agents that can see, hear, and act entirely from simulation, and transferring them to the real world.
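To make the frequency-adaptive strategy concrete, here is a minimal, hypothetical NumPy sketch. It is not the authors' implementation: the function name select_band, the band parameterization, and the scoring rule (band energy discounted by the per-band sim2real gap measured offline) are all illustrative assumptions inferred from the abstract's description.

```python
import numpy as np

# Hypothetical sketch of the frequency-adaptive band selection described in the
# abstract. The intuition: an AFP model trained on a given subband is most
# trustworthy when (a) the received audio actually carries energy in that band
# and (b) the sim2real spectral gap measured offline for that band is small.
# The scoring rule below is an assumption, not the paper's actual criterion.

def select_band(spectrogram: np.ndarray,
                band_edges: list[tuple[int, int]],
                sim2real_gap: np.ndarray) -> int:
    """Pick the frequency band whose AFP model to use for this observation.

    spectrogram  -- (freq_bins, time) magnitude spectrogram of received audio
    band_edges   -- (lo, hi) frequency-bin ranges defining candidate subbands
    sim2real_gap -- per-band spectral difference between simulated and real
                    recordings, measured offline (lower = better transfer)
    """
    scores = []
    for i, (lo, hi) in enumerate(band_edges):
        energy = spectrogram[lo:hi].sum()                # energy in this band
        scores.append(energy / (1.0 + sim2real_gap[i]))  # discount by the gap
    return int(np.argmax(scores))

# Example: three subbands over a 256-bin spectrogram (all values illustrative).
bands = [(0, 64), (64, 160), (160, 256)]
gap = np.array([0.2, 0.9, 1.5])            # assumed offline gap measurements
spec = np.abs(np.random.randn(256, 100))   # stand-in for a received recording
best = select_band(spec, bands, gap)
```

At navigation time, the agent would feed only the selected subband to the corresponding AFP model and hand the predicted acoustic field to the waypoint navigator; the paper's actual selection criterion may combine the two factors differently.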