A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings

ArXiv Pub Date : 2024-02-15 DOI:10.48550/arXiv.2402.09797
Hyewon Han, Naveen Kumar
{"title":"A cross-talk robust multichannel VAD model for multiparty agent interactions trained using synthetic re-recordings","authors":"Hyewon Han, Naveen Kumar","doi":"10.48550/arXiv.2402.09797","DOIUrl":null,"url":null,"abstract":"In this work, we propose a novel cross-talk rejection framework for a multi-channel multi-talker setup for a live multiparty interactive show. Our far-field audio setup is required to be hands-free during live interaction and comprises four adjacent talkers with directional microphones in the same space. Such setups often introduce heavy cross-talk between channels, resulting in reduced automatic speech recognition (ASR) and natural language understanding (NLU) performance. To address this problem, we propose voice activity detection (VAD) model for all talkers using multichannel information, which is then used to filter audio for downstream tasks. We adopt a synthetic training data generation approach through playback and re-recording for such scenarios, simulating challenging speech overlap conditions. We train our models on this synthetic data and demonstrate that our approach outperforms single-channel VAD models and energy-based multi-channel VAD algorithm in various acoustic environments. In addition to VAD results, we also present multiparty ASR evaluation results to highlight the impact of using our VAD model for filtering audio in downstream tasks by significantly reducing the insertion error.","PeriodicalId":8425,"journal":{"name":"ArXiv","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-02-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ArXiv","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.48550/arXiv.2402.09797","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

In this work, we propose a novel cross-talk rejection framework for a multi-channel multi-talker setup for a live multiparty interactive show. Our far-field audio setup is required to be hands-free during live interaction and comprises four adjacent talkers with directional microphones in the same space. Such setups often introduce heavy cross-talk between channels, resulting in reduced automatic speech recognition (ASR) and natural language understanding (NLU) performance. To address this problem, we propose voice activity detection (VAD) model for all talkers using multichannel information, which is then used to filter audio for downstream tasks. We adopt a synthetic training data generation approach through playback and re-recording for such scenarios, simulating challenging speech overlap conditions. We train our models on this synthetic data and demonstrate that our approach outperforms single-channel VAD models and energy-based multi-channel VAD algorithm in various acoustic environments. In addition to VAD results, we also present multiparty ASR evaluation results to highlight the impact of using our VAD model for filtering audio in downstream tasks by significantly reducing the insertion error.
查看原文
分享 分享
微信好友 朋友圈 QQ好友 复制链接
本刊更多论文
利用合成再记录训练多方代理互动的交叉稳健多通道 VAD 模型
在这项工作中,我们为现场多方互动节目的多通道多谈话者设置提出了一种新颖的串扰抑制框架。我们的远场音频设置要求在现场互动时免提,由四个相邻的谈话者在同一空间内使用定向麦克风组成。这种设置通常会在声道之间产生严重的串扰,从而降低自动语音识别(ASR)和自然语言理解(NLU)的性能。为解决这一问题,我们提出了利用多通道信息对所有说话者进行语音活动检测(VAD)的模型,然后利用该模型为下游任务过滤音频。我们采用一种合成训练数据生成方法,通过回放和重新录制此类场景,模拟具有挑战性的语音重叠条件。我们在这些合成数据上训练我们的模型,并证明我们的方法在各种声学环境中优于单通道 VAD 模型和基于能量的多通道 VAD 算法。除了 VAD 结果外,我们还展示了多方 ASR 评估结果,以强调在下游任务中使用我们的 VAD 模型过滤音频的影响,即显著减少插入误差。
本文章由计算机程序翻译,如有差异,请以英文原文为准。
求助全文
约1分钟内获得全文 去求助
来源期刊
自引率
0.00%
发文量
0
期刊最新文献
Learning temporal relationships between symbols with Laplace Neural Manifolds. Probabilistic Genotype-Phenotype Maps Reveal Mutational Robustness of RNA Folding, Spin Glasses, and Quantum Circuits. Reliability of energy landscape analysis of resting-state functional MRI data. The Dynamic Sensorium competition for predicting large-scale mouse visual cortex activity from videos. LinearAlifold: Linear-Time Consensus Structure Prediction for RNA Alignments.
×
引用
GB/T 7714-2015
复制
MLA
复制
APA
复制
导出至
BibTeX EndNote RefMan NoteFirst NoteExpress
×
×
提示
您的信息不完整,为了账户安全,请先补充。
现在去补充
×
提示
您因"违规操作"
具体请查看互助需知
我知道了
×
提示
现在去查看 取消
×
提示
确定
0
微信
客服QQ
Book学术公众号 扫码关注我们
反馈
×
意见反馈
请填写您的意见或建议
请填写您的手机或邮箱
已复制链接
已复制链接
快去分享给好友吧!
我知道了
×
扫码分享
扫码分享
Book学术官方微信
Book学术文献互助
Book学术文献互助群
群 号:481959085
Book学术
文献互助 智能选刊 最新文献 互助须知 联系我们:info@booksci.cn
Book学术提供免费学术资源搜索服务,方便国内外学者检索中英文文献。致力于提供最便捷和优质的服务体验。
Copyright © 2023 Book学术 All rights reserved.
ghs 京公网安备 11010802042870号 京ICP备2023020795号-1