DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li
{"title":"DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training","authors":"Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li","doi":"arxiv-2409.09289","DOIUrl":null,"url":null,"abstract":"Analyzing real-world multimodal signals is an essential and challenging task\nfor intelligent voice assistants (IVAs). Mainstream approaches have achieved\nremarkable performance on various downstream tasks of IVAs with pre-trained\naudio models and text models. However, these models are pre-trained\nindependently and usually on tasks different from target domains, resulting in\nsub-optimal modality representations for downstream tasks. Moreover, in many\ndomains, collecting enough language-audio pairs is extremely hard, and\ntranscribing raw audio also requires high professional skills, making it\ndifficult or even infeasible to joint pre-training. To address these\npainpoints, we propose DSCLAP, a simple and effective framework that enables\nlanguage-audio pre-training with only raw audio signal input. Specifically,\nDSCLAP converts raw audio signals into text via an ASR system and combines a\ncontrastive learning objective and a language-audio matching objective to align\nthe audio and ASR transcriptions. We pre-train DSCLAP on 12,107 hours of\nin-vehicle domain audio. Empirical results on two downstream tasks show that\nwhile conceptually simple, DSCLAP significantly outperforms the baseline models\nin all metrics, showing great promise for domain-specific IVAs applications.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-09-14","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.09289","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
引用次数: 0

Abstract

Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently, and usually on tasks different from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires considerable professional skill, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signals as input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio and the ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms baseline models on all metrics, showing great promise for domain-specific IVA applications.
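
The abstract describes two objectives that are optimized jointly: a contrastive loss that aligns matched audio/ASR-transcription embeddings across a batch, and a language-audio matching loss that classifies whether an (audio, text) pair is genuine. The sketch below is a minimal illustration of how such a combination is commonly implemented; the encoder outputs, temperature, in-batch negatives, and equal loss weighting `lam` are assumptions for illustration and are not details taken from the paper.

```python
# Illustrative sketch only: the combination of a symmetric contrastive loss with a
# binary language-audio matching loss, as described in the abstract. Encoder
# architectures, temperature, and loss weighting are placeholder assumptions.
import torch
import torch.nn.functional as F

def contrastive_loss(audio_emb, text_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of (audio, ASR-transcript) embedding pairs."""
    audio_emb = F.normalize(audio_emb, dim=-1)
    text_emb = F.normalize(text_emb, dim=-1)
    logits = audio_emb @ text_emb.t() / temperature          # (B, B) similarity matrix
    targets = torch.arange(logits.size(0), device=logits.device)
    # Audio-to-text and text-to-audio cross-entropy, averaged.
    return 0.5 * (F.cross_entropy(logits, targets) + F.cross_entropy(logits.t(), targets))

def matching_loss(match_logits, match_labels):
    """Binary matching head: label 1 for true pairs, 0 for mismatched (e.g. in-batch) pairs."""
    return F.binary_cross_entropy_with_logits(match_logits, match_labels)

def total_loss(audio_emb, text_emb, match_logits, match_labels, lam=1.0):
    """Joint objective: contrastive alignment plus language-audio matching (equal weight assumed)."""
    return contrastive_loss(audio_emb, text_emb) + lam * matching_loss(match_logits, match_labels)
```

In practice, `audio_emb` and `text_emb` would come from an audio encoder and a text encoder applied to the raw audio and its ASR transcription respectively, while `match_logits` would come from a small head scoring fused audio-text representations; those components are not specified here.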