DSCLAP: Domain-Specific Contrastive Language-Audio Pre-Training

Shengqiang Liu, Da Liu, Anna Wang, Zhiyu Zhang, Jie Gao, Yali Li

arXiv - CS - Sound, arXiv:2409.09289, published 2024-09-14
Abstract
Analyzing real-world multimodal signals is an essential and challenging task for intelligent voice assistants (IVAs). Mainstream approaches have achieved remarkable performance on various downstream IVA tasks with pre-trained audio models and text models. However, these models are pre-trained independently, and usually on tasks different from the target domains, resulting in sub-optimal modality representations for downstream tasks. Moreover, in many domains, collecting enough language-audio pairs is extremely hard, and transcribing raw audio requires high professional skill, making joint pre-training difficult or even infeasible. To address these pain points, we propose DSCLAP, a simple and effective framework that enables language-audio pre-training with only raw audio signal input. Specifically, DSCLAP converts raw audio signals into text via an ASR system and combines a contrastive learning objective with a language-audio matching objective to align the audio with its ASR transcriptions. We pre-train DSCLAP on 12,107 hours of in-vehicle domain audio. Empirical results on two downstream tasks show that, while conceptually simple, DSCLAP significantly outperforms the baseline models in all metrics, showing great promise for domain-specific IVA applications.
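
To make the two training objectives in the abstract concrete, the sketch below pairs a CLIP/CLAP-style contrastive (InfoNCE) loss between audio embeddings and ASR-transcript embeddings with a binary language-audio matching loss over true versus shuffled pairs. This is a minimal illustration, not the paper's implementation: the encoder choices, embedding dimension, fusion head, temperature initialization, and negative sampling are all assumptions introduced here.

# Hypothetical sketch of a DSCLAP-style objective: contrastive alignment plus
# language-audio matching. Architectures and hyperparameters are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DSCLAPSketch(nn.Module):
    def __init__(self, audio_encoder: nn.Module, text_encoder: nn.Module, dim: int = 512):
        super().__init__()
        self.audio_encoder = audio_encoder      # e.g. an encoder over log-mel features (assumed)
        self.text_encoder = text_encoder        # e.g. a BERT-like encoder over ASR transcripts (assumed)
        self.logit_scale = nn.Parameter(torch.tensor(0.07).log().neg())  # learnable temperature
        self.match_head = nn.Linear(2 * dim, 2)  # classifies matched vs. mismatched pairs

    def forward(self, audio, text):
        a = F.normalize(self.audio_encoder(audio), dim=-1)  # (B, dim)
        t = F.normalize(self.text_encoder(text), dim=-1)    # (B, dim)

        # Contrastive (InfoNCE) objective: each audio clip and its own ASR transcript
        # are a positive pair; all other in-batch combinations act as negatives.
        logits = self.logit_scale.exp() * a @ t.t()          # (B, B) similarity matrix
        labels = torch.arange(a.size(0), device=a.device)
        loss_contrastive = 0.5 * (F.cross_entropy(logits, labels) +
                                  F.cross_entropy(logits.t(), labels))

        # Language-audio matching objective: a binary classifier over fused embeddings
        # of true pairs (label 1) and randomly shuffled, mismatched pairs (label 0).
        neg_t = t[torch.randperm(t.size(0), device=t.device)]
        pos = self.match_head(torch.cat([a, t], dim=-1))
        neg = self.match_head(torch.cat([a, neg_t], dim=-1))
        match_logits = torch.cat([pos, neg], dim=0)
        match_labels = torch.cat([torch.ones(a.size(0)), torch.zeros(a.size(0))]).long().to(a.device)
        loss_match = F.cross_entropy(match_logits, match_labels)

        return loss_contrastive + loss_match

if __name__ == "__main__":
    # Toy encoders over pre-extracted feature vectors, purely to exercise the losses.
    model = DSCLAPSketch(nn.Linear(128, 512), nn.Linear(64, 512), dim=512)
    loss = model(torch.randn(8, 128), torch.randn(8, 64))
    print(loss.item())

The shuffled-negative strategy for the matching loss is one common choice; harder negative mining or excluding accidental true-pair collisions in the permutation would be natural refinements.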