MoWE-Audio: Multitask AudioLLMs with Mixture of Weak Encoders

arXiv - EE - Audio and Speech Processing. Pub Date: 2024-09-10. DOI: arxiv-2409.06635

Wenyu Zhang, Shuo Sun, Bin Wang, Xunlong Zou, Zhuohan Liu, Yingxu He, Geyu Lin, Nancy F. Chen, Ai Ti Aw


Citation count: 0

Abstract


The rapid advancements in large language models (LLMs) have significantly enhanced natural language processing capabilities, facilitating the development of AudioLLMs that process and understand speech and audio inputs alongside text. Existing AudioLLMs typically combine a pre-trained audio encoder with a pre-trained LLM, which are subsequently fine-tuned on specific audio tasks. However, the pre-trained audio encoder has constrained capacity to capture features for new tasks and datasets. To address this, we propose incorporating mixtures of 'weak' encoders (MoWE) into the AudioLLM framework. MoWE supplements a base encoder with a pool of relatively lightweight encoders, selectively activated based on the audio input to enhance feature extraction without significantly increasing model size. Our empirical results demonstrate that MoWE effectively improves multi-task performance, broadening the applicability of AudioLLMs to more diverse audio tasks.
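The routing idea in the abstract (a base encoder that always runs, plus a pool of small encoders of which only a few are activated per input) can be illustrated with a minimal sketch. This is not the authors' implementation. Everything here is an assumption made for illustration: the toy random-projection encoders, the names `make_toy_encoder` and `MixtureOfWeakEncoders`, the softmax top-k router, and the choice to concatenate the mixed weak features onto the base features. The abstract does not specify how routing or combination is done.

```python
import math
import random
from typing import Callable, List, Sequence, Tuple

Vector = List[float]
Encoder = Callable[[Sequence[float]], Vector]


def make_toy_encoder(in_dim: int, out_dim: int, seed: int) -> Encoder:
    """Hypothetical stand-in for an audio encoder: fixed random linear map + tanh."""
    rng = random.Random(seed)
    weights = [[rng.uniform(-1.0, 1.0) for _ in range(in_dim)] for _ in range(out_dim)]

    def encode(x: Sequence[float]) -> Vector:
        return [math.tanh(sum(w * v for w, v in zip(row, x))) for row in weights]

    return encode


def softmax(scores: Sequence[float]) -> Vector:
    """Numerically stable softmax over a list of scores."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class MixtureOfWeakEncoders:
    """Base encoder plus a pool of small encoders with input-dependent top-k routing."""

    def __init__(self, in_dim: int, base_dim: int, weak_dim: int,
                 num_weak: int, top_k: int, seed: int = 0) -> None:
        if not 1 <= top_k <= num_weak:
            raise ValueError("top_k must be between 1 and num_weak")
        self.top_k = top_k
        self.weak_dim = weak_dim
        self.base = make_toy_encoder(in_dim, base_dim, seed)
        self.weak = [make_toy_encoder(in_dim, weak_dim, seed + 1 + i)
                     for i in range(num_weak)]
        # Assumed router: one score per weak encoder, computed from the input.
        self.router = make_toy_encoder(in_dim, num_weak, seed + 1 + num_weak)

    def encode(self, x: Sequence[float]) -> Tuple[Vector, List[int]]:
        """Return (features, indices of the weak encoders that were activated)."""
        gates = softmax(self.router(x))
        chosen = sorted(range(len(gates)), key=lambda i: gates[i],
                        reverse=True)[: self.top_k]
        norm = sum(gates[i] for i in chosen)  # renormalise over the selected set
        mixed = [0.0] * self.weak_dim
        for i in chosen:
            out = self.weak[i](x)  # only the selected weak encoders are evaluated
            weight = gates[i] / norm
            mixed = [m + weight * o for m, o in zip(mixed, out)]
        # Assumed combination: concatenate base features with the mixed weak features.
        return self.base(x) + mixed, chosen
```

For example, `MixtureOfWeakEncoders(in_dim=8, base_dim=6, weak_dim=3, num_weak=4, top_k=2).encode(x)` on an 8-dimensional input returns a 9-dimensional feature vector and the two selected encoder indices. Because only `top_k` of the `num_weak` small encoders run per input, the per-input compute grows with `top_k` rather than with the size of the pool.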
