{"title":"Enhancing Low-Resource Language and Instruction Following Capabilities of Audio Language Models","authors":"Potsawee Manakul, Guangzhi Sun, Warit Sirichotedumrong, Kasima Tharnpipitchai, Kunat Pipatanakul","doi":"arxiv-2409.10999","DOIUrl":null,"url":null,"abstract":"Audio language models can understand audio inputs and perform a range of\naudio-related tasks based on instructions, such as speech recognition and audio\ncaptioning, where the instructions are usually textual prompts. Audio language\nmodels are mostly initialized from pre-trained audio encoders and large\nlanguage models (LLMs). Although these pre-trained components were developed to\nsupport multiple languages, audio-language models are trained predominantly on\nEnglish data, which may limit their usability to only English instructions or\nEnglish speech inputs. First, this paper examines the performance of existing\naudio language models in an underserved language using Thai as an example. This\npaper demonstrates that, despite being built on multilingual backbones, audio\nlanguage models do not exhibit cross-lingual emergent abilities to low-resource\nlanguages. Second, this paper studies data mixture for developing audio\nlanguage models that are optimized for a target language as well as English. In\naddition. this paper integrates audio comprehension and speech\ninstruction-following capabilities into a single unified model. Our experiments\nprovide insights into data mixture for enhancing instruction-following\ncapabilities in both a low-resource language and English. Our model,\nTyphoon-Audio, outperforms existing open-source audio language models by a\nconsiderable margin, and it is comparable to state-of-the-art Gemini-1.5-Pro in\nboth English and Thai languages.","PeriodicalId":501284,"journal":{"name":"arXiv - EE - Audio and Speech Processing","volume":"167 1","pages":""},"PeriodicalIF":0.0000,"publicationDate":"2024-09-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - EE - Audio and Speech Processing","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2409.10999","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 0
Abstract
Audio language models can understand audio inputs and perform a range of audio-related tasks based on instructions, such as speech recognition and audio captioning, where the instructions are usually textual prompts. Audio language models are mostly initialized from pre-trained audio encoders and large language models (LLMs). Although these pre-trained components were developed to support multiple languages, audio language models are trained predominantly on English data, which may limit their usability to English instructions or English speech inputs. First, this paper examines the performance of existing audio language models in an underserved language, using Thai as an example. It demonstrates that, despite being built on multilingual backbones, audio language models do not exhibit cross-lingual emergent abilities in low-resource languages. Second, this paper studies data mixtures for developing audio language models that are optimized for a target language as well as English. In addition, this paper integrates audio comprehension and speech instruction-following capabilities into a single unified model. Our experiments provide insights into data mixtures for enhancing instruction-following capabilities in both a low-resource language and English. Our model, Typhoon-Audio, outperforms existing open-source audio language models by a considerable margin, and it is comparable to the state-of-the-art Gemini-1.5-Pro in both English and Thai.
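
To make the model composition mentioned in the abstract concrete, the following is a minimal sketch, assuming the common recipe of attaching a pre-trained audio encoder to a pre-trained LLM through a small adapter that projects audio features into the LLM embedding space. All module choices, names, and sizes here are illustrative assumptions, not the architecture or code of Typhoon-Audio.

```python
# Minimal sketch (not the paper's implementation) of a typical audio language
# model: a pre-trained audio encoder, an adapter projecting audio features into
# the LLM embedding space, and an LLM consuming audio tokens plus the embedded
# textual instruction. Module sizes and types are stand-ins.
import torch
import torch.nn as nn


class ToyAudioLanguageModel(nn.Module):
    def __init__(self, audio_dim=128, llm_dim=512, vocab_size=1000):
        super().__init__()
        # Stand-ins for a pre-trained audio encoder and a pre-trained LLM.
        self.audio_encoder = nn.GRU(audio_dim, audio_dim, batch_first=True)
        self.adapter = nn.Linear(audio_dim, llm_dim)  # audio -> LLM space
        self.text_embed = nn.Embedding(vocab_size, llm_dim)
        self.llm = nn.TransformerEncoder(
            nn.TransformerEncoderLayer(llm_dim, nhead=8, batch_first=True),
            num_layers=2,
        )
        self.lm_head = nn.Linear(llm_dim, vocab_size)

    def forward(self, audio_feats, instruction_ids):
        # Encode audio, project into the LLM embedding space, then prepend the
        # audio "tokens" to the embedded textual instruction prompt.
        audio_h, _ = self.audio_encoder(audio_feats)
        audio_tokens = self.adapter(audio_h)
        text_tokens = self.text_embed(instruction_ids)
        fused = torch.cat([audio_tokens, text_tokens], dim=1)
        return self.lm_head(self.llm(fused))


# Example: one utterance of 50 audio frames plus a 10-token textual instruction.
model = ToyAudioLanguageModel()
logits = model(torch.randn(1, 50, 128), torch.randint(0, 1000, (1, 10)))
print(logits.shape)  # torch.Size([1, 60, 1000])
```

Under this reading, studying data mixtures amounts to choosing how much English versus target-language (e.g. Thai) audio-instruction data to fine-tune such a model on, so that instruction following holds in both languages.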