A Suite for Acoustic Language Model Evaluation
Gallil Maimon, Amit Roth, Yossi Adi
arXiv - EE - Audio and Speech Processing, 2024-09-11, arXiv:2409.07437

Speech language models have recently demonstrated great potential as
universal speech processing systems. Such models can capture the
rich acoustic information in audio signals beyond the spoken content,
such as emotion and background noise. Despite this, benchmarks that
evaluate awareness of a wide range of acoustic aspects are lacking. To
help bridge this gap, we introduce SALMon, a novel evaluation suite
encompassing background noise, emotion, speaker identity and room impulse
response. The proposed benchmarks evaluate both the consistency of the
inspected element and how well it matches the spoken text. We follow a
modelling-based approach, measuring whether a model assigns correct samples
higher scores than incorrect ones. This approach makes the benchmark fast to
compute even for large models. We evaluate several speech language models on
SALMon, highlighting the strengths and weaknesses of each evaluated
method. Code and data are publicly available at
https://pages.cs.huji.ac.il/adiyoss-lab/salmon/ .
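
The modelling-based metric described above (a model should score the correct sample higher than the incorrect one) can be illustrated with a minimal sketch. This is not the authors' implementation: the function name `pairwise_accuracy`, the use of plain floats as scores, the example values, and the choice to count ties as failures are all assumptions made for illustration. In practice the scores would presumably come from the speech language model itself, for instance a likelihood it assigns to each audio sample.

```python
from typing import Sequence, Tuple


def pairwise_accuracy(score_pairs: Sequence[Tuple[float, float]]) -> float:
    """Return the fraction of pairs in which the correct sample outscores the incorrect one.

    Each pair is (score_of_correct_sample, score_of_incorrect_sample).
    Ties count as failures; this is an assumption, not something the abstract states.
    """
    if not score_pairs:
        raise ValueError("at least one (correct, incorrect) score pair is required")
    wins = sum(1 for correct, incorrect in score_pairs if correct > incorrect)
    return wins / len(score_pairs)


# Hypothetical scores, made up for illustration only (not results from the paper).
pairs = [(-10.2, -11.5), (-8.0, -7.9), (-12.3, -12.9), (-9.1, -9.4)]
accuracy = pairwise_accuracy(pairs)
print(f"{accuracy:.2f}")  # prints 0.75: the model prefers the correct sample in 3 of 4 pairs
```

Because each pair needs only two scoring passes and no generation, a metric of this shape stays cheap to compute as model size grows, which is consistent with the abstract's claim that the benchmark is fast even for large models.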