Few-Shot Keyword Spotting from Mixed Speech
Junming Yuan, Ying Shi, LanTian Li, Dong Wang, Askar Hamdulla
arXiv - CS - Sound, 2024-07-05
https://doi.org/arxiv-2407.06078

Few-shot keyword spotting (KWS) aims to detect unknown keywords with limited
training samples. A commonly used approach is the pre-training and fine-tuning
framework. While effective in clean conditions, this approach struggles with
mixed keyword spotting -- simultaneously detecting multiple keywords blended in
an utterance, which is crucial in real-world applications. Previous research
has proposed a Mix-Training (MT) approach to solve the problem; however, it has
never been tested in the few-shot scenario. In this paper, we investigate the
possibility of using MT and other relevant methods to solve the two practical
challenges together: few-shot and mixed speech. Experiments conducted on the
LibriSpeech and Google Speech Commands corpora demonstrate that MT is highly
effective on this task when employed in either the pre-training phase or the
fine-tuning phase. Moreover, combining SSL-based large-scale pre-training
(HuBERT) and MT fine-tuning yields very strong results in all the test
conditions.
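
The abstract does not say how mixed training examples are constructed, so the following is only an illustrative sketch of the general idea of training on blended keywords, not the authors' MT recipe. It assumes that a mixture is formed by sample-wise addition of two waveforms (the shorter one zero-padded) and that the target is a multi-hot vector marking every keyword present. The function names, the `gain_b` parameter, and the `(waveform, keyword_id)` dataset layout are all hypothetical.

```python
import random


def mix_utterances(wave_a, wave_b, gain_b=1.0):
    """Overlay two waveforms sample by sample; the shorter one is zero-padded.

    Assumption: simple additive mixing. The paper may use a different scheme.
    """
    n = max(len(wave_a), len(wave_b))
    a = list(wave_a) + [0.0] * (n - len(wave_a))
    b = list(wave_b) + [0.0] * (n - len(wave_b))
    return [x + gain_b * y for x, y in zip(a, b)]


def multi_hot(keyword_ids, num_keywords):
    """Build a label vector with 1.0 for every keyword present in the mixture."""
    label = [0.0] * num_keywords
    for k in keyword_ids:
        label[k] = 1.0
    return label


def make_mixed_example(dataset, num_keywords, rng):
    """Draw two distinct (waveform, keyword_id) pairs and blend them.

    `dataset` is a hypothetical list of (waveform, keyword_id) tuples.
    Returns the mixed waveform and its multi-hot target.
    """
    (wave_a, id_a), (wave_b, id_b) = rng.sample(dataset, 2)
    mixed = mix_utterances(wave_a, wave_b)
    return mixed, multi_hot([id_a, id_b], num_keywords)
```

Calling `make_mixed_example(dataset, num_keywords, random.Random(0))` on a small list of labelled waveforms returns one blended waveform together with a target in which both source keywords are marked, which is the kind of input a multi-label detector would be trained on under these assumptions.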