{"title":"用于音源分离的语义分组网络","authors":"Shentong Mo, Yapeng Tian","doi":"arxiv-2407.03736","DOIUrl":null,"url":null,"abstract":"Recently, audio-visual separation approaches have taken advantage of the\nnatural synchronization between the two modalities to boost audio source\nseparation performance. They extracted high-level semantics from visual inputs\nas the guidance to help disentangle sound representation for individual\nsources. Can we directly learn to disentangle the individual semantics from the\nsound itself? The dilemma is that multiple sound sources are mixed together in\nthe original space. To tackle the difficulty, in this paper, we present a novel\nSemantic Grouping Network, termed as SGN, that can directly disentangle sound\nrepresentations and extract high-level semantic information for each source\nfrom input audio mixture. Specifically, SGN aggregates category-wise source\nfeatures through learnable class tokens of sounds. Then, the aggregated\nsemantic features can be used as the guidance to separate the corresponding\naudio sources from the mixture. We conducted extensive experiments on\nmusic-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and\nVGG-Sound. The results demonstrate that our SGN significantly outperforms\nprevious audio-only methods and audio-visual models without utilizing\nadditional visual cues.","PeriodicalId":501178,"journal":{"name":"arXiv - CS - Sound","volume":null,"pages":null},"PeriodicalIF":0.0000,"publicationDate":"2024-07-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Semantic Grouping Network for Audio Source Separation\",\"authors\":\"Shentong Mo, Yapeng Tian\",\"doi\":\"arxiv-2407.03736\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"Recently, audio-visual separation approaches have taken advantage of the\\nnatural synchronization between the two modalities to boost audio source\\nseparation performance. They extracted high-level semantics from visual inputs\\nas the guidance to help disentangle sound representation for individual\\nsources. Can we directly learn to disentangle the individual semantics from the\\nsound itself? The dilemma is that multiple sound sources are mixed together in\\nthe original space. To tackle the difficulty, in this paper, we present a novel\\nSemantic Grouping Network, termed as SGN, that can directly disentangle sound\\nrepresentations and extract high-level semantic information for each source\\nfrom input audio mixture. Specifically, SGN aggregates category-wise source\\nfeatures through learnable class tokens of sounds. Then, the aggregated\\nsemantic features can be used as the guidance to separate the corresponding\\naudio sources from the mixture. We conducted extensive experiments on\\nmusic-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and\\nVGG-Sound. 
The results demonstrate that our SGN significantly outperforms\\nprevious audio-only methods and audio-visual models without utilizing\\nadditional visual cues.\",\"PeriodicalId\":501178,\"journal\":{\"name\":\"arXiv - CS - Sound\",\"volume\":null,\"pages\":null},\"PeriodicalIF\":0.0000,\"publicationDate\":\"2024-07-04\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"arXiv - CS - Sound\",\"FirstCategoryId\":\"1085\",\"ListUrlMain\":\"https://doi.org/arxiv-2407.03736\",\"RegionNum\":0,\"RegionCategory\":null,\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"\",\"JCRName\":\"\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"arXiv - CS - Sound","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/arxiv-2407.03736","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Semantic Grouping Network for Audio Source Separation
Recently, audio-visual separation approaches have taken advantage of the natural synchronization between the two modalities to boost audio source separation performance. They extract high-level semantics from visual inputs as guidance to help disentangle the sound representations of individual sources. Can we instead learn to disentangle the individual semantics directly from the sound itself? The difficulty is that multiple sound sources are mixed together in the original signal space. To tackle this, we present a novel Semantic Grouping Network, termed SGN, that directly disentangles sound representations and extracts high-level semantic information for each source from the input audio mixture. Specifically, SGN aggregates category-wise source features through learnable class tokens of sounds. The aggregated semantic features then serve as guidance to separate the corresponding audio sources from the mixture. We conducted extensive experiments on music-only and universal sound separation benchmarks: MUSIC, FUSS, MUSDB18, and VGG-Sound. The results demonstrate that our SGN significantly outperforms previous audio-only methods, as well as audio-visual models, without using any additional visual cues.
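
Since the abstract only sketches the mechanism, the following minimal PyTorch sketch may help make "learnable class tokens" and "semantic guidance" concrete. Everything here is an illustrative assumption rather than the paper's architecture: the feature dimensions, the use of cross-attention for token-to-feature grouping, and the mask-based separation head are plausible choices consistent with the abstract, not the authors' exact design.

```python
# Minimal sketch of the semantic-grouping idea described in the abstract.
# Assumptions (not from the paper): dimensions, cross-attention grouping,
# and a spectrogram-mask separation head are illustrative choices only.
import torch
import torch.nn as nn


class SemanticGrouping(nn.Module):
    """Aggregate category-wise source features via learnable class tokens."""

    def __init__(self, num_classes: int, dim: int):
        super().__init__()
        # One learnable token per sound category (e.g., violin, dog bark).
        self.class_tokens = nn.Parameter(torch.randn(num_classes, dim) * 0.02)
        self.attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)

    def forward(self, mix_feats: torch.Tensor) -> torch.Tensor:
        # mix_feats: (B, T, D) frame-level features of the audio mixture.
        B = mix_feats.size(0)
        tokens = self.class_tokens.unsqueeze(0).expand(B, -1, -1)  # (B, C, D)
        # Each class token attends over the mixture to collect the
        # features belonging to its category.
        grouped, _ = self.attn(query=tokens, key=mix_feats, value=mix_feats)
        return grouped  # (B, C, D) per-category semantic features


class GuidedSeparator(nn.Module):
    """Use the aggregated semantics to predict per-source spectrogram masks."""

    def __init__(self, num_classes: int, dim: int, n_freq: int):
        super().__init__()
        self.grouping = SemanticGrouping(num_classes, dim)
        self.mask_head = nn.Linear(2 * dim, n_freq)

    def forward(self, mix_feats: torch.Tensor, mix_spec: torch.Tensor):
        # mix_spec: (B, T, F) magnitude spectrogram of the mixture.
        semantics = self.grouping(mix_feats)                       # (B, C, D)
        B, T, D = mix_feats.shape
        C = semantics.size(1)
        # Pair every frame feature with every category embedding.
        frames = mix_feats.unsqueeze(1).expand(B, C, T, D)
        sems = semantics.unsqueeze(2).expand(B, C, T, D)
        masks = torch.sigmoid(self.mask_head(torch.cat([frames, sems], -1)))
        # Apply each category's mask to the shared mixture spectrogram.
        return masks * mix_spec.unsqueeze(1)                       # (B, C, T, F)


if __name__ == "__main__":
    model = GuidedSeparator(num_classes=10, dim=64, n_freq=257)
    feats = torch.randn(2, 100, 64)   # mixture features (B, T, D)
    spec = torch.rand(2, 100, 257)    # mixture spectrogram (B, T, F)
    print(model(feats, spec).shape)   # torch.Size([2, 10, 100, 257])
```

Pairing each class token with every frame feature is just one plausible way to turn per-category semantics into separation guidance; the paper's actual conditioning mechanism may differ.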