Yadong Guan, Jiqing Han, Hongwei Song, Shiwen Deng, Guibin Zheng, Tieran Zheng, Yongjun He
{"title":"基于声音活动感知的跨任务协作训练用于半监督声音事件检测","authors":"Yadong Guan;Jiqing Han;Hongwei Song;Shiwen Deng;Guibin Zheng;Tieran Zheng;Yongjun He","doi":"10.1109/TASLP.2024.3451983","DOIUrl":null,"url":null,"abstract":"The training of sound event detection (SED) models remains a challenge of insufficient supervision due to limited frame-wise labeled data. Mainstream research on this problem has adopted semi-supervised training strategies that generate pseudo-labels for unlabeled data and use these data for the training of a model. Recent works further introduce multi-task training strategies to impose additional supervision. However, the auxiliary tasks employed in these methods either lack frame-wise guidance or exhibit unsuitable task designs. Furthermore, they fail to exploit inter-task relationships effectively, which can serve as valuable supervision. In this paper, we introduce a novel task, sound occurrence and overlap detection (SOD), which detects predefined sound activity patterns, including non-overlapping and overlapping cases. On the basis of SOD, we propose a cross-task collaborative training framework that leverages the relationship between SED and SOD to improve the SED model. Firstly, by jointly optimizing the two tasks in a multi-task manner, the SED model is encouraged to learn features sensitive to sound activity. Subsequently, the cross-task consistency regularization is proposed to promote consistent predictions between SED and SOD. Finally, we propose a pseudo-label selection method that uses inconsistent predictions between the two tasks to identify potential wrong pseudo-labels and mitigate their confirmation bias. In the inference phase, only the trained SED model is used, thus no additional computation and storage costs are incurred. Extensive experiments on the DESED dataset demonstrate the effectiveness of our method.","PeriodicalId":13332,"journal":{"name":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","volume":"32 ","pages":"3947-3959"},"PeriodicalIF":4.1000,"publicationDate":"2024-08-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"0","resultStr":"{\"title\":\"Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection\",\"authors\":\"Yadong Guan;Jiqing Han;Hongwei Song;Shiwen Deng;Guibin Zheng;Tieran Zheng;Yongjun He\",\"doi\":\"10.1109/TASLP.2024.3451983\",\"DOIUrl\":null,\"url\":null,\"abstract\":\"The training of sound event detection (SED) models remains a challenge of insufficient supervision due to limited frame-wise labeled data. Mainstream research on this problem has adopted semi-supervised training strategies that generate pseudo-labels for unlabeled data and use these data for the training of a model. Recent works further introduce multi-task training strategies to impose additional supervision. However, the auxiliary tasks employed in these methods either lack frame-wise guidance or exhibit unsuitable task designs. Furthermore, they fail to exploit inter-task relationships effectively, which can serve as valuable supervision. In this paper, we introduce a novel task, sound occurrence and overlap detection (SOD), which detects predefined sound activity patterns, including non-overlapping and overlapping cases. On the basis of SOD, we propose a cross-task collaborative training framework that leverages the relationship between SED and SOD to improve the SED model. 
Firstly, by jointly optimizing the two tasks in a multi-task manner, the SED model is encouraged to learn features sensitive to sound activity. Subsequently, the cross-task consistency regularization is proposed to promote consistent predictions between SED and SOD. Finally, we propose a pseudo-label selection method that uses inconsistent predictions between the two tasks to identify potential wrong pseudo-labels and mitigate their confirmation bias. In the inference phase, only the trained SED model is used, thus no additional computation and storage costs are incurred. Extensive experiments on the DESED dataset demonstrate the effectiveness of our method.\",\"PeriodicalId\":13332,\"journal\":{\"name\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"volume\":\"32 \",\"pages\":\"3947-3959\"},\"PeriodicalIF\":4.1000,\"publicationDate\":\"2024-08-29\",\"publicationTypes\":\"Journal Article\",\"fieldsOfStudy\":null,\"isOpenAccess\":false,\"openAccessPdf\":\"\",\"citationCount\":\"0\",\"resultStr\":null,\"platform\":\"Semanticscholar\",\"paperid\":null,\"PeriodicalName\":\"IEEE/ACM Transactions on Audio, Speech, and Language Processing\",\"FirstCategoryId\":\"94\",\"ListUrlMain\":\"https://ieeexplore.ieee.org/document/10659190/\",\"RegionNum\":2,\"RegionCategory\":\"计算机科学\",\"ArticlePicture\":[],\"TitleCN\":null,\"AbstractTextCN\":null,\"PMCID\":null,\"EPubDate\":\"\",\"PubModel\":\"\",\"JCR\":\"Q1\",\"JCRName\":\"ACOUSTICS\",\"Score\":null,\"Total\":0}","platform":"Semanticscholar","paperid":null,"PeriodicalName":"IEEE/ACM Transactions on Audio, Speech, and Language Processing","FirstCategoryId":"94","ListUrlMain":"https://ieeexplore.ieee.org/document/10659190/","RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"Q1","JCRName":"ACOUSTICS","Score":null,"Total":0}
Sound Activity-Aware Based Cross-Task Collaborative Training for Semi-Supervised Sound Event Detection
The training of sound event detection (SED) models remains challenging due to insufficient supervision from limited frame-wise labeled data. Mainstream research on this problem adopts semi-supervised training strategies that generate pseudo-labels for unlabeled data and use them to train the model. Recent works further introduce multi-task training strategies to impose additional supervision. However, the auxiliary tasks employed in these methods either lack frame-wise guidance or exhibit unsuitable task designs. Furthermore, they fail to exploit inter-task relationships effectively, which could serve as valuable supervision. In this paper, we introduce a novel task, sound occurrence and overlap detection (SOD), which detects predefined sound activity patterns, including non-overlapping and overlapping cases. On the basis of SOD, we propose a cross-task collaborative training framework that leverages the relationship between SED and SOD to improve the SED model. First, by jointly optimizing the two tasks in a multi-task manner, the SED model is encouraged to learn features sensitive to sound activity. Second, a cross-task consistency regularization is proposed to promote consistent predictions between SED and SOD. Finally, we propose a pseudo-label selection method that uses inconsistent predictions between the two tasks to identify potentially wrong pseudo-labels and mitigate their confirmation bias. In the inference phase, only the trained SED model is used, so no additional computation or storage costs are incurred. Extensive experiments on the DESED dataset demonstrate the effectiveness of our method.
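To make the abstract's key ideas concrete, the following is a minimal sketch of one plausible realization, not the authors' released implementation: deriving SOD targets from frame-wise SED labels, a cross-task consistency term, and agreement-based pseudo-label selection. The three-class SOD definition (none / single / overlapping), the 0.5 decision threshold, the 0.8 agreement ratio, and all function and tensor names are illustrative assumptions.

```python
import torch
import torch.nn.functional as F


def sod_targets_from_sed(frame_labels: torch.Tensor) -> torch.Tensor:
    """Map frame-wise multi-label SED annotations (B, T, C) to SOD classes.

    Assumed SOD classes: 0 = no active event, 1 = exactly one active event
    (non-overlapping), 2 = two or more active events (overlapping).
    """
    num_active = frame_labels.sum(dim=-1)                 # (B, T) active-class count per frame
    targets = torch.zeros_like(num_active, dtype=torch.long)
    targets[num_active == 1] = 1
    targets[num_active >= 2] = 2
    return targets


def cross_task_consistency(sed_probs: torch.Tensor, sod_logits: torch.Tensor) -> torch.Tensor:
    """Encourage the SOD branch to agree with the activity pattern implied by SED.

    The pattern is derived from SED probabilities thresholded at an assumed 0.5,
    then used as a frame-wise target for the SOD branch.
    """
    implied = sod_targets_from_sed((sed_probs > 0.5).float())      # (B, T)
    return F.cross_entropy(sod_logits.transpose(1, 2), implied)    # logits reshaped to (B, 3, T)


def select_consistent_clips(sed_probs: torch.Tensor, sod_logits: torch.Tensor,
                            agree_ratio: float = 0.8) -> torch.Tensor:
    """Keep a clip's pseudo-labels only when the two tasks mostly agree frame-wise."""
    implied = sod_targets_from_sed((sed_probs > 0.5).float())      # pattern implied by SED
    sod_pred = sod_logits.argmax(dim=-1)                           # pattern predicted by SOD
    agreement = (implied == sod_pred).float().mean(dim=1)          # per-clip agreement rate
    return agreement >= agree_ratio                                # (B,) boolean selection mask
```

In a training loop built on this sketch, the boolean mask returned by `select_consistent_clips` would gate which pseudo-labeled clips contribute to the SED loss, while `cross_task_consistency` would be added as a regularization term on unlabeled data.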
Journal introduction:
The IEEE/ACM Transactions on Audio, Speech, and Language Processing covers audio, speech and language processing and the sciences that support them. In audio processing: transducers, room acoustics, active sound control, human audition, analysis/synthesis/coding of music, and consumer audio. In speech processing: areas such as speech analysis, synthesis, coding, speech and speaker recognition, speech production and perception, and speech enhancement. In language processing: speech and text analysis, understanding, generation, dialog management, translation, summarization, question answering and document indexing and retrieval, as well as general language modeling.