{"title":"Self-Supervised Audio-Visual Speaker Representation with Co-Meta Learning","authors":"Hui Chen, Hanyi Zhang, Longbiao Wang, Kong-Aik Lee, Meng Liu, J. Dang","doi":"10.1109/ICASSP49357.2023.10096925","DOIUrl":null,"url":null,"abstract":"In self-supervised speaker verification, the quality of pseudo labels determines the upper bound of its performance and it is not uncommon to end up with massive amount of unreliable pseudo labels. We observe that the complementary information in different modalities ensures a robust supervisory signal for audio and visual representation learning. This motivates us to propose an audio-visual self-supervised learning framework named Co-Meta Learning. Inspired by the Coteaching+, we design a strategy that allows the information of two modalities to be coordinated through the Update by Disagreement. Moreover, we use the idea of modelagnostic meta learning (MAML) to update the network parameters, which makes the hard samples of two modalities to be better resolved by the other modality through gradient regularization. Compared to the baseline, our proposed method achieves a 29.8%, 11.7% and 12.9% relative improvement on Vox-O, Vox-E and Vox-H trials of Voxceleb1 evaluation dataset respectively.","PeriodicalId":113072,"journal":{"name":"ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","volume":"118 1","pages":"0"},"PeriodicalIF":0.0000,"publicationDate":"2023-06-04","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":"3","resultStr":null,"platform":"Semanticscholar","paperid":null,"PeriodicalName":"ICASSP 2023 - 2023 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)","FirstCategoryId":"1085","ListUrlMain":"https://doi.org/10.1109/ICASSP49357.2023.10096925","RegionNum":0,"RegionCategory":null,"ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":null,"EPubDate":"","PubModel":"","JCR":"","JCRName":"","Score":null,"Total":0}
Citations: 3
Abstract
In self-supervised speaker verification, the quality of pseudo labels determines the upper bound of performance, and it is not uncommon to end up with a massive amount of unreliable pseudo labels. We observe that the complementary information in different modalities provides a robust supervisory signal for audio and visual representation learning. This motivates us to propose an audio-visual self-supervised learning framework named Co-Meta Learning. Inspired by Co-teaching+, we design a strategy that coordinates the information from the two modalities through Update by Disagreement. Moreover, we use the idea of model-agnostic meta-learning (MAML) to update the network parameters, which allows the hard samples of one modality to be better resolved by the other modality through gradient regularization. Compared to the baseline, our proposed method achieves 29.8%, 11.7% and 12.9% relative improvements on the Vox-O, Vox-E and Vox-H trials of the VoxCeleb1 evaluation set, respectively.
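To make the two mechanisms named in the abstract concrete, the sketch below is a minimal, hypothetical PyTorch illustration, not the authors' implementation. It shows (1) a Co-teaching+-style "Update by Disagreement" selection, where each modality keeps the small-loss samples among those on which the audio and visual branches disagree and hands them to the peer branch, and (2) a simplified first-order MAML-like update with an inner adaptation step and an outer (meta) step. All function and variable names (disagreement_small_loss, maml_style_step, keep_ratio, etc.) are assumptions for illustration.

```python
import torch
import torch.nn.functional as F


def disagreement_small_loss(logits_a, logits_v, labels, keep_ratio=0.7):
    """Co-teaching+-style selection (illustrative): restrict to samples where the
    audio and visual predictions disagree, then let each branch pick the small-loss
    subset that the *other* branch will be trained on."""
    pred_a = logits_a.argmax(dim=1)
    pred_v = logits_v.argmax(dim=1)
    disagree = (pred_a != pred_v).nonzero(as_tuple=True)[0]
    if disagree.numel() == 0:  # no disagreement: fall back to the whole batch
        disagree = torch.arange(labels.size(0), device=labels.device)

    loss_a = F.cross_entropy(logits_a[disagree], labels[disagree], reduction="none")
    loss_v = F.cross_entropy(logits_v[disagree], labels[disagree], reduction="none")
    k = max(1, int(keep_ratio * disagree.numel()))

    idx_for_visual = disagree[loss_a.topk(k, largest=False).indices]  # audio selects for visual
    idx_for_audio = disagree[loss_v.topk(k, largest=False).indices]   # visual selects for audio
    return idx_for_audio, idx_for_visual


def maml_style_step(net, opt, x_inner, y_inner, x_outer, y_outer, inner_lr=0.01):
    """Simplified first-order MAML-like update (illustrative): adapt temporarily on the
    inner samples, compute the loss on the outer samples with the adapted weights, then
    apply that outer gradient at the original (pre-adaptation) weights."""
    params = list(net.parameters())

    # Inner step: temporary adaptation on the selected samples.
    inner_loss = F.cross_entropy(net(x_inner), y_inner)
    grads = torch.autograd.grad(inner_loss, params)
    backup = [p.detach().clone() for p in params]
    with torch.no_grad():
        for p, g in zip(params, grads):
            p -= inner_lr * g

    # Outer step: loss on the other selection, evaluated with the adapted weights
    # (first-order approximation; no second-order gradients).
    outer_loss = F.cross_entropy(net(x_outer), y_outer)
    opt.zero_grad()
    outer_loss.backward()

    # Restore the pre-adaptation weights, then apply the outer gradient there.
    with torch.no_grad():
        for p, b in zip(params, backup):
            p.copy_(b)
    opt.step()
    return outer_loss.item()
```

Under these assumptions, one mini-batch of pseudo-labeled data would be passed through both branches to obtain logits, disagreement_small_loss would produce the peer-selected index sets, and maml_style_step would then be called once per branch, using the peer-selected samples for the inner adaptation and the branch's own selection for the outer loss. How the paper actually couples the two branches and regularizes the gradients may differ from this sketch.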