EMVGAN: Emotion-Aware Music-Video Common Representation Learning via Generative Adversarial Networks
Yu-Chih Tsai, Tse-Yu Pan, Ting-Yang Kao, Yi-Hsuan Yang, Min-Chun Hu
DOI: 10.1145/3512730.3533718 (https://doi.org/10.1145/3512730.3533718)
Published: 2022-06-27
Citations: 1
Abstract
Music can enhance our emotional reactions to videos and images, while videos and images can enrich our emotional response to music. Cross-modality retrieval technology can therefore be used to recommend appropriate music for a given video and vice versa. However, the heterogeneity gap caused by the inconsistent distributions of different data modalities complicates learning a common representation space across modalities. Accordingly, we propose an emotion-aware music-video cross-modal generative adversarial network (EMVGAN) model that builds an affective common embedding space to bridge the heterogeneity gap between modalities. The evaluation results show that the proposed EMVGAN model learns affective common representations with convincing performance and outperforms existing models. Furthermore, the satisfactory performance of the proposed network encouraged us to undertake the bidirectional music-video retrieval task.
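The core idea, learning a modality-invariant yet emotion-aware embedding space adversarially, can be illustrated with a short sketch. The PyTorch code below is a minimal, hypothetical illustration of that general recipe, not the architecture reported in the paper: the feature dimensions, network sizes, emotion-label setup, and loss terms are all assumptions made for the example. Two encoders map music and video features into a shared space, a modality discriminator is trained adversarially so the two embedding distributions become indistinguishable, and a shared emotion classifier keeps the space aligned with affective labels.

```python
# Hypothetical sketch of adversarial common-representation learning for two
# modalities. All dimensions and losses are illustrative assumptions, not the
# EMVGAN architecture reported in the paper.
import torch
import torch.nn as nn

EMBED_DIM = 128   # assumed size of the shared embedding space
N_EMOTIONS = 4    # assumed number of emotion classes

def encoder(in_dim: int) -> nn.Module:
    """Projects one modality's features into the shared embedding space."""
    return nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(),
                         nn.Linear(256, EMBED_DIM))

music_enc = encoder(in_dim=512)    # assumed music feature size
video_enc = encoder(in_dim=1024)   # assumed video feature size

# Discriminator tries to tell which modality an embedding came from; the
# encoders are trained to fool it, pushing both embedding distributions
# toward a modality-invariant common space.
disc = nn.Sequential(nn.Linear(EMBED_DIM, 64), nn.ReLU(), nn.Linear(64, 1))

# Emotion classifier keeps the shared space "emotion-aware".
emo_clf = nn.Linear(EMBED_DIM, N_EMOTIONS)

bce = nn.BCEWithLogitsLoss()
ce = nn.CrossEntropyLoss()
opt_d = torch.optim.Adam(disc.parameters(), lr=1e-4)
opt_g = torch.optim.Adam(list(music_enc.parameters()) +
                         list(video_enc.parameters()) +
                         list(emo_clf.parameters()), lr=1e-4)

def train_step(music_x, video_x, emotion_y):
    z_m, z_v = music_enc(music_x), video_enc(video_x)

    # 1) Discriminator step: music embeddings -> label 1, video -> label 0.
    opt_d.zero_grad()
    d_loss = bce(disc(z_m.detach()), torch.ones(len(z_m), 1)) + \
             bce(disc(z_v.detach()), torch.zeros(len(z_v), 1))
    d_loss.backward()
    opt_d.step()

    # 2) Encoder step: fool the discriminator (swapped labels) and predict
    #    the shared emotion label so paired music/video land close together
    #    in an affective common space.
    opt_g.zero_grad()
    adv_loss = bce(disc(z_m), torch.zeros(len(z_m), 1)) + \
               bce(disc(z_v), torch.ones(len(z_v), 1))
    emo_loss = ce(emo_clf(z_m), emotion_y) + ce(emo_clf(z_v), emotion_y)
    g_loss = adv_loss + emo_loss
    g_loss.backward()
    opt_g.step()
    return d_loss.item(), g_loss.item()

# Toy usage with random features and one shared emotion label per pair.
if __name__ == "__main__":
    m = torch.randn(8, 512)
    v = torch.randn(8, 1024)
    y = torch.randint(0, N_EMOTIONS, (8,))
    print(train_step(m, v, y))
```

In this sketch the adversarial term drives the two modalities toward a common distribution while the shared emotion classifier ties that space to affect, which is the general mechanism the abstract describes for bridging the heterogeneity gap.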