MSSPQ: Multiple Semantic Structure-Preserving Quantization for Cross-Modal Retrieval
Lei Zhu, Liewu Cai, Jiayu Song, Xinghui Zhu, Chengyuan Zhang, Shichao Zhang
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531417
Cross-modal hashing is a hot topic in the multimedia community; it aims to generate compact hash codes from multimedia content for efficient cross-modal search. Two challenges cannot be ignored: (1) how to efficiently enhance cross-modal semantic mining, which is essential for cross-modal hash code learning, and (2) how to combine multiple kinds of semantic correlation learning to improve semantic similarity preservation. To this end, this paper proposes a novel end-to-end cross-modal hashing approach, named Multiple Semantic Structure-Preserving Quantization (MSSPQ), which integrates a deep hashing model with multiple semantic correlation learning to boost hash learning performance. The multiple semantic correlation learning consists of inter-modal and intra-modal pairwise correlation learning and cosine correlation learning, which can comprehensively capture cross-modal consistent semantics and preserve semantic similarity. Extensive experiments on three multimedia datasets confirm that the proposed method outperforms the baselines.
DiGAN: Directional Generative Adversarial Network for Object Transfiguration
Zhen Luo, Yingfang Zhang, Pei Zhong, Jingjing Chen, Donglong Chen
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531400
The concept of cycle consistency in couple mapping has helped CycleGAN achieve remarkable performance in image-to-image translation. However, its limitations in object transfiguration have not yet been adequately addressed. To alleviate the previous problems of incorrect transformation position, degeneration, and artifacts, this work presents a new approach called Directional Generative Adversarial Network (DiGAN) for object transfiguration. The contribution of this work is threefold. First, paired directional generators are designed for both intra-domain and inter-domain generation. Second, a segmentation network based on Mask R-CNN is introduced to build conditional inputs for both the generators and the discriminators. Third, a feature loss and a segmentation loss are added to optimize the model. Experimental results indicate that DiGAN surpasses CycleGAN and AttentionGAN on horse-to-zebra mapping, with a 17.2% and 60.9% higher Inception Score, a 15.5% and 2.05% lower Fréchet Inception Distance, and a 14.2% and 15.6% lower VGG distance, respectively.
VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP
Yikang Li, Jenhao Hsiao, C. Ho
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531429
Video-text retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant videos from a large and unlabelled dataset given textual queries. Existing methods that simply pool the image features from frames (e.g., based on the CLIP encoder [14]) to build the video descriptor often yield sub-optimal video-text search accuracy, since the information from the different modalities is not fully exchanged and aligned. In this paper, we propose a novel dual-encoder model to address the challenging video-text retrieval problem, which uses a highly efficient cross-attention module to facilitate the information exchange between the modalities (i.e., video and text). The proposed VideoCLIP is evaluated on two benchmark video-text datasets, MSRVTT and DiDeMo, and the results show that our model outperforms existing state-of-the-art methods while retrieving much faster than traditional query-agnostic search models.
Selective Hypergraph Convolutional Networks for Skeleton-based Action Recognition
Yiran Zhu, Guangji Huang, Xing Xu, Yanli Ji, Fumin Shen
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531367
In skeleton-based action recognition, Graph Convolutional Networks (GCNs) have achieved remarkable performance, since the skeleton representation of human action can be naturally modeled by a graph structure. Most existing GCN-based methods extract skeleton features from single-scale joint information while neglecting valuable multi-scale contextual information. Besides, the strided convolution commonly used in the temporal dimension indiscriminately filters out the keyframes we expect to preserve, leading to a loss of keyframe information. To address these issues, we propose a novel Selective Hypergraph Convolution Network, dubbed Selective-HCN, which stacks two key modules: Selective-scale Hypergraph Convolution (SHC) and Selective-frame Temporal Convolution (STC). The SHC module represents the human skeleton as both a graph and a hypergraph to fully extract multi-scale information, and selectively fuses features at various scales. Instead of traditional strided temporal convolution, the STC module adaptively selects keyframes and filters redundant frames according to frame importance. Extensive experiments on two challenging skeleton action benchmarks, i.e., NTU-RGB+D and Skeleton-Kinetics, demonstrate the superiority and effectiveness of the proposed method.
Flexible Order Aware Sequential Recommendation
Mingda Qian, Xiaoyan Gu, Lingyang Chu, Feifei Dai, Haihui Fan, Borang Li
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531407
Sequential recommendation can dynamically model user interests, which is valuable since users' interests may change rapidly over time. Traditional sequential recommendation methods assume that user behaviors are rigidly ordered and sequentially dependent. However, some user behaviors have flexible orders, meaning that they may occur in any order and are not sequentially dependent. Traditional methods may therefore capture inaccurate user interests based on wrong dependencies. Motivated by this, several methods identify flexible orders by continuity or similarity. However, these methods fail to comprehensively understand the nature of flexible orders, since continuity and similarity do not determine order flexibility. Therefore, they may misidentify flexible orders, leading to inappropriate recommendations. To address these issues, we propose a Flexible Order aware Sequential Recommendation (FOSR) method to identify flexible orders comprehensively. We argue that an order's flexibility is highly related to the frequency of item pair co-occurrences. In light of this, FOSR employs a probabilistic flexible order evaluation module to simulate item pair frequencies and infer accurate order flexibilities. A frequency labeling module extracts labels from the real item pair frequencies to guide the order flexibility measurement. Given the measured order flexibilities, we develop a flexible order aware self-attention module to comprehensively model dependencies from flexible orders and learn dynamic user interests effectively. Extensive experiments on four benchmark datasets show that our model outperforms various state-of-the-art sequential recommendation methods.
EmoMTB: Emotion-aware Music Tower Blocks
Alessandro B. Melchiorre, D. Penz, Christian Ganhör, Oleg Lesota, Vasco Fragoso, Florian Friztl, Emilia Parada-Cabaleiro, Franz Schubert, M. Schedl
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531351
We introduce Emotion-aware Music Tower Blocks (EmoMTB), an audiovisual interface for exploring large music collections. It creates a musical landscape by adopting the metaphor of a city, where similar songs are grouped into the same building and nearby buildings form neighborhoods of particular genres. To personalize the user experience, an underlying classifier monitors textual user-generated content, predicting the user's emotional state and adapting the audiovisual elements of the interface accordingly. EmoMTB enables users to explore different musical styles either within their comfort zone or outside of it. Besides tailoring the results of the recommender engine to match the affective state of the user, EmoMTB offers a unique way to discover and enjoy music. It supports exploring a collection of roughly half a million streamed songs, using a regular smartphone as a control interface to navigate the landscape.
Fashion Style-Aware Embeddings for Clothing Image Retrieval
Rino Naka, Marie Katsurai, Keisuke Yanagi, Ryosuke Goto
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531433
Clothing image retrieval is becoming increasingly important as users on social media increasingly enjoy sharing their daily outfits. Most conventional methods offer single query-based retrieval and depend on visual features learnt via target classification training. This paper presents an embedding learning framework that uses novel style description features available in users' posts, allowing image-based and multiple choice-based queries for practical clothing image retrieval. Specifically, the proposed method exploits the following complementary information for representing fashion styles: season tags, style tags, users' heights, and silhouette descriptions. We then learn embeddings based on a quadruplet loss that considers the ranked pairings of the visual features and the proposed style description features, enabling flexible outfit search using either of these two types of features as queries. Experiments conducted on WEAR posts demonstrated the effectiveness of the proposed method compared with several baseline methods.
Weakly Supervised Pediatric Bone Age Assessment Using Ultrasonic Images via Automatic Anatomical RoI Detection
Yunyan Yan, Chuanbin Liu, Hongtao Xie, Sicheng Zhang, Zhendong Mao
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531436
Bone age assessment (BAA) is vital in pediatric clinical diagnosis. Existing deep learning methods predict bone age based on Region of Interest (RoI) detection or segmentation of hand radiographs, which requires expensive annotations. The imaging limitations and cost of the radiographic technique also hinder their clinical application. Compared with X-ray images, ultrasonic images are clean, cheap, and flexible, but deep learning research on ultrasonic BAA remains a blank space. To this end, we propose a weakly supervised interpretable framework, USB-Net, that uses ultrasonic pelvis images and only image-level age annotations. USB-Net consists of an automatic anatomical RoI detection stage and an age assessment stage. In the detection stage, USB-Net locates the discriminative anatomical RoIs of the pelvis through an attention heatmap without any extra RoI supervision. In the assessment stage, the cropped anatomical RoI patch is fed as fine-grained input to estimate age. In addition, we provide the first ultrasonic BAA dataset, composed of 1644 ultrasonic hip joint images with image-level labels of age and gender. The experimental results verify that our model attends to regions consistent with human knowledge and achieves a mean absolute error (MAE) of 16.24 days on the USBAA dataset.
Music-to-Dance Generation with Multiple Conformer
Mingao Zhang, Changhong Liu, Yong Chen, Zhenchun Lei, Mingwen Wang
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531430
Music-to-dance generation must consider both the kinematics of dance, which are highly complex and non-linear, and the connection between music and dance movement, which is far from deterministic. Existing approaches attempt to address the limited-creativity problem, but the task remains very challenging. First, it is a long-term sequence-to-sequence task. Second, the extracted motion keypoints are noisy. Last, there exist local and global dependencies within both the music sequence and the dance motion sequence. To address these issues, we propose a novel autoregressive generative framework that predicts future motions based on past motions and the music. The framework contains a music conformer, a motion conformer, and a cross-modal conformer: the conformers encode the music and motion sequences, and the cross-modal conformer is further adapted to the noisy dance motion data, enabling it not only to capture local and global dependencies among the sequences but also to reduce the effect of noisy data. Quantitative and qualitative experimental results on a publicly available music-to-dance dataset demonstrate that our method improves greatly upon the baselines and can generate long-term coherent dance motions that are well coordinated with the music.
CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
Yaoxin Zhuo, Yikang Li, Jenhao Hsiao, C. Ho, Baoxin Li
Proceedings of the 2022 International Conference on Multimedia Retrieval, June 2022. DOI: https://doi.org/10.1145/3512527.3531381
With the ever-increasing multimedia data on the Web, cross-modal video-text retrieval has received a lot of attention in recent years. Deep cross-modal hashing approaches utilize the Hamming space to achieve fast retrieval. However, most existing algorithms have difficulty seeking or constructing a well-defined joint semantic space. In this paper, an unsupervised deep cross-modal video-text hashing approach, CLIP4Hashing, is proposed, which mitigates the difficulty of bridging different modalities in the Hamming space by building a single hashing net on top of the pre-trained CLIP model. The approach is enhanced by two novel techniques, a dynamic weighting strategy and a min-max hashing layer, which are found to be the main sources of the performance gain. Compared with conventional deep cross-modal hashing algorithms, CLIP4Hashing does not require data-specific hyper-parameters. Evaluated on three challenging video-text benchmark datasets, CLIP4Hashing significantly outperforms existing state-of-the-art hashing algorithms. Additionally, with larger bit sizes (e.g., 2048 bits), CLIP4Hashing can even deliver competitive performance compared with results based on non-hashing features.