
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

MSSPQ: Multiple Semantic Structure-Preserving Quantization for Cross-Modal Retrieval
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531417
Lei Zhu, Liewu Cai, Jiayu Song, Xinghui Zhu, Chengyuan Zhang, Shichao Zhang
Cross-modal hashing is a hot topic in the multimedia community: it generates compact hash codes from multimedia content for efficient cross-modal search. Two challenges cannot be ignored: (1) how to efficiently enhance cross-modal semantic mining, which is essential for cross-modal hash code learning, and (2) how to combine multiple kinds of semantic correlation learning to improve semantic similarity preservation. To this end, this paper proposes a novel end-to-end cross-modal hashing approach, named Multiple Semantic Structure-Preserving Quantization (MSSPQ), which integrates a deep hashing model with multiple semantic correlation learning to boost hash learning performance. The multiple semantic correlation learning consists of inter-modal and intra-modal pairwise correlation learning and Cosine correlation learning, which comprehensively captures cross-modal consistent semantics and realizes semantic similarity preservation. Extensive experiments on three multimedia datasets confirm that the proposed method outperforms the baselines.
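For readers who want a concrete picture of how such a combined objective can look, the sketch below (assuming PyTorch and random toy embeddings) mixes an inter-modal and intra-modal pairwise likelihood term with a cosine-correlation term; the loss forms and the weights `alpha`/`beta` are illustrative assumptions, not MSSPQ's exact formulation.

```python
import torch
import torch.nn.functional as F

def pairwise_similarity_loss(a, b, sim):
    """Negative log-likelihood over pairwise inner products.

    a, b: (N, d) embeddings; sim: (N, N) binary matrix, 1 if a pair shares
    a label. This is the common deep cross-modal hashing likelihood, used
    here only as a stand-in for the paper's pairwise correlation terms.
    """
    theta = a @ b.t() / 2.0                       # pairwise inner products
    return torch.mean(F.softplus(theta) - sim * theta)

def cosine_correlation_loss(a, b, sim):
    """Push cosine similarity toward 1 for similar pairs, 0 otherwise."""
    cos = F.normalize(a, dim=1) @ F.normalize(b, dim=1).t()
    return F.mse_loss(cos, sim)

def multi_semantic_loss(img, txt, sim, alpha=1.0, beta=1.0):
    inter = pairwise_similarity_loss(img, txt, sim)           # inter-modal
    intra = (pairwise_similarity_loss(img, img, sim) +
             pairwise_similarity_loss(txt, txt, sim)) / 2     # intra-modal
    cosine = cosine_correlation_loss(img, txt, sim)
    return inter + alpha * intra + beta * cosine

# toy usage with random embeddings and a random similarity matrix
img, txt = torch.randn(8, 64), torch.randn(8, 64)
sim = (torch.rand(8, 8) > 0.5).float()
print(multi_semantic_loss(img, txt, sim).item())
```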
Citations: 3
DiGAN: Directional Generative Adversarial Network for Object Transfiguration
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531400
Zhen Luo, Yingfang Zhang, Pei Zhong, Jingjing Chen, Donglong Chen
The concept of cycle consistency in couple mapping has helped CycleGAN achieve remarkable performance in image-to-image translation. However, its limitations in object transfiguration have not yet been satisfactorily solved. To alleviate the previous problems of wrong transformation position, degeneration, and artifacts, this work presents a new approach for object transfiguration called Directional Generative Adversarial Network (DiGAN). The major contribution of this work is threefold. First, paired directional generators are designed for both intra-domain and inter-domain generation. Second, a segmentation network based on Mask R-CNN is introduced to build conditional inputs for both generators and discriminators. Third, a feature loss and a segmentation loss are added to optimize the model. Experimental results on horse-to-zebra mapping indicate that DiGAN surpasses CycleGAN and AttentionGAN with a 17.2% and 60.9% higher Inception Score, a 15.5% and 2.05% lower Fréchet Inception Distance, and a 14.2% and 15.6% lower VGG distance, respectively.
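The following is a minimal sketch, in PyTorch, of how the loss terms named above (adversarial, cycle, feature, and segmentation losses) might be combined for the generator, and how a Mask R-CNN mask could condition the input by channel concatenation; all tensor names, shapes, and weights are hypothetical assumptions, not the paper's implementation.

```python
import torch
import torch.nn.functional as F

def digan_generator_loss(real_x, rec_x, mask_x, d_fake_logits,
                         feat_real, feat_rec, seg_pred,
                         lambda_cyc=10.0, lambda_feat=1.0, lambda_seg=1.0):
    """Illustrative combination of the loss terms named in the abstract.
    Every tensor is a hypothetical intermediate output: d_fake_logits from
    the discriminator on the translated image, feat_* from a feature
    extractor applied to the real and reconstructed images, and seg_pred a
    segmentation map on the translated image that should still match the
    source mask mask_x."""
    adv = F.binary_cross_entropy_with_logits(
        d_fake_logits, torch.ones_like(d_fake_logits))     # fool the critic
    cyc = F.l1_loss(rec_x, real_x)                          # cycle consistency
    feat = F.l1_loss(feat_rec, feat_real)                   # feature loss
    seg = F.binary_cross_entropy(seg_pred, mask_x)          # segmentation loss
    return adv + lambda_cyc * cyc + lambda_feat * feat + lambda_seg * seg

# the segmentation mask can also condition the generator, e.g. by channel concat
x, mask = torch.rand(1, 3, 64, 64), (torch.rand(1, 1, 64, 64) > 0.5).float()
gen_in = torch.cat([x, mask], dim=1)                        # (1, 4, 64, 64)
loss = digan_generator_loss(x, torch.rand_like(x), mask,
                            torch.randn(1, 1), torch.randn(1, 256),
                            torch.randn(1, 256), torch.rand(1, 1, 64, 64))
print(gen_in.shape, loss.item())
```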
Citations: 0
VideoCLIP: A Cross-Attention Model for Fast Video-Text Retrieval Task with Image CLIP
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531429
Yikang Li, Jenhao Hsiao, C. Ho
Video-text retrieval is an essential task in cross-modal information retrieval, i.e., retrieving relevant videos from a large and unlabelled dataset given textual queries. Existing methods that simply pool the image features (e.g., based on the CLIP encoder [14]) from frames to build the video descriptor often result in sub-optimal video-text search accuracy, since the information among different modalities is not fully exchanged and aligned. In this paper, we propose a novel dual-encoder model to address the challenging video-text retrieval problem, which uses a highly efficient cross-attention module to facilitate the information exchange between multiple modalities (i.e., video and text). The proposed VideoCLIP is evaluated on two benchmark video-text datasets, MSRVTT and DiDeMo, and the results show that our model outperforms existing state-of-the-art methods while retrieving much faster than the traditional query-agnostic search model.
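A minimal sketch of the general idea of cross-attention between text tokens and per-frame features is shown below; it assumes PyTorch's nn.MultiheadAttention and 512-dimensional toy features, and is not the paper's exact architecture.

```python
import torch
import torch.nn as nn

class FrameTextCrossAttention(nn.Module):
    """Toy cross-attention block: text token features attend over frame
    features to build a text-conditioned video descriptor."""

    def __init__(self, dim=512, heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, frame_feats):
        # text_tokens: (B, T_txt, D) queries; frame_feats: (B, T_frm, D) keys/values
        fused, _ = self.attn(text_tokens, frame_feats, frame_feats)
        fused = self.norm(fused + text_tokens)        # residual connection
        return fused.mean(dim=1)                      # pooled joint descriptor

# toy usage: CLIP-sized 512-d features for 12 frames and 8 text tokens
video = torch.randn(2, 12, 512)
text = torch.randn(2, 8, 512)
print(FrameTextCrossAttention()(text, video).shape)   # torch.Size([2, 512])
```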
Citations: 3
Selective Hypergraph Convolutional Networks for Skeleton-based Action Recognition
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531367
Yiran Zhu, Guangji Huang, Xing Xu, Yanli Ji, Fumin Shen
In skeleton-based action recognition, Graph Convolutional Networks (GCNs) have achieved remarkable performance, since the skeleton representation of human action can be naturally modeled by a graph structure. Most existing GCN-based methods extract skeleton features by exploiting single-scale joint information while neglecting valuable multi-scale contextual information. Besides, the commonly used strided convolution in the temporal dimension evenly filters out keyframes we expect to preserve, leading to the loss of keyframe information. To address these issues, we propose a novel Selective Hypergraph Convolution Network, dubbed Selective-HCN, which stacks two key modules: Selective-scale Hypergraph Convolution (SHC) and Selective-frame Temporal Convolution (STC). The SHC module represents the human skeleton as both a graph and a hypergraph to fully extract multi-scale information and selectively fuse features at various scales. Instead of traditional strided temporal convolution, the STC module adaptively selects keyframes and filters redundant frames according to frame importance. Extensive experiments on two challenging skeleton action benchmarks, NTU-RGB+D and Skeleton-Kinetics, demonstrate the superiority and effectiveness of our proposed method.
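As a point of reference, the sketch below implements a plain hypergraph convolution over an incidence matrix (joints grouped into hyperedges such as body parts), assuming PyTorch; it only illustrates the basic operation that a hypergraph-based module builds on, not the SHC or STC modules themselves.

```python
import torch
import torch.nn as nn

class HypergraphConv(nn.Module):
    """Plain hypergraph convolution X' = Dv^-1/2 H De^-1 H^T Dv^-1/2 X W,
    aggregating joint features over hyperedges."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.proj = nn.Linear(in_dim, out_dim)

    def forward(self, x, H):
        # x: (N, C) joint features; H: (N, E) incidence matrix (joint in hyperedge)
        dv = H.sum(dim=1).clamp(min=1).pow(-0.5)      # vertex degree^-1/2
        de = H.sum(dim=0).clamp(min=1).reciprocal()   # hyperedge degree^-1
        A = (dv.unsqueeze(1) * H) * de                # Dv^-1/2 H De^-1
        A = A @ (H.t() * dv.unsqueeze(0))             # ... H^T Dv^-1/2
        return torch.relu(A @ self.proj(x))

# 25 joints, 5 hyperedges (e.g. torso and four limbs), 64-d features
H = (torch.rand(25, 5) > 0.7).float()
x = torch.randn(25, 64)
print(HypergraphConv(64, 128)(x, H).shape)            # torch.Size([25, 128])
```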
Citations: 4
Flexible Order Aware Sequential Recommendation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531407
Mingda Qian, Xiaoyan Gu, Lingyang Chu, Feifei Dai, Haihui Fan, Borang Li
Sequential recommendations can dynamically model user interests, which has great value since users' interests may change rapidly with time. Traditional sequential recommendation methods assume that the user behaviors are rigidly ordered and sequentially dependent. However, some user behaviors have flexible orders, meaning the behaviors may occur in any order and are not sequentially dependent. Therefore, traditional methods may capture inaccurate user interests based on wrong dependencies. Motivated by this, several methods identify flexible orders by continuity or similarity. However, these methods fail to comprehensively understand the nature of flexible orders since continuity or similarity do not determine order flexibilities. Therefore, these methods may misidentify flexible orders, leading to inappropriate recommendations. To address these issues, we propose a Flexible Order aware Sequential Recommendation (FOSR) method to identify flexible orders comprehensively. We argue that orders' flexibilities are highly related to the frequencies of item pair co-occurrences. In light of this, FOSR employs a probabilistic based flexible order evaluation module to simulate item pair frequencies and infer accurate order flexibilities. The frequency labeling module extracts labels from the real item pair frequencies to guide the order flexibility measurement. Given the measured order flexibilities, we develop a flexible order aware self-attention module to model dependencies from flexible orders comprehensively and learn dynamic user interests effectively. Extensive experiments on four benchmark datasets show that our model outperforms various state-of-the-art sequential recommendation methods.
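A toy illustration of the underlying intuition that item-pair co-occurrence frequencies signal order flexibility is given below; the windowed counting and the normalization are simple assumptions for illustration, not the paper's probabilistic evaluation or frequency labeling modules.

```python
from collections import Counter

def pair_frequencies(sequences, window=3):
    """Count how often two items co-occur within a small window,
    irrespective of order. Frequent co-occurrence in either order hints
    that the pair's order is flexible."""
    freq = Counter()
    for seq in sequences:
        for i, a in enumerate(seq):
            for b in seq[i + 1:i + 1 + window]:
                freq[frozenset((a, b))] += 1
    return freq

def order_flexibility(freq, a, b, total):
    # normalized co-occurrence frequency as a crude flexibility score in [0, 1]
    return freq[frozenset((a, b))] / max(total, 1)

# toy interaction logs: "phone" and "case" appear in both orders
logs = [["phone", "case", "charger"], ["case", "phone", "headset"]]
freq = pair_frequencies(logs)
print(order_flexibility(freq, "phone", "case", sum(freq.values())))
```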
Citations: 2
EmoMTB: Emotion-aware Music Tower Blocks
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531351
Alessandro B. Melchiorre, D. Penz, Christian Ganhör, Oleg Lesota, Vasco Fragoso, Florian Friztl, Emilia Parada-Cabaleiro, Franz Schubert, M. Schedl
We introduce Emotion-aware Music Tower Blocks (EmoMTB), an audiovisual interface for exploring large music collections. Adopting the metaphor of a city, it creates a musical landscape in which similar songs are grouped into the same building and nearby buildings form neighborhoods of particular genres. To personalize the user experience, an underlying classifier monitors textual user-generated content, predicting the user's emotional state and adapting the audiovisual elements of the interface accordingly. EmoMTB enables users to explore different musical styles either within their comfort zone or outside of it. Besides tailoring the results of the recommender engine to match the affective state of the user, EmoMTB offers a unique way to discover and enjoy music. It supports exploring a collection of circa half a million streamed songs, using a regular smartphone as a control interface to navigate the landscape.
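As a rough illustration of the adaptation loop (predict an emotional state from user text, then adjust an audiovisual element), the sketch below uses a hypothetical keyword lexicon and color map in place of the trained classifier described above.

```python
# hypothetical keyword lexicon; the real system uses a trained text classifier
EMOTION_WORDS = {
    "happy": ["great", "love", "awesome"],
    "sad": ["tired", "miss", "alone"],
    "angry": ["hate", "annoying", "worst"],
}
EMOTION_COLORS = {"happy": "#ffd166", "sad": "#118ab2", "angry": "#ef476f"}

def predict_emotion(text):
    scores = {e: sum(w in text.lower() for w in ws)
              for e, ws in EMOTION_WORDS.items()}
    best = max(scores, key=scores.get)
    return best if scores[best] > 0 else "neutral"

def interface_theme(text):
    # adapt an audiovisual element (here: a color) to the predicted state
    return EMOTION_COLORS.get(predict_emotion(text), "#cccccc")

print(interface_theme("I love this track, awesome vibes"))   # '#ffd166'
```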
Citations: 1
Fashion Style-Aware Embeddings for Clothing Image Retrieval
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531433
Rino Naka, Marie Katsurai, Keisuke Yanagi, Ryosuke Goto
Clothing image retrieval is becoming increasingly important as users on social media grow to enjoy sharing their daily outfits. Most conventional methods offer single query-based retrieval and depend on visual features learnt via target classification training. This paper presents an embedding learning framework that uses novel style description features available on users' posts, allowing image-based and multiple choice-based queries for practical clothing image retrieval. Specifically, the proposed method exploits the following complementary information for representing fashion styles: season tags, style tags, users' heights, and silhouette descriptions. Then, we learn embeddings based on a quadruplet loss that considers the ranked pairings of the visual features and the proposed style description features, enabling flexible outfit search based on either of these two types of features as queries. Experiments conducted on WEAR posts demonstrated the effectiveness of the proposed method compared with several baseline methods.
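A generic quadruplet margin loss of the kind commonly used in metric learning is sketched below (PyTorch, toy embeddings); the abstract does not give the exact ranked-pairing formulation, so this is only a stand-in for the idea of training with four-way comparisons of visual and style-description features.

```python
import torch
import torch.nn.functional as F

def quadruplet_loss(anchor, pos, neg1, neg2, m1=1.0, m2=0.5):
    """Generic quadruplet margin loss: the first term is the usual triplet
    constraint, the second additionally pushes two negatives apart."""
    d_ap = F.pairwise_distance(anchor, pos)
    d_an = F.pairwise_distance(anchor, neg1)
    d_nn = F.pairwise_distance(neg1, neg2)
    return (F.relu(m1 + d_ap - d_an) + F.relu(m2 + d_ap - d_nn)).mean()

# toy 128-d embeddings of an outfit photo, a same-style photo, and two others
a, p, n1, n2 = (torch.randn(4, 128) for _ in range(4))
print(quadruplet_loss(a, p, n1, n2).item())
```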
Citations: 4
Weakly Supervised Pediatric Bone Age Assessment Using Ultrasonic Images via Automatic Anatomical RoI Detection
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531436
Yunyan Yan, Chuanbin Liu, Hongtao Xie, Sicheng Zhang, Zhendong Mao
Bone age assessment (BAA) is vital in pediatric clinical diagnosis. Existing deep learning methods predict bone age based on Region-of-Interest (RoI) detection or segmentation of hand radiographs, which requires expensive annotations. The imaging limitations and cost of radiography further hinder clinical application. Compared to X-ray images, ultrasonic images are clean, cheap, and flexible, but deep learning research on ultrasonic BAA remains largely unexplored. For this purpose, we propose a weakly supervised, interpretable framework entitled USB-Net, which utilizes ultrasonic pelvis images and only image-level age annotations. USB-Net consists of an automatic anatomical RoI detection stage and an age assessment stage. In the detection stage, USB-Net locates the discriminative anatomical RoIs of the pelvis through an attention heatmap without any extra RoI supervision. In the assessment stage, the cropped anatomical RoI patch is fed as fine-grained input to estimate age. In addition, we provide the first ultrasonic BAA dataset, composed of 1644 ultrasonic hip joint images with image-level labels of age and gender. The experimental results verify that our model attends consistently with human knowledge and achieves a mean absolute error (MAE) of 16.24 days on the USBAA dataset.
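A generic weakly supervised RoI extraction step, turning an attention or CAM-style heatmap into a bounding box by thresholding, is sketched below with NumPy; the threshold and scaling are illustrative and this is not USB-Net's exact procedure.

```python
import numpy as np

def heatmap_to_roi(heatmap, image_hw, thresh=0.6):
    """Normalize a low-resolution attention heatmap, keep the region above
    a threshold, and return its extent scaled to image coordinates."""
    h, w = heatmap.shape
    norm = (heatmap - heatmap.min()) / (heatmap.max() - heatmap.min() + 1e-8)
    ys, xs = np.where(norm >= thresh)
    if ys.size == 0:                        # no activation above threshold
        return 0, 0, image_hw[1], image_hw[0]
    sy, sx = image_hw[0] / h, image_hw[1] / w
    return (int(xs.min() * sx), int(ys.min() * sy),
            int((xs.max() + 1) * sx), int((ys.max() + 1) * sy))

cam = np.random.rand(16, 16)                # toy stand-in for an attention map
print(heatmap_to_roi(cam, image_hw=(480, 640)))   # (x1, y1, x2, y2)
```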
Citations: 0
Music-to-Dance Generation with Multiple Conformer
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531430
Mingao Zhang, Changhong Liu, Yong Chen, Zhenchun Lei, Mingwen Wang
Music-to-dance generation must consider both the kinematics of dance, which are highly complex and non-linear, and the connection between music and dance movement, which is far from deterministic. Existing approaches attempt to address the limited-creativity problem, but it remains a very challenging task. First, it is a long-term sequence-to-sequence task. Second, the extracted motion keypoints are noisy. Last, there exist local and global dependencies in both the music sequence and the dance motion sequence. To address these issues, we propose a novel autoregressive generative framework that predicts future motions based on past motions and music. This framework contains a music conformer, a motion conformer, and a cross-modal conformer: the conformers encode the music and motion sequences, and the cross-modal conformer is further adapted to the noisy dance motion data, enabling it to capture local and global dependencies among the sequences while reducing the effect of noisy data. Quantitative and qualitative experimental results on the publicly available music-to-dance dataset demonstrate that our method improves greatly upon the baselines and can generate long-term coherent dance motions well-coordinated with the music.
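To make the autoregressive setup concrete, the sketch below predicts the next pose from past poses and aligned music features in a loop, with plain GRU encoders standing in for the music, motion, and cross-modal conformers; all dimensions and module choices are assumptions for illustration only.

```python
import torch
import torch.nn as nn

class AutoregressiveDancer(nn.Module):
    """Minimal autoregressive predictor: fuse past motion with the music
    heard so far and predict the next pose, one step at a time."""

    def __init__(self, pose_dim=51, music_dim=35, hidden=128):
        super().__init__()
        self.motion_enc = nn.GRU(pose_dim, hidden, batch_first=True)
        self.music_enc = nn.GRU(music_dim, hidden, batch_first=True)
        self.head = nn.Linear(2 * hidden, pose_dim)

    @torch.no_grad()
    def generate(self, seed_poses, music, steps):
        poses = seed_poses                                    # (B, T0, pose_dim)
        for _ in range(steps):
            _, hm = self.motion_enc(poses)                    # (1, B, H)
            _, ha = self.music_enc(music[:, : poses.size(1) + 1])
            nxt = self.head(torch.cat([hm[-1], ha[-1]], dim=-1))
            poses = torch.cat([poses, nxt.unsqueeze(1)], dim=1)
        return poses

model = AutoregressiveDancer()
seed = torch.randn(1, 8, 51)             # e.g. 17 joints x 3 coordinates
music = torch.randn(1, 200, 35)          # e.g. MFCC-style acoustic features
print(model.generate(seed, music, steps=16).shape)    # torch.Size([1, 24, 51])
```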
Citations: 5
CLIP4Hashing: Unsupervised Deep Hashing for Cross-Modal Video-Text Retrieval
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531381
Yaoxin Zhuo, Yikang Li, Jenhao Hsiao, C. Ho, Baoxin Li
With the ever-increasing multimedia data on the Web, cross-modal video-text retrieval has received a lot of attention in recent years. Deep cross-modal hashing approaches utilize the Hamming space for achieving fast retrieval. However, most existing algorithms have difficulties in seeking or constructing a well-defined joint semantic space. In this paper, an unsupervised deep cross-modal video-text hashing approach (CLIP4Hashing) is proposed, which mitigates the difficulties in bridging between different modalities in the Hamming space through building a single hashing net by employing the pre-trained CLIP model. The approach is enhanced by two novel techniques, the dynamic weighting strategy and the design of the min-max hashing layer, which are found to be the main sources of the performance gain. Compared with conventional deep cross-modal hashing algorithms, CLIP4Hashing does not require data-specific hyper-parameters. With evaluation using three challenging video-text benchmark datasets, we demonstrate that CLIP4Hashing is able to significantly outperform existing state-of-the-art hashing algorithms. Additionally, with larger bit sizes (e.g., 2048 bits), CLIP4Hashing can even deliver competitive performance compared with the results based on non-hashing features.
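One plausible reading of a "min-max hashing layer" on top of frozen CLIP features is sketched below: min-max normalize each feature dimension across the batch and binarize at 0.5. This is an assumption based on the name alone; the paper's actual layer and its dynamic weighting strategy may differ.

```python
import torch

def min_max_hash(features, bits=None):
    """Min-max normalize each feature dimension across the batch, then
    binarize around 0.5 to obtain 0/1 hash codes."""
    lo = features.min(dim=0, keepdim=True).values
    hi = features.max(dim=0, keepdim=True).values
    norm = (features - lo) / (hi - lo + 1e-8)        # each dim in [0, 1]
    codes = (norm >= 0.5).float()
    return codes if bits is None else codes[:, :bits]

clip_feats = torch.randn(16, 512)                    # e.g. CLIP embeddings
codes = min_max_hash(clip_feats, bits=64)
print(codes.shape, codes.unique())                   # (16, 64), values {0, 1}
```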
Citations: 9