
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

Improve Image Captioning by Modeling Dynamic Scene Graph Extension
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531401
Minghao Geng, Qingjie Zhao
Recently, scene graph generation methods have been used in image captioning to encode objects and their relationships within the encoder-decoder framework, where the decoder selects part of the graph nodes as input for word inference. However, current methods attend to the scene graph based on ambiguous language information, neglecting the strong connections between scene graph nodes. In this paper, we propose a Scene Graph Extension (SGE) architecture that models dynamic scene graph extension using the partly generated sentence. Our model first uses the generated words and the previous attention results over scene graph nodes to form a partial scene graph. It then chooses objects or relationships that are closely connected to the generated graph to infer the next word. SGE is appealing in that it can be plugged into any scene-graph-based image captioning method. We conduct extensive experiments on the MSCOCO dataset. The results show that the proposed SGE significantly outperforms the baselines, achieving state-of-the-art performance under most metrics.
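To make the extension step above concrete, the sketch below shows one way a decoder state summarizing the already-generated words could attend over scene-graph node features while biasing toward nodes already connected to the partial graph. It is only an illustration of the general idea under assumed module names and dimensions (GraphNodeAttention, node_dim, and so on), not the authors' implementation.

```python
# Illustrative sketch only: one attention step that scores scene-graph nodes
# against the decoder's partial-sentence state, loosely following the idea of
# extending a partial graph while captioning. Names and shapes are assumptions.
import torch
import torch.nn as nn
import torch.nn.functional as F

class GraphNodeAttention(nn.Module):
    def __init__(self, node_dim=512, state_dim=512, attn_dim=256):
        super().__init__()
        self.proj_node = nn.Linear(node_dim, attn_dim)
        self.proj_state = nn.Linear(state_dim, attn_dim)
        self.score = nn.Linear(attn_dim, 1)

    def forward(self, node_feats, state, prev_attn):
        # node_feats: (N, node_dim) scene-graph node features (objects/relations)
        # state:      (state_dim,)  decoder state summarizing generated words
        # prev_attn:  (N,)          attention mass already given to each node
        e = torch.tanh(self.proj_node(node_feats) + self.proj_state(state))
        logits = self.score(e).squeeze(-1)
        # Bias toward nodes close to the already-attended (partial) graph.
        logits = logits + prev_attn
        weights = F.softmax(logits, dim=0)            # which node to expand next
        context = (weights.unsqueeze(-1) * node_feats).sum(dim=0)
        return context, weights

# Toy usage: 6 graph nodes, one decoding step.
attn = GraphNodeAttention()
ctx, w = attn(torch.randn(6, 512), torch.randn(512), torch.zeros(6))
```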
Citations: 0
Temporal-Consistent Visual Clue Attentive Network for Video-Based Person Re-Identification
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531362
Bingliang Jiao, Liying Gao, Peng Wang
Video-based person re-identification (ReID) aims to match video trajectories of pedestrians across multi-view cameras and has important applications in criminal investigation and intelligent surveillance. Compared with single-image re-identification, the abundant temporal information contained in video sequences allows pedestrian instances to be described more precisely and effectively. Recently, most video-based person ReID algorithms have made use of temporal information by fusing the diverse visual contents captured in independent frames. However, these algorithms only measure the salience of visual clues within each single frame, inevitably introducing momentary interference caused by factors such as occlusion. Therefore, in this work, we introduce a Temporal-consistent Visual Clue Attentive Network (TVCAN), which is designed to capture pedestrian content that is consistently salient across frames. Our TVCAN consists of two major modules, the TCSA module and the TCCA module, which are responsible for capturing and emphasizing consistently salient visual contents along the spatial dimension and the channel dimension, respectively. Extensive experiments verify the effectiveness of the designed modules. Additionally, our TVCAN outperforms all compared state-of-the-art methods on three mainstream benchmarks.
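As a rough illustration of emphasizing content that stays salient across a whole tracklet, the sketch below pools frame features over time and applies one shared channel gate to every frame. The module name, shapes, and reduction ratio are assumptions; the paper's TCSA and TCCA designs are not reproduced here.

```python
# Minimal sketch: a single channel gate computed from the whole frame sequence
# and shared across frames, so the same channels are emphasized in every frame.
import torch
import torch.nn as nn

class SharedChannelGate(nn.Module):
    def __init__(self, channels=256, reduction=8):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        # x: (T, C, H, W) features of T frames from one tracklet
        pooled = x.mean(dim=(0, 2, 3))        # pool over time and space -> (C,)
        gate = self.fc(pooled)                # one gate for the whole sequence
        return x * gate.view(1, -1, 1, 1)     # same channels emphasized per frame

feats = torch.randn(8, 256, 16, 8)            # 8 frames
out = SharedChannelGate()(feats)
```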
Citations: 1
The Impact of Dataset Splits on Classification Performance in Medical Videos
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531424
Markus Fox, Klaus Schoeffmann
The creation of datasets in medical imaging is a central topic of research, especially with the advances in deep learning over the past decade. Publications of such datasets typically report baseline results with one or more deep neural networks in the form of established performance metrics (e.g., F1-score or Jaccard index). Much work is then done trying to beat these baseline metrics in order to compare different neural architectures. However, the reported metrics are almost meaningless when the underlying data does not conform to specific standards. In order to better understand what standards we need, we have reproduced and analyzed a study of four medical image classification datasets in laparoscopy. With automated frame extraction from surgical videos, we find that the resulting images are far too similar and inflate evaluation metrics by design. We demonstrate this similarity with a basic SIFT algorithm that already produces high evaluation metrics on the original data. We confirm our hypothesis by creating and evaluating a video-based dataset split from the original images. The original network evaluated on the video-based split performs worse than our basic SIFT algorithm on the original data.
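The video-based split described above can be illustrated in a few lines of Python: frames are grouped by their video identifier, so no video contributes frames to both the training and the test set. The file-naming scheme and split ratio below are assumptions made only for illustration.

```python
# A minimal sketch of the video-based split idea: frames from the same surgical
# video must never appear in both the training and the test set.
import random
from collections import defaultdict

def video_based_split(frame_paths, test_ratio=0.2, seed=0):
    """frame_paths like 'video07/frame_0042.jpg'; split by video id, not by frame."""
    by_video = defaultdict(list)
    for p in frame_paths:
        by_video[p.split("/")[0]].append(p)
    videos = sorted(by_video)
    random.Random(seed).shuffle(videos)
    n_test = max(1, int(len(videos) * test_ratio))
    test_videos = set(videos[:n_test])
    train = [p for v in videos[n_test:] for p in by_video[v]]
    test = [p for v in test_videos for p in by_video[v]]
    return train, test

train, test = video_based_split([f"video{i:02d}/frame_{j:04d}.jpg"
                                 for i in range(10) for j in range(50)])
```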
Citations: 1
MuLER: Multiplet-Loss for Emotion Recognition
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531406
Anwer Slimi, M. Zrigui, H. Nicolas
With the rise of human-machine interaction, it has become necessary for machines to better understand humans in order to respond appropriately. Hence, to improve communication and interaction, it would be ideal for machines to automatically detect human emotions. Speech Emotion Recognition (SER) has been the focus of many studies in the past few years, but accuracy remains poor and must be improved. In our work, we propose a new loss function that encodes utterances instead of classifying them directly, as the majority of existing models do. The encoding is learned so that utterances with the same label have similar representations. The encoded speeches were tested on two datasets, and we achieved 88.19% accuracy on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset and 91.66% accuracy on the RML (Ryerson Multimedia Research Lab) dataset.
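One plausible reading of such an encoding objective is a pairwise pull/push loss: embeddings of utterances with the same emotion label are pulled together, while embeddings with different labels are pushed apart by a margin. The sketch below illustrates this reading only; the margin value and the exact form of the authors' multiplet loss are assumptions.

```python
# Hedged sketch of a "same label close, different label apart" objective.
import torch
import torch.nn.functional as F

def multiplet_loss(embeddings, labels, margin=1.0):
    # embeddings: (B, D) encoded utterances; labels: (B,) emotion ids
    dist = torch.cdist(embeddings, embeddings)            # pairwise distances
    same = labels.unsqueeze(0) == labels.unsqueeze(1)
    eye = torch.eye(len(labels), dtype=torch.bool)
    pull = dist[same & ~eye].pow(2).mean()                # same label -> close
    push = F.relu(margin - dist[~same]).pow(2).mean()     # different -> apart
    return pull + push

loss = multiplet_loss(torch.randn(16, 128), torch.randint(0, 8, (16,)))
```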
Citations: 1
An Effective Two-way Metapath Encoder over Heterogeneous Information Network for Recommendation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531402
Yanbin Jiang, Huifang Ma, Xiaohui Zhang, Zhixin Li, Liang Chang
Heterogeneous information networks (HINs) are widely used in recommender system research due to their ability to model complex auxiliary information beyond historical interactions, which alleviates the data sparsity problem. Existing HIN-based recommendation studies have achieved great success by performing graph convolution between pairs of nodes on predefined metapath-induced graphs, but they have two major limitations. First, existing heterogeneous network construction strategies tend to exploit item attributes while failing to effectively model user relations. Second, previous HIN-based recommendation models mainly convert the heterogeneous graph into homogeneous graphs by defining metapaths, ignoring the complicated relation dependencies along each metapath. To tackle these limitations, we propose a novel recommendation model with a two-way metapath encoder for top-N recommendation, which models metapath similarity and sequential relation dependency in the HIN to learn node representations. Specifically, our model first learns initial node representations through a pre-training module, and then identifies potential friends and item relations based on their similarity to construct a unified HIN. We then develop the two-way encoder module, with a similarity encoder and an instance encoder, to capture similarity-based collaborative signals and relational dependencies on different metapaths. Finally, the representations on different metapaths are aggregated through an attention fusion layer to yield rich representations. Extensive experiments on three real datasets demonstrate the effectiveness of our method.
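The general "aggregate per metapath, then fuse with attention" pattern underlying such encoders can be sketched as follows. It is a simplified stand-in for the paper's two-way similarity/instance encoder, with all module names, dimensions, and the mean aggregator assumed.

```python
# Illustrative sketch only: mean-aggregate a node's neighbors along each
# metapath, then fuse the per-metapath summaries with learned attention.
import torch
import torch.nn as nn
import torch.nn.functional as F

class MetapathFusion(nn.Module):
    def __init__(self, dim=64):
        super().__init__()
        self.attn = nn.Linear(dim, 1)

    def forward(self, node_feat, neighbors_per_metapath):
        # node_feat: (dim,); neighbors_per_metapath: list of (N_i, dim) tensors,
        # one tensor of neighbor embeddings per metapath (e.g. U-I-U, U-I-A-I).
        summaries = torch.stack(
            [nbrs.mean(dim=0) for nbrs in neighbors_per_metapath])    # (M, dim)
        weights = F.softmax(self.attn(summaries).squeeze(-1), dim=0)  # (M,)
        fused = (weights.unsqueeze(-1) * summaries).sum(dim=0)
        return node_feat + fused          # residual-style enrichment

fuser = MetapathFusion()
out = fuser(torch.randn(64), [torch.randn(5, 64), torch.randn(3, 64)])
```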
Citations: 4
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531404
Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras
In this work, we describe a new method for unsupervised video summarization. Existing unsupervised approaches suffer from the unstable training of generator-discriminator architectures, rely on RNNs to model long-range frame dependencies, and cannot easily parallelize the training of RNN-based network architectures; to overcome these limitations, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling the frames' dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks on the main diagonal of the attention matrix, and enriches the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates of the significance of different parts of the video and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show that the proposed method is competitive with other state-of-the-art unsupervised summarization approaches and produces video summaries that are very close to human preferences. An ablation study of the introduced components, namely concentrated attention combined with attention-based estimates of frame uniqueness and diversity, shows their relative contributions to the overall summarization performance.
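The concentrated-attention idea, restricting self-attention to non-overlapping blocks on the main diagonal of the attention matrix, can be sketched with a simple block mask as below. The block size and feature shapes are assumptions, and the uniqueness and diversity estimates are not included.

```python
# Small sketch: frames attend only to frames inside their own diagonal block.
import torch
import torch.nn.functional as F

def block_diagonal_attention(x, block_size=4):
    # x: (T, D) frame features
    T, D = x.shape
    scores = x @ x.t() / D ** 0.5                        # (T, T) similarity scores
    block_id = torch.arange(T) // block_size
    mask = block_id.unsqueeze(0) == block_id.unsqueeze(1)
    scores = scores.masked_fill(~mask, float("-inf"))    # hide other blocks
    weights = F.softmax(scores, dim=-1)
    return weights @ x                                   # (T, D) attended features

out = block_diagonal_attention(torch.randn(16, 32))
```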
Citations: 14
Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531396
Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo
Currently, many approaches to the sentence-query-based moment localization (SQML) task emphasize (inter-)modality interaction between the video and the language query via transformer-based cross-attention or contrastive learning. However, they still face two issues: 1) modality interaction can be unexpectedly friendly to modality-specific learning, which merely learns modality-specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes the time cues of the original video ambiguous. In this paper, we propose a modality synergy with spatio-temporal cue purification (MS2P) method for SQML to address these two issues. In particular, a conceptually simple modality synergy strategy keeps features modality-specific while absorbing complementary information from the other modality through a carefully designed cross-attention unit and non-contrastive learning. As a result, modality-specific semantics can be calibrated progressively in a safer way. To preserve the time cues of the original video, we further purify the video representation into spatial and temporal parts, enhancing localization resolution with two proposed light-weight sentence-aware filtering operations. Experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets show that our model outperforms state-of-the-art approaches by a large margin.
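As a loose illustration of a light-weight sentence-aware filtering operation, the sketch below derives a channel gate from the query-sentence embedding and applies it to per-frame video features. It shows only the gating idea; the paper's two filtering operations and its modality-synergy unit are not reproduced, and all dimensions are assumptions.

```python
# Hedged sketch: gate video feature channels with a query-derived vector so
# that content irrelevant to the sentence query is suppressed.
import torch
import torch.nn as nn

class SentenceAwareFilter(nn.Module):
    def __init__(self, video_dim=512, sent_dim=300):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(sent_dim, video_dim), nn.Sigmoid())

    def forward(self, video_feats, sentence_emb):
        # video_feats: (T, video_dim) frame features; sentence_emb: (sent_dim,)
        g = self.gate(sentence_emb)      # which channels the query cares about
        return video_feats * g           # suppress query-irrelevant content

filtered = SentenceAwareFilter()(torch.randn(64, 512), torch.randn(300))
```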
Citations: 3
Cross-Pixel Dependency with Boundary-Feature Transformation for Weakly Supervised Semantic Segmentation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531360
Yuhui Guo, Xun Liang, Tang Hui, Bo Wu, Xiangping Zheng
Weakly supervised semantic segmentation with image-level labels is a challenging problem that typically relies on the initial responses generated by a classification network to locate object regions. However, such initial responses only cover the most discriminative parts of the object and may incorrectly activate in background regions. To address this problem, we propose a Cross-pixel Dependency with Boundary-feature Transformation (CDBT) method for weakly supervised semantic segmentation. Specifically, we develop a boundary-feature transformation mechanism to build strong connections among pixels belonging to the same object and weak connections between different objects. Moreover, we design a cross-pixel dependency module to enhance the initial responses; it exploits contextual appearance information and refines the predictions of the current pixels through the relations of global channel pixels, thus generating higher-quality pseudo labels for training the semantic segmentation network. Extensive experiments on the PASCAL VOC 2012 segmentation benchmark demonstrate that our method outperforms state-of-the-art methods that use image-level labels as weak supervision.
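A generic version of propagating responses between related pixels can be sketched as an affinity-based refinement of an initial class activation map, so that pixels with similar features share activation. This mirrors the broad idea only, not the CDBT modules themselves; the feature source, temperature, and iteration count are assumptions.

```python
# Illustrative sketch: refine an initial response map with a pixel affinity
# matrix built from features, spreading activation between similar pixels.
import torch
import torch.nn.functional as F

def affinity_refine(cam, feats, iters=2):
    # cam:   (H, W)    initial response for one class
    # feats: (C, H, W) pixel features used to measure similarity
    C, H, W = feats.shape
    f = F.normalize(feats.view(C, H * W), dim=0)      # (C, HW) unit-norm columns
    affinity = F.softmax(f.t() @ f / 0.1, dim=-1)     # (HW, HW) row-normalized
    r = cam.view(H * W)
    for _ in range(iters):                            # propagate along affinities
        r = affinity @ r
    return r.view(H, W)

refined = affinity_refine(torch.rand(32, 32), torch.randn(64, 32, 32))
```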
Citations: 0
MMArt-ACM 2022: 5th Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531442
Naoko Nitta, Anita Hu, Kensuke Tobitani
In addition to classical art types like paintings and sculptures, new types of artworks have emerged with the advancement of deep learning, social platforms, media capturing devices, and media processing tools. Large volumes of machine- or user-generated content and professionally edited content are shared and disseminated on the Web. Novel multimedia artworks therefore emerge rapidly in the era of social media and big data. The ever-increasing number of illustrations, comics, and animations on the Web gives rise to challenges of automatic classification, indexing, and retrieval that have been studied widely in other areas but not necessarily for this emerging type of artwork. In addition to objective entities like objects, events, and scenes, studies of cognitive properties are emerging. Among the various kinds of computational cognitive analysis, this workshop focuses on attractiveness analysis. The topics of the accepted papers cover the affective analysis of texts, images, and music. The actual MMArt-ACM 2022 proceedings are available at: https://dl.acm.org/citation.cfm?id=3512730.
Citations: 0
MultiCLU: Multi-stage Context Learning and Utilization for Storefront Accessibility Detection and Evaluation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531361
X. Wang, Jiajun Chen, Hao Tang, Zhigang Zhu
In this work, a storefront accessibility image dataset is collected from Google Street View and labeled with three main objects for storefront accessibility: doors (store entrances), doorknobs (for accessing the entrances), and stairs (leading to the entrances). Then MultiCLU, a new multi-stage context learning and utilization approach, is proposed with four stages: Context in Labeling (CIL), Context in Training (CIT), Context in Detection (CID), and Context in Evaluation (CIE). The CIL stage automatically extends the label of each knob to include more local contextual information. In the CIT stage, a deep learning method projects the visual information extracted by a Faster R-CNN based object detector into the semantic space generated by a Graph Convolutional Network. The CID stage uses spatial relation reasoning between categories to refine the confidence scores. Finally, in the CIE stage, a new loose evaluation metric for storefront accessibility, especially for the knob category, is proposed to help blind and low-vision (BLV) users efficiently find estimated knob locations. Our experimental results show that the proposed MultiCLU framework achieves significantly better performance than the baseline Faster R-CNN detector, with +13.4% mAP and +15.8% recall. Our new evaluation metric also introduces a new way to evaluate storefront accessibility objects, which could benefit the BLV community in real life.
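Purely as a hypothetical illustration of what a "loose" check for small objects such as doorknobs could look like, the sketch below counts a prediction as correct when its center lies within a pixel tolerance of a ground-truth center, rather than requiring a strict IoU overlap. The tolerance and box format are assumptions, not the paper's actual CIE metric.

```python
# Hypothetical loose localization check: match by center distance, not IoU.
def center(box):
    x1, y1, x2, y2 = box
    return ((x1 + x2) / 2, (y1 + y2) / 2)

def loose_match(pred_box, gt_box, tol=20.0):
    # Boxes given as [x1, y1, x2, y2] in pixels; tol is the allowed center offset.
    (px, py), (gx, gy) = center(pred_box), center(gt_box)
    return ((px - gx) ** 2 + (py - gy) ** 2) ** 0.5 <= tol

print(loose_match([100, 100, 110, 112], [104, 98, 116, 110]))  # True
```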
Citations: 0