
Proceedings of the 2022 International Conference on Multimedia Retrieval: Latest Publications

Improving Image Captioning via Enhancing Dual-Side Context Awareness
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531379
Yi-Meng Gao, Ning Wang, Wei Suo, Mengyang Sun, Peifeng Wang
Recent work on visual question answering demonstrates that grid features can work as well as region features on vision-language tasks. In the meantime, transformer-based models and their variants have shown remarkable performance on image captioning. However, the missing object-contextual information caused by the single-granularity nature of grid features on the encoder side, as well as the missing future contextual information caused by the left-to-right decoding paradigm of the transformer decoder, remain unexplored. In this work, we tackle these two problems by enhancing contextual information on both sides: (i) on the encoder side, we propose a Context-Aware Self-Attention module, in which the keys/values are expanded with adjacent rectangular regions, each containing two or more aggregated grid features; this yields grid features of varying granularity that store adequate contextual information for objects of different scales. (ii) On the decoder side, we incorporate a dual-way decoding strategy, in which left-to-right and right-to-left decoding are conducted simultaneously and interactively, so that both past and future contextual information is used when generating the current word. Combining these two modules with a vanilla transformer, our Context-Aware Transformer (CATNet) achieves a new state of the art on the MSCOCO benchmark.
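To make the encoder-side idea concrete, below is a minimal PyTorch sketch of a self-attention layer whose keys and values are augmented with coarser features pooled from adjacent grid cells, so each query sees more than one granularity. The module name, the 2x2 average pooling, and all shapes are illustrative assumptions, not the authors' implementation of CATNet.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextAwareSelfAttention(nn.Module):
    """Illustrative sketch: queries come from single grid cells, while keys/values
    are the original grid features concatenated with 2x2-pooled (coarser) features,
    so each query can attend to context at more than one granularity."""
    def __init__(self, dim, pool_size=2):
        super().__init__()
        self.q = nn.Linear(dim, dim)
        self.k = nn.Linear(dim, dim)
        self.v = nn.Linear(dim, dim)
        self.pool = nn.AvgPool2d(pool_size)  # aggregates adjacent grid features
        self.scale = dim ** -0.5

    def forward(self, grid):                                   # grid: (B, D, H, W)
        fine = grid.flatten(2).transpose(1, 2)                 # (B, H*W, D) fine tokens
        coarse = self.pool(grid).flatten(2).transpose(1, 2)    # pooled region tokens
        kv = torch.cat([fine, coarse], dim=1)                  # multi-granularity keys/values
        q, k, v = self.q(fine), self.k(kv), self.v(kv)
        attn = F.softmax(q @ k.transpose(1, 2) * self.scale, dim=-1)
        return attn @ v                                        # (B, H*W, D)

# toy usage
out = ContextAwareSelfAttention(dim=512)(torch.randn(2, 512, 7, 7))
print(out.shape)  # torch.Size([2, 49, 512])
```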
Citations: 2
MFGAN: A Lightweight Fast Multi-task Multi-scale Feature-fusion Model based on GAN
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531410
Lijia Deng, Yu-dong Zhang
Cell segmentation and counting is a time-consuming yet important experimental step in traditional biomedical research. Many current counting methods require exact cell locations, but few cell datasets provide detailed object coordinates; most existing datasets offer only the total number of cells and a global segmentation labelling. To make more effective use of existing datasets, we divide the cell counting task into two subtasks: cell number prediction and cell segmentation. This paper proposes MFGAN, a lightweight fast multi-task multi-scale feature-fusion model based on generative adversarial networks. To coordinate the learning of these two tasks, we propose a Combined Hybrid Loss function (CH Loss) and use a conditional GAN to train our network. We also propose a Lightweight Fast Multitask Generator (LFMG), which reduces the number of parameters by 20% compared with U-Net while achieving better cell segmentation, and we use multi-scale feature fusion to improve the quality of the reconstructed segmentation images. In addition, we propose a Structure Fusion Discrimination (SFD) scheme to refine the accuracy of feature details. Our method achieves non-point-based counting that no longer requires the exact position of each cell to be annotated during training, and it delivers excellent results on both cell counting and cell segmentation.
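The abstract names the Combined Hybrid Loss but does not give its exact form; the sketch below simply combines the three ingredients it mentions (a segmentation term, a count-regression term, and a conditional-adversarial term) with assumed weights. The function name, the weights, and the choice of individual terms are all assumptions.

```python
import torch
import torch.nn.functional as F

def combined_hybrid_loss(pred_mask, gt_mask, pred_count, gt_count,
                         disc_fake_logits, w_seg=1.0, w_cnt=0.1, w_adv=0.01):
    """Illustrative stand-in for the CH Loss: the paper's exact terms and weights
    are not given in the abstract, so this just combines the three tasks it
    mentions (segmentation, count prediction, adversarial training)."""
    seg = F.binary_cross_entropy_with_logits(pred_mask, gt_mask)   # global mask label
    cnt = F.mse_loss(pred_count, gt_count)                         # total cell count
    adv = F.binary_cross_entropy_with_logits(                      # fool the discriminator
        disc_fake_logits, torch.ones_like(disc_fake_logits))
    return w_seg * seg + w_cnt * cnt + w_adv * adv

# toy usage with random tensors
loss = combined_hybrid_loss(torch.randn(2, 1, 64, 64), torch.rand(2, 1, 64, 64),
                            torch.rand(2) * 100, torch.rand(2) * 100,
                            torch.randn(2, 1))
```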
Citations: 0
Disentangled Representations and Hierarchical Refinement of Multi-Granularity Features for Text-to-Image Synthesis
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531389
Pei Dong, L. Wu, Lei Meng, Xiangxu Meng
In this paper, we focus on generating photo-realistic images from given text descriptions. Current methods first generate an initial image and then progressively refine it into a high-resolution one, typically refining all granularity features output by the previous stage indiscriminately. However, the ability to express different granularity features is not consistent across stages, and it is difficult to express precise semantics by further refining poor-quality features generated in the previous stage. Because current methods cannot refine different granularity features independently, it is challenging to clearly express all semantic factors in the generated image, and some features even become worse. To address this issue, we propose a Hierarchical Disentangled Representations Generative Adversarial Network (HDR-GAN) that generates photo-realistic images by explicitly disentangling and individually modeling the semantic factors in the image. HDR-GAN introduces a multi-granularity feature disentangled encoder that represents image information comprehensively by explicitly disentangling multi-granularity features including pose, shape, and texture. Moreover, we develop a Multi-granularity Feature Refinement (MFR) scheme containing a Coarse-grained Feature Refinement (CFR) model and a Fine-grained Feature Refinement (FFR) model: CFR utilizes coarse-grained disentangled representations (e.g., pose and shape) to clarify category information, while FFR employs fine-grained disentangled representations (e.g., texture) to reflect instance-level details. Extensive experiments on two well-studied and publicly available datasets (CUB-200 and CLEVR-SV) demonstrate the rationality and superiority of our method.
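A hedged skeleton of the two-stage refinement idea is shown below: three linear heads stand in for the disentangled pose/shape/texture factors, the coarse stage consumes pose and shape, and the fine stage consumes texture. All module names, dimensions, and the use of plain linear layers are assumptions; the real HDR-GAN generator and discriminators are far richer than this sketch.

```python
import torch
import torch.nn as nn

class DisentangledRefiner(nn.Module):
    """Skeleton only: three heads stand in for the multi-granularity disentangled
    encoder (pose/shape/texture); the coarse refinement consumes pose+shape, the
    fine refinement consumes texture on top of the coarse result."""
    def __init__(self, text_dim=256, feat_dim=128):
        super().__init__()
        self.pose = nn.Linear(text_dim, feat_dim)
        self.shape = nn.Linear(text_dim, feat_dim)
        self.texture = nn.Linear(text_dim, feat_dim)
        self.cfr = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())  # coarse stage
        self.ffr = nn.Sequential(nn.Linear(2 * feat_dim, feat_dim), nn.ReLU())  # fine stage

    def forward(self, text_emb):
        pose, shape, texture = self.pose(text_emb), self.shape(text_emb), self.texture(text_emb)
        coarse = self.cfr(torch.cat([pose, shape], dim=-1))    # category-level structure
        fine = self.ffr(torch.cat([coarse, texture], dim=-1))  # instance-level detail
        return coarse, fine

coarse, fine = DisentangledRefiner()(torch.randn(4, 256))
```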
Citations: 5
MuLER: Multiplet-Loss for Emotion Recognition
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531406
Anwer Slimi, M. Zrigui, H. Nicolas
With the rise of human-machine interaction, it has become necessary for machines to better understand humans in order to respond appropriately; ideally, machines would detect human emotions automatically. Speech Emotion Recognition (SER) has been the focus of many studies in the past few years, but accuracy remains limited and must be improved. In our work, we propose a new loss function that encodes speech utterances instead of classifying them directly, as the majority of existing models do. The encoding is learned so that utterances with the same label have similar representations. The encoded speeches were tested on two datasets, reaching 88.19% accuracy on the RAVDESS (Ryerson Audio-Visual Database of Emotional Speech and Song) dataset and 91.66% accuracy on the RML (Ryerson Multimedia Research Lab) dataset.
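The abstract states that utterances sharing a label should receive similar encodings; the sketch below shows one generic pull-together/push-apart embedding loss in that spirit. It is not the paper's multiplet formulation, and the function name, margin, and pair weighting are assumptions.

```python
import torch
import torch.nn.functional as F

def same_label_embedding_loss(embeddings, labels, margin=1.0):
    """Hedged stand-in for a multiplet-style objective: pairs sharing an emotion
    label are pulled together, pairs with different labels are pushed at least
    `margin` apart. The paper's exact loss may differ."""
    dist = torch.cdist(embeddings, embeddings)                    # pairwise L2 distances
    same = labels.unsqueeze(0).eq(labels.unsqueeze(1)).float()    # 1 if same label
    eye = torch.eye(len(labels), device=embeddings.device)
    pull = (dist * same * (1 - eye)).sum() / ((same - eye).sum() + 1e-8)
    push = (F.relu(margin - dist) * (1 - same)).sum() / ((1 - same).sum() + 1e-8)
    return pull + push

emb = F.normalize(torch.randn(8, 64), dim=1)                      # toy utterance embeddings
loss = same_label_embedding_loss(emb, torch.randint(0, 4, (8,)))
```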
Citations: 1
An Effective Two-way Metapath Encoder over Heterogeneous Information Network for Recommendation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531402
Yanbin Jiang, Huifang Ma, Xiaohui Zhang, Zhixin Li, Liang Chang
Heterogeneous information networks (HINs) are widely used in recommender system research due to their ability to model complex auxiliary information beyond historical interactions, alleviating the data sparsity problem. Existing HIN-based recommendation studies have achieved great success by performing graph convolution between pairs of nodes on predefined metapath-induced graphs, but they have two major limitations. First, existing heterogeneous network construction strategies tend to exploit item attributes while failing to effectively model user relations. In addition, previous HIN-based recommendation models mainly convert the heterogeneous graph into homogeneous graphs by defining metapaths, ignoring the complicated relation dependencies along each metapath. To tackle these limitations, we propose a novel recommendation model with a two-way metapath encoder for top-N recommendation, which models metapath similarity and sequential relation dependency in the HIN to learn node representations. Specifically, our model first learns initial node representations through a pre-training module, and then identifies potential friends and item relations based on similarity to construct a unified HIN. We then develop a two-way encoder module with a similarity encoder and an instance encoder to capture similarity-based collaborative signals and relational dependencies on different metapaths. Finally, the representations on different metapaths are aggregated through an attention fusion layer to yield rich representations. Extensive experiments on three real datasets demonstrate the effectiveness of our method.
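As an illustration of the final fusion step, here is a hedged sketch of an attention fusion layer that scores each metapath-specific node embedding with a learnable query and returns their weighted sum, in the style of semantic attention over metapaths. The class name, hidden size, and scoring scheme are assumptions, not the paper's exact layer.

```python
import torch
import torch.nn as nn

class MetapathAttentionFusion(nn.Module):
    """Illustrative attention-fusion layer: scores each metapath-specific node
    embedding with a shared learnable query and returns the weighted sum. It only
    mirrors the fusion step described in the abstract, not the full model."""
    def __init__(self, dim, hidden=64):
        super().__init__()
        self.proj = nn.Sequential(nn.Linear(dim, hidden), nn.Tanh())
        self.query = nn.Linear(hidden, 1, bias=False)

    def forward(self, metapath_embs):                               # (P, N, dim)
        scores = self.query(self.proj(metapath_embs)).mean(dim=1)   # (P, 1) per-metapath score
        weights = torch.softmax(scores, dim=0).unsqueeze(-1)        # (P, 1, 1)
        return (weights * metapath_embs).sum(dim=0)                 # (N, dim) fused node embeddings

fused = MetapathAttentionFusion(dim=128)(torch.randn(3, 100, 128))  # 3 metapaths, 100 nodes
```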
Citations: 4
Summarizing Videos using Concentrated Attention and Considering the Uniqueness and Diversity of the Video Frames
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531404
Evlampios Apostolidis, Georgios Balaouras, V. Mezaris, I. Patras
In this work, we describe a new method for unsupervised video summarization. To overcome limitations of existing unsupervised approaches, which relate to the unstable training of generator-discriminator architectures, the use of RNNs for modeling long-range frame dependencies, and the difficulty of parallelizing the training of RNN-based architectures, the developed method relies solely on a self-attention mechanism to estimate the importance of video frames. Instead of simply modeling frame dependencies with global attention, our method integrates a concentrated attention mechanism that focuses on non-overlapping blocks along the main diagonal of the attention matrix, and enriches the existing information by extracting and exploiting knowledge about the uniqueness and diversity of the associated video frames. In this way, our method makes better estimates of the significance of different parts of the video and drastically reduces the number of learnable parameters. Experimental evaluations on two benchmark datasets (SumMe and TVSum) show the competitiveness of the proposed method against other state-of-the-art unsupervised summarization approaches and demonstrate its ability to produce video summaries that are very close to human preferences. An ablation study focusing on the introduced components, namely the use of concentrated attention in combination with attention-based estimates of the frames' uniqueness and diversity, shows their relative contributions to the overall summarization performance.
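A minimal sketch of the concentrated attention idea follows: ordinary scaled-dot self-attention over frame features, masked so that each frame only attends within its own non-overlapping block on the main diagonal of the attention matrix. The block size, the single-head formulation, and the function name are assumptions.

```python
import torch
import torch.nn.functional as F

def concentrated_attention(frame_feats, block_size=30):
    """Hedged sketch: scaled-dot self-attention over frame features with a mask that
    keeps only non-overlapping blocks on the main diagonal of the attention matrix,
    so each frame attends to its local temporal neighbourhood only."""
    T, D = frame_feats.shape
    blocks = torch.arange(T) // block_size                 # block index of each frame
    mask = blocks.unsqueeze(0).eq(blocks.unsqueeze(1))     # (T, T) block-diagonal mask
    scores = frame_feats @ frame_feats.t() / D ** 0.5
    scores = scores.masked_fill(~mask, float('-inf'))      # forbid cross-block attention
    return F.softmax(scores, dim=-1) @ frame_feats         # (T, D) attended frame features

attended = concentrated_attention(torch.randn(300, 1024))  # 300 frames, 1024-d features
```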
Citations: 14
Joint Modality Synergy and Spatio-temporal Cue Purification for Moment Localization
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531396
Xingyu Shen, L. Lan, Huibin Tan, Xiang Zhang, X. Ma, Zhigang Luo
Currently, many approaches to the sentence-query-based moment localization (SQML) task emphasize (inter-)modality interaction between the video and the language query via transformer-based cross-attention or contrastive learning. However, they can still face two issues: 1) modality interaction can unexpectedly collapse into modality-specific learning that merely captures modality-specific patterns, and 2) modality interaction easily confuses spatio-temporal cues and ultimately makes the time cues of the original video ambiguous. In this paper, we propose a Modality Synergy with Spatio-temporal cue Purification method (MS2P) for SQML to address these two issues. In particular, a conceptually simple modality synergy strategy is explored to keep features modality-specific while absorbing complementary information from the other modality through a carefully designed cross-attention unit and non-contrastive learning. As a result, modality-specific semantics can be calibrated progressively in a safer way. To preserve time cues in the original video, we further purify the video representation into spatial and temporal parts to enhance localization resolution via two proposed lightweight sentence-aware filtering operations. Experiments on the Charades-STA, TACoS, and ActivityNet Captions datasets show that our model outperforms state-of-the-art approaches by a large margin.
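To illustrate the modality synergy pattern, here is a hedged sketch of a cross-attention unit in which sentence tokens attend over video features and absorb complementary context through a residual connection, keeping the output anchored to the modality-specific query features. The class name, dimensions, and the residual-plus-LayerNorm design are assumptions rather than the MS2P architecture.

```python
import torch
import torch.nn as nn

class CrossModalSynergyUnit(nn.Module):
    """Illustrative cross-attention unit: sentence tokens attend over video frames
    and absorb complementary visual context via a residual connection, so the output
    stays anchored to the original (modality-specific) sentence features. This is
    only a sketch of the interaction pattern described in the abstract."""
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, query_feats, video_feats):
        ctx, _ = self.attn(query_feats, video_feats, video_feats)  # sentence attends to video
        return self.norm(query_feats + ctx)                        # residual keeps modality-specific part

unit = CrossModalSynergyUnit()
out = unit(torch.randn(2, 12, 256), torch.randn(2, 128, 256))      # 12 words, 128 frames
```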
Citations: 3
Cross-Pixel Dependency with Boundary-Feature Transformation for Weakly Supervised Semantic Segmentation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531360
Yuhui Guo, Xun Liang, Tang Hui, Bo Wu, Xiangping Zheng
Weakly supervised semantic segmentation with image-level labels is a challenging problem that typically relies on the initial responses generated by a classification network to locate object regions. However, such initial responses cover only the most discriminative parts of the object and may incorrectly activate in background regions. To address this problem, we propose a Cross-pixel Dependency with Boundary-feature Transformation (CDBT) method for weakly supervised semantic segmentation. Specifically, we develop a boundary-feature transformation mechanism that builds strong connections among pixels belonging to the same object but weak connections across different objects. Moreover, we design a cross-pixel dependency module to enhance the initial responses; it exploits contextual appearance information and refines the predictions of individual pixels using the relations among pixels across global channels, thus generating higher-quality pseudo labels for training the semantic segmentation network. Extensive experiments on the PASCAL VOC 2012 segmentation benchmark demonstrate that our method outperforms state-of-the-art methods that use image-level labels as weak supervision.
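One common way to realize this kind of refinement is to propagate the initial responses with a pixel-affinity matrix computed from feature similarity; the sketch below shows that generic idea only, under the assumption of cosine-similarity affinities, and is not the CDBT module itself.

```python
import torch
import torch.nn.functional as F

def refine_with_pixel_affinity(cam, feats):
    """Hedged sketch of the refinement idea: an initial response map (e.g. a CAM)
    is propagated with an affinity matrix built from pairwise pixel-feature
    similarity, so activations spread to related pixels beyond the most
    discriminative parts. Not the paper's exact module."""
    B, C, H, W = cam.shape
    f = F.normalize(feats.flatten(2), dim=1)                 # (B, D, H*W) unit-norm pixel features
    affinity = F.softmax(f.transpose(1, 2) @ f, dim=-1)      # (B, HW, HW) cross-pixel weights
    refined = cam.flatten(2) @ affinity.transpose(1, 2)      # propagate responses across pixels
    return refined.view(B, C, H, W)

refined = refine_with_pixel_affinity(torch.rand(2, 21, 32, 32), torch.randn(2, 256, 32, 32))
```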
Citations: 0
MMArt-ACM 2022: 5th Joint Workshop on Multimedia Artworks Analysis and Attractiveness Computing in Multimedia
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531442
Naoko Nitta, Anita Hu, Kensuke Tobitani
In addition to classical art forms like paintings and sculptures, new types of artworks are emerging with the advancement of deep learning, social platforms, media capturing devices, and media processing tools. Large volumes of machine-/user-generated or professionally edited content are shared and disseminated on the Web, so novel multimedia artworks emerge rapidly in the era of social media and big data. The ever-increasing amount of illustrations, comics, and animations on these platforms gives rise to challenges of automatic classification, indexing, and retrieval that have been studied widely in other areas but not necessarily for this emerging type of artwork. Beyond objective entities such as objects, events, and scenes, studies of cognitive properties are also emerging. Among the various kinds of computational cognitive analyses, this workshop focuses on attractiveness analysis. The topics of the accepted papers cover the affective analysis of texts, images, and music. The actual MMArt-ACM 2022 Proceedings are available at: https://dl.acm.org/citation.cfm?id=3512730.
Citations: 0
MultiCLU: Multi-stage Context Learning and Utilization for Storefront Accessibility Detection and Evaluation
Pub Date : 2022-06-27 DOI: 10.1145/3512527.3531361
X. Wang, Jiajun Chen, Hao Tang, Zhigang Zhu
In this work, a storefront accessibility image dataset is collected from Google Street View and labeled with three main objects for storefront accessibility: doors (store entrances), doorknobs (for accessing the entrances), and stairs (leading to the entrances). MultiCLU, a new multi-stage context learning and utilization approach, is then proposed with four stages: Context in Labeling (CIL), Context in Training (CIT), Context in Detection (CID), and Context in Evaluation (CIE). The CIL stage automatically extends the label for each knob to include more local contextual information. In the CIT stage, a deep learning method projects the visual information extracted by a Faster R-CNN based object detector into the semantic space generated by a Graph Convolutional Network. The CID stage uses spatial relation reasoning between categories to refine confidence scores. Finally, the CIE stage proposes a new loose evaluation metric for storefront accessibility, especially for the knob category, to efficiently help BLV (blind and low-vision) users find estimated knob locations. Our experimental results show that the proposed MultiCLU framework achieves significantly better performance than the baseline Faster R-CNN detector, with gains of 13.4% in mAP and 15.8% in recall. Our new evaluation metric also introduces a new way to evaluate storefront accessibility objects, which could benefit the BLV community in real life.
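As a toy illustration of the CIL stage, the function below enlarges a doorknob bounding box around its centre so the label also covers nearby door context. The scale factor, the clipping to image bounds, and the function name are hypothetical; the abstract does not specify MultiCLU's actual extension rule.

```python
def extend_knob_box(box, scale=3.0, img_w=1920, img_h=1080):
    """Hypothetical illustration of the CIL idea: grow a doorknob box (x1, y1, x2, y2)
    around its centre so the label also covers nearby door context. The actual
    extension rule used by MultiCLU is not given in the abstract."""
    x1, y1, x2, y2 = box
    cx, cy = (x1 + x2) / 2, (y1 + y2) / 2
    half_w, half_h = (x2 - x1) * scale / 2, (y2 - y1) * scale / 2
    return (max(0, cx - half_w), max(0, cy - half_h),
            min(img_w, cx + half_w), min(img_h, cy + half_h))

print(extend_knob_box((950, 520, 970, 545)))  # expanded label region around a knob
```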
Citations: 0