Jinpeng Chen, Yuan Cao, Fan Zhang, Pengfei Sun, Kaimin Wei
The next-item recommendation problem has received increasing attention from researchers in recent years. Existing algorithms focus mainly on the binary user-item relationship, ignore implicit item semantic information, and suffer from high data sparsity. Inspired by the fact that a user's decision-making process is often influenced by both intention and preference, this paper presents a SequentiAl inTentiOn-aware Recommender based on a user Interaction graph (Satori). In Satori, we first use a novel user interaction graph to construct relationships between users, items, and categories. Then, we leverage a graph attention network to extract auxiliary features on the graph and generate the three embeddings. Next, we adopt a self-attention mechanism to model user intention and preference separately, which are later combined to form a hybrid user representation. Finally, the hybrid user representation and the previously obtained item representation are both sent to the prediction module to calculate the predicted item score. Experiments on real-world datasets show that our approach outperforms state-of-the-art methods.
{"title":"Sequential Intention-aware Recommender based on User Interaction Graph","authors":"Jinpeng Chen, Yuan Cao, Fan Zhang, Pengfei Sun, Kaimin Wei","doi":"10.1145/3512527.3531390","DOIUrl":"https://doi.org/10.1145/3512527.3531390","url":null,"abstract":"The next-item recommendation problem has received more and more attention from researchers in recent years. Ignoring the implicit item semantic information, existing algorithms focus more on the user-item binary relationship and suffer from high data sparsity. Inspired by the fact that user's decision-making process is often influenced by both intention and preference, this paper presents a SequentiAl inTentiOn-aware Recommender based on a user Interaction graph (Satori). In Satori, we first use a novel user interaction graph to construct relationships between users, items, and categories. Then, we leverage a graph attention network to extract auxiliary features on the graph and generate the three embeddings. Next, we adopt self-attention mechanism to model user intention and preference respectively which are later combined to form a hybrid user representation. Finally, the hybrid user representation and previously obtained item representation are both sent to the prediction modul to calculate the predicted item score. Testing on real-world datasets, the results prove that our approach outperforms state-of-the-art methods.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131644133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional Chinese painting is a unique form of artistic expression. Compared with Western painting, it pays more attention to verve than to visual detail; ink wash painting in particular makes extensive use of lines and pays little attention to information such as texture. Some style transfer methods have recently begun to apply traditional Chinese painting styles (such as the ink wash style) to photorealistic images. However, ink stylization of different types of real-world photos with these methods has limitations: when the input images contain animal types not seen in the training set, the generated results retain semantic features of the training data, resulting in distortion. Therefore, in this paper, we attempt to separate the feature representations for styles and contents and propose a style-woven attention network to achieve zero-shot ink wash painting style transfer. Our model learns to disentangle the data representations in an unsupervised fashion and capture the semantic correlations of content and style. In addition, an ink style loss is added to improve the learning ability of the style encoder. To verify the ability of ink wash stylization, we augmented the publicly available ChipPhi dataset. Extensive experiments on a wide validation set show that our method achieves state-of-the-art results.
{"title":"Style-woven Attention Network for Zero-shot Ink Wash Painting Style Transfer","authors":"Haochen Sun, L. Wu, Xiang Li, Xiangxu Meng","doi":"10.1145/3512527.3531391","DOIUrl":"https://doi.org/10.1145/3512527.3531391","url":null,"abstract":"Traditional Chinese painting is a unique form of artistic expression. Compared with western art painting, it pays more attention to the verve in visual effect, especially ink painting, which makes good use of lines and pays little attention to information such as texture. Some style transfer methods have recently begun to apply traditional Chinese painting style (such as ink wash style) to photorealistic. Ink stylization of different types of real-world photos in a dataset using these style transfer methods has some limitations. When the input images are animal types that have not been seen in the training set, the generated results retain some semantic features of the data in the training set, resulting in distortion. Therefore, in this paper, we attempt to separate the feature representations for styles and contents and propose a style-woven attention network to achieve zero-shot ink wash painting style transfer. Our model learns to disentangle the data representations in an unsupervised fashion and capture the semantic correlations of content and style. In addition, an ink style loss is added to improve the learning ability of the style encoder. In order to verify the ability of ink wash stylization, we augmented the publicly available dataset $ChipPhi$. Extensive experiments based on a wide validation set prove that our method achieves state-of-the-art results.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128165004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual place recognition (VPR) aims to estimate the geographical location of a query image by finding its nearest reference images in a large geo-tagged database. Most existing methods adopt convolutional neural networks to extract feature maps from images. Nevertheless, such feature maps are high-dimensional tensors, and it is challenging to effectively aggregate them into a compact vector representation for efficient retrieval. To tackle this challenge, we develop an end-to-end convolutional neural network architecture named DMPCANet. The network adopts a regional pooling module to generate feature tensors of the same size from images of different sizes. The core component of our network, the Differentiable Multilinear Principal Component Analysis (DMPCA) module, acts directly on tensor data and utilizes convolution operations to generate projection matrices for dimensionality reduction, reducing the dimensionality to one sixteenth. This module preserves crucial information while reducing data dimensions. Experiments on two widely used place recognition datasets demonstrate that our proposed DMPCANet can generate low-dimensional discriminative global descriptors and achieve state-of-the-art results.
{"title":"DMPCANet: A Low Dimensional Aggregation Network for Visual Place Recognition","authors":"Yinghao Wang, Haonan Chen, Jiong Wang, Yingying Zhu","doi":"10.1145/3512527.3531427","DOIUrl":"https://doi.org/10.1145/3512527.3531427","url":null,"abstract":"Visual place recognition (VPR) aims to estimate the geographical location of a query image by finding its nearest reference images from a large geo-tagged database. Most of the existing methods adopt convolutional neural networks to extract feature maps from images. Nevertheless, such feature maps are high-dimensional tensors, and it is a challenge to effectively aggregate them into a compact vector representation for efficient retrieval. To tackle this challenge, we develop an end-to-end convolutional neural network architecture named DMPCANet. The network adopts the regional pooling module to generate feature tensors of the same size from images of different sizes. The core component of our network, the Differentiable Multilinear Principal Component Analysis (DMPCA) module, directly acts on tensor data and utilizes convolution operations to generate projection matrices for dimensionality reduction, thereby reducing the dimensionality to one sixteenth. This module can preserve crucial information while reducing data dimensions. Experiments on two widely used place recognition datasets demonstrate that our proposed DMPCANet can generate low-dimensional discriminative global descriptors and achieve the state-of-the-art results.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114174903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly efficient point cloud compression (PCC) techniques are necessary for various practical 3D applications, such as autonomous driving, holographic transmission, and virtual reality. The sparse and unordered nature of point clouds makes it challenging to design compression frameworks. In this paper, we present a new model, called TransPCC, that adopts a fully Transformer-based auto-encoder architecture for deep Point Cloud Compression. Taking the input point cloud as a set in continuous space with learnable position embeddings, we employ self-attention layers and the necessary point-wise operations for point cloud compression. The self-attention-based architecture enables our model to better learn point-wise dependency information. Experimental results show that our method outperforms state-of-the-art methods on a large-scale point cloud dataset.
{"title":"TransPCC: Towards Deep Point Cloud Compression via Transformers","authors":"Zujie Liang, Fan Liang","doi":"10.1145/3512527.3531423","DOIUrl":"https://doi.org/10.1145/3512527.3531423","url":null,"abstract":"High-efficient point cloud compression (PCC) techniques are necessary for various 3D practical applications, such as autonomous driving, holographic transmission, virtual reality, etc. The sparsity and disorder nature make it challenging to design frameworks for point cloud compression. In this paper, we present a new model, called TransPCC that adopts a fully Transformer auto-encoder architecture for deep Point Cloud Compression. By taking the input point cloud as a set in continuous space with learnable position embeddings, we employ the self-attention layers and necessary point-wise operations for point cloud compression. The self-attention based architecture enables our model to better learn point-wise dependency information for point cloud compression. Experimental results show that our method outperforms state-of-the-art methods on large-scale point cloud dataset.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123415071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu
This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval, where the text contains descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform fine-grained alignment of the event-dense text with the image. Our proposed approach incorporates event-oriented features to enhance cross-modal alignment, and applies event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event with a Transformer and combine it with the identified key event elements to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates event-based cross-modal alignment. Finally, we integrate the text and image embeddings with a loss optimization empowered by the event tag, iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach, with a 23.3% improvement in top-1 Recall for image-to-recipe retrieval on the Recipe1M 10k test set.
{"title":"Cross-Modal Retrieval between Event-Dense Text and Image","authors":"Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu","doi":"10.1145/3512527.3531374","DOIUrl":"https://doi.org/10.1145/3512527.3531374","url":null,"abstract":"This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121117803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal machine translation (MMT) aims to augment linguistic machine translation frameworks by incorporating aligned visual information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on alignment in the space of bilingual text or emphasized the combination of one side of the text with the given image. In this work, we entertain the possibility of a triplet alignment among the source text, the target text, and the image instance. In particular, we propose a Multi-aspect AlignmenT (MAT) model that augments the MMT task with three sub-tasks, namely cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. At the core of this model is a hybrid vocabulary that compiles the occurrences of visually depictable entities (nouns) on both sides of the text as well as the object labels detected in the images. Through these sub-tasks, we postulate that MAT manages to further align the modalities by casting the three instances into a shared domain, as compared against previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approach, which achieves several state-of-the-art results on two benchmark datasets for the MMT task.
{"title":"HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment","authors":"Ru Peng, Yawen Zeng, J. Zhao","doi":"10.1145/3512527.3531386","DOIUrl":"https://doi.org/10.1145/3512527.3531386","url":null,"abstract":"Multi-modal machine translation (MMT) aims to augment the linguistic machine translation frameworks by incorporating aligned vision information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on a methodological alignment in the space of bilingual text or emphasized the combination of the one-sided text and given image. In this work, we entertain the possibility of a triplet alignment, among the source and target text together with the image instance. In particular, we propose Multi-aspect AlignmenT (MAT) model that augments the MMT tasks to three sub-tasks --- namely cross-language translation alignment, cross-modal captioning alignment and multi-modal hybrid alignment tasks. Core to this model consists of a hybrid vocabulary which compiles the visually depictable entity (nouns) occurrence on both sides of the text as well as the detected object labels appearing in the images. Through this sub-task, we postulate that MAT manages to further align the modalities by casting three instances into a shared domain, as compared against previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approaches, which achieve several state-of-the-art results on two benchmark datasets of the MMT task.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121332006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kangning Yang, Benjamin Tag, Yue Gu, Chaofan Wang, Tilman Dingler, G. Wadley, Jorge Gonçalves
Recognising and monitoring emotional states play a crucial role in mental health and well-being management. Importantly, with the widespread adoption of smart mobile and wearable devices, it has become easier to collect long-term and granular emotion-related physiological data passively, continuously, and remotely. This creates new opportunities to help individuals manage their emotions and well-being in a less intrusive manner using off-the-shelf low-cost devices. Pervasive emotion recognition based on physiological signals is, however, still challenging due to the difficulty of efficiently extracting high-order correlations between physiological signals and users' emotional states. In this paper, we propose a novel end-to-end emotion recognition system based on a convolution-augmented transformer architecture. Specifically, it recognises users' emotions on the dimensions of arousal and valence by learning both global and local fine-grained associations and dependencies within and across multimodal physiological data (including blood volume pulse, electrodermal activity, heart rate, and skin temperature). We extensively evaluated the performance of our model on the K-EmoCon dataset, which was acquired in naturalistic conversations using off-the-shelf devices and contains spontaneous emotion data. Our results demonstrate that our approach outperforms the baselines and achieves state-of-the-art or competitive performance. We also demonstrate the effectiveness and generalizability of our system on another affective dataset that used affect inducement and commercial physiological sensors.
{"title":"Mobile Emotion Recognition via Multiple Physiological Signals using Convolution-augmented Transformer","authors":"Kangning Yang, Benjamin Tag, Yue Gu, Chaofan Wang, Tilman Dingler, G. Wadley, Jorge Gonçalves","doi":"10.1145/3512527.3531385","DOIUrl":"https://doi.org/10.1145/3512527.3531385","url":null,"abstract":"Recognising and monitoring emotional states play a crucial role in mental health and well-being management. Importantly, with the widespread adoption of smart mobile and wearable devices, it has become easier to collect long-term and granular emotion-related physiological data passively, continuously, and remotely. This creates new opportunities to help individuals manage their emotions and well-being in a less intrusive manner using off-the-shelf low-cost devices. Pervasive emotion recognition based on physiological signals is, however, still challenging due to the difficulty to efficiently extract high-order correlations between physiological signals and users' emotional states. In this paper, we propose a novel end-to-end emotion recognition system based on a convolution-augmented transformer architecture. Specifically, it can recognise users' emotions on the dimensions of arousal and valence by learning both the global and local fine-grained associations and dependencies within and across multimodal physiological data (including blood volume pulse, electrodermal activity, heart rate, and skin temperature). We extensively evaluated the performance of our model using the K-EmoCon dataset, which is acquired in naturalistic conversations using off-the-shelf devices and contains spontaneous emotion data. Our results demonstrate that our approach outperforms the baselines and achieves state-of-the-art or competitive performance. We also demonstrate the effectiveness and generalizability of our system on another affective dataset which used affect inducement and commercial physiological sensors.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121386336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sheng Zeng, Changhong Liu, J. Zhou, Yong Chen, Aiwen Jiang, Hanxi Li
Cross-modal image-text retrieval is a fundamental task in information retrieval. The key to this task is to address both the heterogeneity and the cross-modal semantic correlation between data of different modalities. Fine-grained matching methods can nicely model local semantic correlations between image and text but face two challenges. First, images may contain redundant information, while text sentences often contain words without semantic meaning; such redundancy interferes with the local matching between textual words and image regions. Furthermore, retrieval should consider not only the low-level semantic correspondence between image regions and textual words but also the higher-level semantic correlation between different intra-modal relationships. We propose a multi-layer graph convolutional network with object-level, object-relational-level, and higher-level learning sub-networks. Our method learns hierarchical semantic correspondences through both local and global alignment. We further introduce a self-attention mechanism after the word embedding to weaken insignificant words in the sentence and a cross-attention mechanism to guide the learning of image features. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate the effectiveness and superiority of our proposed method.
{"title":"Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval","authors":"Sheng Zeng, Changhong Liu, J. Zhou, Yong Chen, Aiwen Jiang, Hanxi Li","doi":"10.1145/3512527.3531358","DOIUrl":"https://doi.org/10.1145/3512527.3531358","url":null,"abstract":"Cross-modal image-text retrieval is a fundamental task in information retrieval. The key to this task is to address both heterogeneity and cross-modal semantic correlation between data of different modalities. Fine-grained matching methods can nicely model local semantic correlations between image and text but face two challenges. First, images may contain redundant information while text sentences often contain words without semantic meaning. Such redundancy interferes with the local matching between textual words and image regions. Furthermore, the retrieval shall consider not only low-level semantic correspondence between image regions and textual words but also a higher semantic correlation between different intra-modal relationships. We propose a multi-layer graph convolutional network with object-level, object-relational-level, and higher-level learning sub-networks. Our method learns hierarchical semantic correspondences by both local and global alignment. We further introduce a self-attention mechanism after the word embedding to weaken insignificant words in the sentence and a cross-attention mechanism to guide the learning of image features. Extensive experiments on Flickr30K and MS-COCO datasets demonstrate the effectiveness and superiority of our proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121650967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forward-looking sonar (FLS) is widely applied in underwater operations, among which the search for underwater crash objects and victims is an incredibly challenging task. An efficient deep-learning-based detection method can intelligently detect objects in FLS images, making it a reliable tool to replace manual recognition. To achieve this aim, we propose a novel Swin Transformer based anchor-free network (STAFNet), which combines a strong Swin Transformer backbone with a lightweight head using deformable convolution networks (DCN). We employ an ROV equipped with an FLS to acquire a dataset containing victim, boat, and plane model objects. A series of experiments are carried out on this dataset to train and verify the performance of STAFNet. Compared with other state-of-the-art methods, STAFNet significantly overcomes complex noise interference and achieves the best balance between detection accuracy and inference speed.
{"title":"STAFNet: Swin Transformer Based Anchor-Free Network for Detection of Forward-looking Sonar Imagery","authors":"Xingyu Zhu, Yingshuo Liang, Jianlei Zhang, Zengqiang Chen","doi":"10.1145/3512527.3531398","DOIUrl":"https://doi.org/10.1145/3512527.3531398","url":null,"abstract":"Forward-looking sonar (FLS) is widely applied in underwater operations, among which the search of underwater crash objects and victims is an incredibly challenging task. An efficient detection method based on deep learning can intelligently detect objects in FLS images, which makes it a reliable tool to replace manual recognition. To achieve this aim, we propose a novel Swin Transformer based anchor-free network (STAFNet), which contains a strong backbone Swin Transformer and a lite head with deformable convolution network (DCN). We employ a ROV equipped with a FLS to acquire dataset including victim, boat and plane model objects. A series of experiments are carried out on this dataset to train and verify the performance of STAFNet. Compared with other state-of-the-art methods, STAFNet significantly overcomes complex noise interference, and achieves the best balance between detection accuracy and inference speed.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115762766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Jiang, Yingzhe He, Richard Chapman, Hongyi Wu
Graph neural networks (GNNs) have enabled the automation of many web applications that entail node classification on graphs, such as scam detection in social media and event prediction in service networks. Nevertheless, recent studies revealed that GNNs are vulnerable to adversarial attacks: feeding GNNs poisoned data at training time can catastrophically degrade their test accuracy. This finding heats up the frontier of attacks and defenses against GNNs. However, prior studies mainly posit that adversaries enjoy free access to manipulate the original graph, while obtaining such access can be too costly in practice. To fill this gap, we propose a novel attacking paradigm, named Generative Adversarial Fake Node Camouflaging (GAFNC), whose crux lies in crafting a set of fake nodes in a generative-adversarial regime. These nodes carry camouflaged malicious features and can poison the victim GNN by passing their malicious messages to the original graph via learned topological structures, such that they 1) maximize the degradation of classification accuracy (i.e., global attack) or 2) force the victim GNN to misclassify a targeted node set into prescribed classes (i.e., targeted attack). We benchmark our experiments on four real-world graph datasets, and the results substantiate the viability, effectiveness, and stealthiness of our proposed poisoning attack approach. Code is released at github.com/chao92/GAFNC.
{"title":"Camouflaged Poisoning Attack on Graph Neural Networks","authors":"Chao Jiang, Yingzhe He, Richard Chapman, Hongyi Wu","doi":"10.1145/3512527.3531373","DOIUrl":"https://doi.org/10.1145/3512527.3531373","url":null,"abstract":"Graph neural networks (GNNs) have enabled the automation of many web applications that entail node classification on graphs, such as scam detection in social media and event prediction in service networks. Nevertheless, recent studies revealed that the GNNs are vulnerable to adversarial attacks, where feeding GNNs with poisoned data at training time can lead them to yield catastrophically devastative test accuracy. This finding heats up the frontier of attacks and defenses against GNNs. However, the prior studies mainly posit that the adversaries can enjoy free access to manipulate the original graph, while obtaining such access could be too costly in practice. To fill this gap, we propose a novel attacking paradigm, named Generative Adversarial Fake Node Camouflaging (GAFNC), with its crux lying in crafting a set of fake nodes in a generative-adversarial regime. These nodes carry camouflaged malicious features and can poison the victim GNN by passing their malicious messages to the original graph via learned topological structures, such that they 1) maximize the devastation of classification accuracy (i.e., global attack) or 2) enforce the victim GNN to misclassify a targeted node set into prescribed classes (i.e., target attack). We benchmark our experiments on four real-world graph datasets, and the results substantiate the viability, effectiveness, and stealthiness of our proposed poisoning attack approach. Code is released in github.com/chao92/GAFNC.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126397330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}