Jinpeng Chen, Yuan Cao, Fan Zhang, Pengfei Sun, Kaimin Wei
The next-item recommendation problem has received increasing attention from researchers in recent years. Existing algorithms focus mainly on the binary user-item relationship, ignore implicit item semantic information, and suffer from high data sparsity. Inspired by the fact that a user's decision-making process is often influenced by both intention and preference, this paper presents a SequentiAl inTentiOn-aware Recommender based on a user Interaction graph (Satori). In Satori, we first use a novel user interaction graph to construct relationships between users, items, and categories. Then, we leverage a graph attention network to extract auxiliary features on the graph and generate the three embeddings. Next, we adopt a self-attention mechanism to model user intention and preference separately, which are later combined to form a hybrid user representation. Finally, the hybrid user representation and the previously obtained item representation are both sent to the prediction module to calculate the predicted item score. Experiments on real-world datasets show that our approach outperforms state-of-the-art methods.
{"title":"Sequential Intention-aware Recommender based on User Interaction Graph","authors":"Jinpeng Chen, Yuan Cao, Fan Zhang, Pengfei Sun, Kaimin Wei","doi":"10.1145/3512527.3531390","DOIUrl":"https://doi.org/10.1145/3512527.3531390","url":null,"abstract":"The next-item recommendation problem has received more and more attention from researchers in recent years. Ignoring the implicit item semantic information, existing algorithms focus more on the user-item binary relationship and suffer from high data sparsity. Inspired by the fact that user's decision-making process is often influenced by both intention and preference, this paper presents a SequentiAl inTentiOn-aware Recommender based on a user Interaction graph (Satori). In Satori, we first use a novel user interaction graph to construct relationships between users, items, and categories. Then, we leverage a graph attention network to extract auxiliary features on the graph and generate the three embeddings. Next, we adopt self-attention mechanism to model user intention and preference respectively which are later combined to form a hybrid user representation. Finally, the hybrid user representation and previously obtained item representation are both sent to the prediction modul to calculate the predicted item score. Testing on real-world datasets, the results prove that our approach outperforms state-of-the-art methods.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"41 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131644133","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Traditional Chinese painting is a unique form of artistic expression. Compared with Western painting, it pays more attention to verve than to visual detail; ink wash painting in particular makes extensive use of lines and pays little attention to information such as texture. Some style transfer methods have recently begun to apply traditional Chinese painting styles (such as the ink wash style) to photorealistic images. However, ink stylization of different types of real-world photos with these methods has limitations: when the input images contain animal types not seen in the training set, the generated results retain semantic features of the training data, resulting in distortion. Therefore, in this paper, we attempt to separate the feature representations for styles and contents and propose a style-woven attention network to achieve zero-shot ink wash painting style transfer. Our model learns to disentangle the data representations in an unsupervised fashion and capture the semantic correlations of content and style. In addition, an ink style loss is added to improve the learning ability of the style encoder. To verify the ability of ink wash stylization, we augmented the publicly available ChipPhi dataset. Extensive experiments on a wide validation set show that our method achieves state-of-the-art results.
{"title":"Style-woven Attention Network for Zero-shot Ink Wash Painting Style Transfer","authors":"Haochen Sun, L. Wu, Xiang Li, Xiangxu Meng","doi":"10.1145/3512527.3531391","DOIUrl":"https://doi.org/10.1145/3512527.3531391","url":null,"abstract":"Traditional Chinese painting is a unique form of artistic expression. Compared with western art painting, it pays more attention to the verve in visual effect, especially ink painting, which makes good use of lines and pays little attention to information such as texture. Some style transfer methods have recently begun to apply traditional Chinese painting style (such as ink wash style) to photorealistic. Ink stylization of different types of real-world photos in a dataset using these style transfer methods has some limitations. When the input images are animal types that have not been seen in the training set, the generated results retain some semantic features of the data in the training set, resulting in distortion. Therefore, in this paper, we attempt to separate the feature representations for styles and contents and propose a style-woven attention network to achieve zero-shot ink wash painting style transfer. Our model learns to disentangle the data representations in an unsupervised fashion and capture the semantic correlations of content and style. In addition, an ink style loss is added to improve the learning ability of the style encoder. In order to verify the ability of ink wash stylization, we augmented the publicly available dataset $ChipPhi$. Extensive experiments based on a wide validation set prove that our method achieves state-of-the-art results.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"124 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128165004","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual place recognition (VPR) aims to estimate the geographical location of a query image by finding its nearest reference images in a large geo-tagged database. Most existing methods adopt convolutional neural networks to extract feature maps from images. Nevertheless, such feature maps are high-dimensional tensors, and it is challenging to effectively aggregate them into a compact vector representation for efficient retrieval. To tackle this challenge, we develop an end-to-end convolutional neural network architecture named DMPCANet. The network adopts a regional pooling module to generate feature tensors of the same size from images of different sizes. The core component of our network, the Differentiable Multilinear Principal Component Analysis (DMPCA) module, acts directly on tensor data and utilizes convolution operations to generate projection matrices for dimensionality reduction, reducing the dimensionality to one sixteenth. This module preserves crucial information while reducing data dimensions. Experiments on two widely used place recognition datasets demonstrate that our proposed DMPCANet can generate low-dimensional discriminative global descriptors and achieve state-of-the-art results.
{"title":"DMPCANet: A Low Dimensional Aggregation Network for Visual Place Recognition","authors":"Yinghao Wang, Haonan Chen, Jiong Wang, Yingying Zhu","doi":"10.1145/3512527.3531427","DOIUrl":"https://doi.org/10.1145/3512527.3531427","url":null,"abstract":"Visual place recognition (VPR) aims to estimate the geographical location of a query image by finding its nearest reference images from a large geo-tagged database. Most of the existing methods adopt convolutional neural networks to extract feature maps from images. Nevertheless, such feature maps are high-dimensional tensors, and it is a challenge to effectively aggregate them into a compact vector representation for efficient retrieval. To tackle this challenge, we develop an end-to-end convolutional neural network architecture named DMPCANet. The network adopts the regional pooling module to generate feature tensors of the same size from images of different sizes. The core component of our network, the Differentiable Multilinear Principal Component Analysis (DMPCA) module, directly acts on tensor data and utilizes convolution operations to generate projection matrices for dimensionality reduction, thereby reducing the dimensionality to one sixteenth. This module can preserve crucial information while reducing data dimensions. Experiments on two widely used place recognition datasets demonstrate that our proposed DMPCANet can generate low-dimensional discriminative global descriptors and achieve the state-of-the-art results.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114174903","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Highly efficient point cloud compression (PCC) techniques are necessary for various practical 3D applications, such as autonomous driving, holographic transmission, and virtual reality. The sparse and unordered nature of point clouds makes it challenging to design compression frameworks. In this paper, we present a new model, called TransPCC, that adopts a fully Transformer-based auto-encoder architecture for deep Point Cloud Compression. Taking the input point cloud as a set in continuous space with learnable position embeddings, we employ self-attention layers and the necessary point-wise operations for point cloud compression. The self-attention-based architecture enables our model to better learn point-wise dependency information. Experimental results show that our method outperforms state-of-the-art methods on a large-scale point cloud dataset.
{"title":"TransPCC: Towards Deep Point Cloud Compression via Transformers","authors":"Zujie Liang, Fan Liang","doi":"10.1145/3512527.3531423","DOIUrl":"https://doi.org/10.1145/3512527.3531423","url":null,"abstract":"High-efficient point cloud compression (PCC) techniques are necessary for various 3D practical applications, such as autonomous driving, holographic transmission, virtual reality, etc. The sparsity and disorder nature make it challenging to design frameworks for point cloud compression. In this paper, we present a new model, called TransPCC that adopts a fully Transformer auto-encoder architecture for deep Point Cloud Compression. By taking the input point cloud as a set in continuous space with learnable position embeddings, we employ the self-attention layers and necessary point-wise operations for point cloud compression. The self-attention based architecture enables our model to better learn point-wise dependency information for point cloud compression. Experimental results show that our method outperforms state-of-the-art methods on large-scale point cloud dataset.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"134 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"123415071","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu
This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval, where the text contains descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform fine-grained alignment of the event-dense text with the image. Our proposed approach incorporates event-oriented features to enhance cross-modal alignment, and applies event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event with a Transformer and combine it with the identified key event elements to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates event-based cross-modal alignment. Finally, we integrate the text and image embeddings with a loss optimization empowered by the event tag, iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach, with a 23.3% improvement in top-1 Recall for image-to-recipe retrieval on the Recipe1M 10k test set.
{"title":"Cross-Modal Retrieval between Event-Dense Text and Image","authors":"Zhongwei Xie, Lin Li, Luo Zhong, Jianquan Liu, Ling Liu","doi":"10.1145/3512527.3531374","DOIUrl":"https://doi.org/10.1145/3512527.3531374","url":null,"abstract":"This paper presents a novel approach to the problem of event-dense text and image cross-modal retrieval where the text contains the descriptions of numerous events. It is known that modality alignment is crucial for retrieval performance. However, due to the lack of event sequence information in the image, it is challenging to perform the fine-grain alignment of the event-dense text with the image. Our proposed approach incorporates the event-oriented features to enhance the cross-modal alignment, and applies the event-dense text-image retrieval to the food domain for empirical validation. Specifically, we capture the significance of each event by Transformer, and combine it with the identified key event elements, to enhance the discriminative ability of the learned text embedding that summarizes all the events. Next, we produce the image embedding by combining the event tag jointly shared by the text and image with the visual embedding of the event-related image regions, which describes the eventual consequence of all the events and facilitates the event-based cross-modal alignment. Finally, we integrate text embedding and image embedding with the loss optimization empowered with the event tag by iteratively regulating the joint embedding learning for cross-modal retrieval. Extensive experiments demonstrate that our proposed event-oriented modality alignment approach significantly outperforms the state-of-the-art approach with a 23.3% improvement on top-1 Recall for image-to-recipe retrieval on Recipe1M 10k test set.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"8 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121117803","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-modal machine translation (MMT) aims to augment linguistic machine translation frameworks by incorporating aligned visual information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on alignment in the space of bilingual text or emphasized the combination of one side of the text with the given image. In this work, we entertain the possibility of a triplet alignment among the source text, the target text, and the image instance. In particular, we propose a Multi-aspect AlignmenT (MAT) model that augments the MMT task with three sub-tasks, namely cross-language translation alignment, cross-modal captioning alignment, and multi-modal hybrid alignment. At the core of this model is a hybrid vocabulary that compiles the occurrences of visually depictable entities (nouns) on both sides of the text as well as the object labels detected in the images. Through these sub-tasks, we postulate that MAT manages to further align the modalities by casting the three instances into a shared domain, as compared against previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approach, which achieves several state-of-the-art results on two benchmark datasets for the MMT task.
{"title":"HybridVocab: Towards Multi-Modal Machine Translation via Multi-Aspect Alignment","authors":"Ru Peng, Yawen Zeng, J. Zhao","doi":"10.1145/3512527.3531386","DOIUrl":"https://doi.org/10.1145/3512527.3531386","url":null,"abstract":"Multi-modal machine translation (MMT) aims to augment the linguistic machine translation frameworks by incorporating aligned vision information. As the core research challenge for MMT, how to fuse the image information and further align it with the bilingual data remains critical. Existing works have either focused on a methodological alignment in the space of bilingual text or emphasized the combination of the one-sided text and given image. In this work, we entertain the possibility of a triplet alignment, among the source and target text together with the image instance. In particular, we propose Multi-aspect AlignmenT (MAT) model that augments the MMT tasks to three sub-tasks --- namely cross-language translation alignment, cross-modal captioning alignment and multi-modal hybrid alignment tasks. Core to this model consists of a hybrid vocabulary which compiles the visually depictable entity (nouns) occurrence on both sides of the text as well as the detected object labels appearing in the images. Through this sub-task, we postulate that MAT manages to further align the modalities by casting three instances into a shared domain, as compared against previously proposed methods. Extensive experiments and analyses demonstrate the superiority of our approaches, which achieve several state-of-the-art results on two benchmark datasets of the MMT task.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"34 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121332006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kangning Yang, Benjamin Tag, Yue Gu, Chaofan Wang, Tilman Dingler, G. Wadley, Jorge Gonçalves
Recognising and monitoring emotional states play a crucial role in mental health and well-being management. Importantly, with the widespread adoption of smart mobile and wearable devices, it has become easier to collect long-term and granular emotion-related physiological data passively, continuously, and remotely. This creates new opportunities to help individuals manage their emotions and well-being in a less intrusive manner using off-the-shelf low-cost devices. Pervasive emotion recognition based on physiological signals is, however, still challenging due to the difficulty of efficiently extracting high-order correlations between physiological signals and users' emotional states. In this paper, we propose a novel end-to-end emotion recognition system based on a convolution-augmented transformer architecture. Specifically, it recognises users' emotions on the dimensions of arousal and valence by learning both global and local fine-grained associations and dependencies within and across multimodal physiological data (including blood volume pulse, electrodermal activity, heart rate, and skin temperature). We extensively evaluated the performance of our model on the K-EmoCon dataset, which was acquired in naturalistic conversations using off-the-shelf devices and contains spontaneous emotion data. Our results demonstrate that our approach outperforms the baselines and achieves state-of-the-art or competitive performance. We also demonstrate the effectiveness and generalizability of our system on another affective dataset that used affect inducement and commercial physiological sensors.
{"title":"Mobile Emotion Recognition via Multiple Physiological Signals using Convolution-augmented Transformer","authors":"Kangning Yang, Benjamin Tag, Yue Gu, Chaofan Wang, Tilman Dingler, G. Wadley, Jorge Gonçalves","doi":"10.1145/3512527.3531385","DOIUrl":"https://doi.org/10.1145/3512527.3531385","url":null,"abstract":"Recognising and monitoring emotional states play a crucial role in mental health and well-being management. Importantly, with the widespread adoption of smart mobile and wearable devices, it has become easier to collect long-term and granular emotion-related physiological data passively, continuously, and remotely. This creates new opportunities to help individuals manage their emotions and well-being in a less intrusive manner using off-the-shelf low-cost devices. Pervasive emotion recognition based on physiological signals is, however, still challenging due to the difficulty to efficiently extract high-order correlations between physiological signals and users' emotional states. In this paper, we propose a novel end-to-end emotion recognition system based on a convolution-augmented transformer architecture. Specifically, it can recognise users' emotions on the dimensions of arousal and valence by learning both the global and local fine-grained associations and dependencies within and across multimodal physiological data (including blood volume pulse, electrodermal activity, heart rate, and skin temperature). We extensively evaluated the performance of our model using the K-EmoCon dataset, which is acquired in naturalistic conversations using off-the-shelf devices and contains spontaneous emotion data. Our results demonstrate that our approach outperforms the baselines and achieves state-of-the-art or competitive performance. We also demonstrate the effectiveness and generalizability of our system on another affective dataset which used affect inducement and commercial physiological sensors.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"73 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121386336","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Sheng Zeng, Changhong Liu, J. Zhou, Yong Chen, Aiwen Jiang, Hanxi Li
Cross-modal image-text retrieval is a fundamental task in information retrieval. The key to this task is to address both the heterogeneity and the cross-modal semantic correlation between data of different modalities. Fine-grained matching methods can nicely model local semantic correlations between image and text but face two challenges. First, images may contain redundant information, while text sentences often contain words without semantic meaning; such redundancy interferes with the local matching between textual words and image regions. Furthermore, retrieval should consider not only the low-level semantic correspondence between image regions and textual words but also the higher-level semantic correlation between different intra-modal relationships. We propose a multi-layer graph convolutional network with object-level, object-relational-level, and higher-level learning sub-networks. Our method learns hierarchical semantic correspondences through both local and global alignment. We further introduce a self-attention mechanism after the word embedding to weaken insignificant words in the sentence and a cross-attention mechanism to guide the learning of image features. Extensive experiments on the Flickr30K and MS-COCO datasets demonstrate the effectiveness and superiority of our proposed method.
{"title":"Learning Hierarchical Semantic Correspondences for Cross-Modal Image-Text Retrieval","authors":"Sheng Zeng, Changhong Liu, J. Zhou, Yong Chen, Aiwen Jiang, Hanxi Li","doi":"10.1145/3512527.3531358","DOIUrl":"https://doi.org/10.1145/3512527.3531358","url":null,"abstract":"Cross-modal image-text retrieval is a fundamental task in information retrieval. The key to this task is to address both heterogeneity and cross-modal semantic correlation between data of different modalities. Fine-grained matching methods can nicely model local semantic correlations between image and text but face two challenges. First, images may contain redundant information while text sentences often contain words without semantic meaning. Such redundancy interferes with the local matching between textual words and image regions. Furthermore, the retrieval shall consider not only low-level semantic correspondence between image regions and textual words but also a higher semantic correlation between different intra-modal relationships. We propose a multi-layer graph convolutional network with object-level, object-relational-level, and higher-level learning sub-networks. Our method learns hierarchical semantic correspondences by both local and global alignment. We further introduce a self-attention mechanism after the word embedding to weaken insignificant words in the sentence and a cross-attention mechanism to guide the learning of image features. Extensive experiments on Flickr30K and MS-COCO datasets demonstrate the effectiveness and superiority of our proposed method.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"24 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"121650967","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Forward-looking sonar (FLS) is widely applied in underwater operations, among which the search for underwater crash objects and victims is an incredibly challenging task. An efficient deep-learning-based detection method can intelligently detect objects in FLS images, making it a reliable tool to replace manual recognition. To achieve this aim, we propose a novel Swin Transformer based anchor-free network (STAFNet), which combines a strong Swin Transformer backbone with a lightweight head using deformable convolution networks (DCN). We employ an ROV equipped with an FLS to acquire a dataset containing victim, boat, and plane model objects. A series of experiments are carried out on this dataset to train and verify the performance of STAFNet. Compared with other state-of-the-art methods, STAFNet significantly overcomes complex noise interference and achieves the best balance between detection accuracy and inference speed.
{"title":"STAFNet: Swin Transformer Based Anchor-Free Network for Detection of Forward-looking Sonar Imagery","authors":"Xingyu Zhu, Yingshuo Liang, Jianlei Zhang, Zengqiang Chen","doi":"10.1145/3512527.3531398","DOIUrl":"https://doi.org/10.1145/3512527.3531398","url":null,"abstract":"Forward-looking sonar (FLS) is widely applied in underwater operations, among which the search of underwater crash objects and victims is an incredibly challenging task. An efficient detection method based on deep learning can intelligently detect objects in FLS images, which makes it a reliable tool to replace manual recognition. To achieve this aim, we propose a novel Swin Transformer based anchor-free network (STAFNet), which contains a strong backbone Swin Transformer and a lite head with deformable convolution network (DCN). We employ a ROV equipped with a FLS to acquire dataset including victim, boat and plane model objects. A series of experiments are carried out on this dataset to train and verify the performance of STAFNet. Compared with other state-of-the-art methods, STAFNet significantly overcomes complex noise interference, and achieves the best balance between detection accuracy and inference speed.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"82 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115762766","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Chao Jiang, Yingzhe He, Richard Chapman, Hongyi Wu
Graph neural networks (GNNs) have enabled the automation of many web applications that entail node classification on graphs, such as scam detection in social media and event prediction in service networks. Nevertheless, recent studies revealed that GNNs are vulnerable to adversarial attacks: feeding GNNs poisoned data at training time can catastrophically degrade their test accuracy. This finding heats up the frontier of attacks and defenses against GNNs. However, prior studies mainly posit that adversaries enjoy free access to manipulate the original graph, while obtaining such access can be too costly in practice. To fill this gap, we propose a novel attacking paradigm, named Generative Adversarial Fake Node Camouflaging (GAFNC), whose crux lies in crafting a set of fake nodes in a generative-adversarial regime. These nodes carry camouflaged malicious features and can poison the victim GNN by passing their malicious messages to the original graph via learned topological structures, such that they 1) maximize the degradation of classification accuracy (i.e., global attack) or 2) force the victim GNN to misclassify a targeted node set into prescribed classes (i.e., targeted attack). We benchmark our experiments on four real-world graph datasets, and the results substantiate the viability, effectiveness, and stealthiness of our proposed poisoning attack approach. Code is released at github.com/chao92/GAFNC.
{"title":"Camouflaged Poisoning Attack on Graph Neural Networks","authors":"Chao Jiang, Yingzhe He, Richard Chapman, Hongyi Wu","doi":"10.1145/3512527.3531373","DOIUrl":"https://doi.org/10.1145/3512527.3531373","url":null,"abstract":"Graph neural networks (GNNs) have enabled the automation of many web applications that entail node classification on graphs, such as scam detection in social media and event prediction in service networks. Nevertheless, recent studies revealed that the GNNs are vulnerable to adversarial attacks, where feeding GNNs with poisoned data at training time can lead them to yield catastrophically devastative test accuracy. This finding heats up the frontier of attacks and defenses against GNNs. However, the prior studies mainly posit that the adversaries can enjoy free access to manipulate the original graph, while obtaining such access could be too costly in practice. To fill this gap, we propose a novel attacking paradigm, named Generative Adversarial Fake Node Camouflaging (GAFNC), with its crux lying in crafting a set of fake nodes in a generative-adversarial regime. These nodes carry camouflaged malicious features and can poison the victim GNN by passing their malicious messages to the original graph via learned topological structures, such that they 1) maximize the devastation of classification accuracy (i.e., global attack) or 2) enforce the victim GNN to misclassify a targeted node set into prescribed classes (i.e., target attack). We benchmark our experiments on four real-world graph datasets, and the results substantiate the viability, effectiveness, and stealthiness of our proposed poisoning attack approach. Code is released in github.com/chao92/GAFNC.","PeriodicalId":179895,"journal":{"name":"Proceedings of the 2022 International Conference on Multimedia Retrieval","volume":"87 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2022-06-27","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"126397330","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}