Fengjie Xu, Chang-Hua Zhang, Zhongshu Chen, Zhekai Du, Lei Han, Lin Zuo
Underwater image enhancement is a challenging task due to the degradation of image quality under complicated underwater lighting conditions and scenes. In recent years, most methods have improved the visual quality of underwater images using deep Convolutional Neural Networks and Generative Adversarial Networks. However, the majority of existing methods do not consider that the R, G, and B channels of an underwater image attenuate to different degrees, which leads to sub-optimal performance. Based on this observation, we propose a Channel-wise Multi-scale Residual Dense Network, called CMRD-Net, which learns the weights of the different color channels instead of treating all channels equally. More specifically, a Channel-wise Multi-scale Fusion Residual Attention Block (CMFRAB) is incorporated into CMRD-Net to obtain stronger feature extraction and representation. We evaluate the effectiveness of our model by comparing it with recent state-of-the-art methods. Extensive experimental results show that our method achieves satisfactory performance on a popular public dataset.
{"title":"CMRD-Net: An Improved Method for Underwater Image Enhancement","authors":"Fengjie Xu, Chang-Hua Zhang, Zhongshu Chen, Zhekai Du, Lei Han, Lin Zuo","doi":"10.1145/3469877.3493590","DOIUrl":"https://doi.org/10.1145/3469877.3493590","url":null,"abstract":"Underwater image enhancement is a challenging task due to the degradation of image quality in underwater complicated lighting conditions and scenes. In recent years, most methods improve the visual quality of underwater images by using deep Convolutional Neural Networks and Generative Adversarial Networks. However, the majority of existing methods do not consider that the attenuation degrees of R, G, B channels of the underwater image are different, leading to a sub-optimal performance. Based on this observation, we propose a Channel-wise Multi-scale Residual Dense Network called CMRD-Net, which learns the weights of different color channels instead of treating all the channels equally. More specifically, the Channel-wise Multi-scale Fusion Residual Attention Block (CMFRAB) is involved in the CMRD-Net to obtain a better ability of feature extraction and representation. Notably, we evaluate the effectiveness of our model by comparing it with recent state-of-the-art methods. Extensive experimental results show that our method can achieve a satisfactory performance on a popular public dataset.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"77 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114583006","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Currently, most adversarial attacks focus on adding perturbations to 2D images. Such attacks, however, cannot easily be mounted against a real-world AI system, since the system will not expose an interface to attackers. It is therefore more practical to add perturbations to the surfaces of real-world 3D objects, i.e., to perform 3D adversarial attacks. The key challenges for 3D adversarial attacks are handling viewpoint changes effectively and maintaining strong transferability across different state-of-the-art networks. In this paper, we focus on improving the robustness and transferability of 3D adversarial examples generated by perturbing the surface textures of 3D objects. To this end, we propose an effective method, named the Momentum Gradient-Filter Sign Method (M-GFSM), to generate 3D adversarial examples. Specifically, momentum is introduced into the generation of 3D adversarial examples, which yields multi-view robustness and high attack efficiency by updating the perturbation and stabilizing the update directions. In addition, a filtering operation is applied to improve the transferability of 3D adversarial examples by selectively filtering gradient images and completing the gradients of pixels neglected due to downsampling in the rendering stage. Experimental results show the effectiveness and good transferability of the proposed method. We also show that the 3D adversarial examples generated by our method remain robust under different illuminations.
{"title":"Towards Transferable 3D Adversarial Attack","authors":"Qiming Lu, Shikui Wei, Haoyu Chu, Yao Zhao","doi":"10.1145/3469877.3493596","DOIUrl":"https://doi.org/10.1145/3469877.3493596","url":null,"abstract":"Currently, most of the adversarial attacks focused on perturbation adding on 2D images. In this way, however, the adversarial attacks cannot easily be involved in a real-world AI system, since it is impossible for the AI system to open an interface to attackers. Therefore, it is more practical to add perturbation on real-world 3D objects’ surface, i.e., 3D adversarial attacks. The key challenges for 3D adversarial attacks are how to effectively deal with viewpoint changing and keep strong transferability across different state-of-the-art networks. In this paper, we mainly focus on improving the robustness and transferability of 3D adversarial examples generated by perturbing the surface textures of 3D objects. Towards this end, we propose an effective method, named Momentum Gradient-Filter Sign Method (M-GFSM), to generate 3D adversarial examples. Specially, the momentum is introduced into the procedure of 3D adversarial examples generation, which results in multiview robustness of 3D adversarial examples and high efficiency of attacking by updating the perturbation and stabilizing the update directions. In addition, filter operation is involved to improve the transferability of 3D adversarial examples by filtering gradient images selectively and completing the gradients of neglected pixels caused by downsampling in the rendering stage. Experimental results show the effectiveness and good transferability of the proposed method. Besides, we show that the 3D adversarial examples generated by our method still be robust under different illuminations.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"117037993","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Beibei Zhang, Fan Yu, Yaqun Fang, Tongwei Ren, Gangshan Wu
The Deep Video Understanding Challenge (DVU) focuses on comprehending long videos that involve many entities. Its main goal is to build a knowledge graph of relationships and interactions between entities in order to answer relevant questions. In this paper, we improve the joint learning method we previously proposed in several aspects, including few-shot learning, optical flow features, entity recognition, and video description matching. We verify the effectiveness of these measures through experiments.
{"title":"Hybrid Improvements in Multimodal Analysis for Deep Video Understanding","authors":"Beibei Zhang, Fan Yu, Yaqun Fang, Tongwei Ren, Gangshan Wu","doi":"10.1145/3469877.3493599","DOIUrl":"https://doi.org/10.1145/3469877.3493599","url":null,"abstract":"The Deep Video Understanding Challenge (DVU) is a task that focuses on comprehending long duration videos which involve many entities. Its main goal is to build relationship and interaction knowledge graph between entities to answer relevant questions. In this paper, we improved the joint learning method which we previously proposed in many aspects, including few shot learning, optical flow feature, entity recognition, and video description matching. We verified the effectiveness of these measures through experiments.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"22 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"124966874","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
In this paper, we explore a tokenized representation of musical scores using the Transformer model to automatically generate musical scores. Thus far, sequence models have yielded fruitful results with note-level (MIDI-equivalent) symbolic representations of music. Although note-level representations carry sufficient information to reproduce music aurally, they do not carry adequate information to represent music visually as notation. Musical scores contain various musical symbols (e.g., clef, key signature, and notes) and attributes (e.g., stem direction, beam, and tie) that enable us to visually comprehend musical content. However, automated estimation of these elements has yet to be comprehensively addressed. In this paper, we first design score token representations corresponding to the various musical elements. We then train the Transformer model to transcribe note-level representations into appropriate music notation. Evaluations on popular piano scores show that the proposed method significantly outperforms existing methods on all 12 musical aspects investigated. We also explore an effective notation-level token representation to work with the model and find that our proposed representation produces the steadiest results.
{"title":"Score Transformer: Generating Musical Score from Note-level Representation","authors":"Masahiro Suzuki","doi":"10.1145/3469877.3490612","DOIUrl":"https://doi.org/10.1145/3469877.3490612","url":null,"abstract":"In this paper, we explore the tokenized representation of musical scores using the Transformer model to automatically generate musical scores. Thus far, sequence models have yielded fruitful results with note-level (MIDI-equivalent) symbolic representations of music. Although the note-level representations can comprise sufficient information to reproduce music aurally, they cannot contain adequate information to represent music visually in terms of notation. Musical scores contain various musical symbols (e.g., clef, key signature, and notes) and attributes (e.g., stem direction, beam, and tie) that enable us to visually comprehend musical content. However, automated estimation of these elements has yet to be comprehensively addressed. In this paper, we first design score token representation corresponding to the various musical elements. We then train the Transformer model to transcribe note-level representation into appropriate music notation. Evaluations of popular piano scores show that the proposed method significantly outperforms existing methods on all 12 musical aspects that were investigated. We also explore an effective notation-level token representation to work with the model and determine that our proposed representation produces the steadiest results.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"114563756","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Social events are common activities in which people interact with each other. During an event, the organizer often hires photographers to take images, which provide rich information about the participants' behaviour. In this work, we propose a method to discover the social graphs among event participants from event images for social network analytics. By studying over 94 events with 32,330 event images, we show that social graphs can be effectively extracted solely from event images. We find that the discovered social graphs exhibit properties similar to those of online social graphs; for instance, the degree distribution follows a power law. The usefulness of the proposed method for social graph discovery from event images is demonstrated through two applications: important participant detection and community detection. To the best of our knowledge, this is the first work to show the feasibility of discovering social graphs from event images alone. As a result, social network analytics such as recommendation become possible even without access to an online social graph.
{"title":"Discovering Social Connections using Event Images","authors":"Ming Cheung, Weiwei Sun, Jiantao Zhou","doi":"10.1145/3469877.3493699","DOIUrl":"https://doi.org/10.1145/3469877.3493699","url":null,"abstract":"Social events are very common activities, where people can interact with each other. During an event, the organizer often hires photographers to take images, which provide rich information about the participants’ behaviour. In this work, we propose a method to discover the social graphs among event participants from the event images for social network analytics. By studying over 94 events with 32,330 event images, it is proven that the social graphs can be effectively extracted solely from event images. It is found that the discovered social graphs follow similar properties of online social graphs; for instance, the degree distribution obeys power law distribution. The usefulness of the proposed method for social graph discovery from event images is demonstrated through two applications: important participants detection and community detection. To the best of our knowledge, it is the first work to show the feasibility of discovering social graphs by utilizing event images only. As a result, social network analytics such as recommendations become possible, even without access to the online social graph.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"81 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133102358","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jiapeng Tang, Yi Fang, Yu Dong, Rong Xie, Xiao Gu, Guangtao Zhai, Li Song
Blind quality assessment of images and videos captured in the wild, known as in-the-wild I/VQA, has attracted growing interest. Prior deep-learning-based approaches have achieved considerable progress in I/VQA but suffer from two intrinsic issues. First, most existing methods fine-tune models pre-trained for image classification owing to the absence of large-scale I/VQA datasets; the task misalignment between I/VQA and image classification, however, degrades generalization performance. Second, existing VQA methods directly apply temporal pooling to the predicted frame-wise scores, resulting in ambiguous inter-frame relation modeling. In this work, we propose a two-stage architecture to separately predict image and video quality in the wild. In the first stage, we resort to supervised contrastive learning to derive quality-aware representations that facilitate the prediction of image quality. Specifically, we propose a novel quality-aware contrastive loss that pulls together samples of similar quality and pushes apart samples of different quality in the embedding space. In the second stage, we develop a Relation-Guided Temporal Attention (RTA) module for video quality prediction, which captures global inter-frame dependencies in the embedding space to learn frame-wise attention weights for frame quality aggregation. Extensive experiments demonstrate that our approach performs favorably against state-of-the-art methods on both authentically distorted image benchmarks and video benchmarks.
{"title":"Blindly Predict Image and Video Quality in the Wild","authors":"Jiapeng Tang, Yi Fang, Yu Dong, Rong Xie, Xiao Gu, Guangtao Zhai, Li Song","doi":"10.1145/3469877.3490588","DOIUrl":"https://doi.org/10.1145/3469877.3490588","url":null,"abstract":"Emerging interests have been brought to blind quality assessment for images/videos captured in the wild, known as in-the-wild I/VQA. Prior deep learning based approaches have achieved considerable progress in I/VQA, but are intrinsically troubled with two issues. Firstly, most existing methods fine-tune the image-classification-oriented pre-trained models for the absence of large-scale I/VQA datasets. However, the task misalignment between I/VQA and image classification leads to degraded generalization performance. Secondly, existing VQA methods directly conduct temporal pooling on the predicted frame-wise scores, resulting in ambiguous inter-frame relation modeling. In this work, we propose a two-stage architecture to separately predict image and video quality in the wild. In the first stage, we resort to supervised contrastive learning to derive quality-aware representations that facilitate the prediction of image quality. Specifically, we propose a novel quality-aware contrastive loss to pull together samples of similar quality and push away quality-different ones in embedding space. In the second stage, we develop a Relation-Guided Temporal Attention (RTA) module for video quality prediction, which captures global inter-frame dependencies in embedding space to learn frame-wise attention weights for frame quality aggregation. Extensive experiments demonstrate that our approach performs favorably against state-of-the-art methods on both authentically distorted image benchmarks and video benchmarks.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"141 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133850026","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Jun Zhang, X. Zhong, Jingling Yuan, Shilei Zhao, Rongbo Zhang, Duxiu Feng, Luo Zhong
In real traffic scenarios, the resolution of captured vehicle images varies considerably with the distance to the vehicle and with the direction and height of the camera. When a resolution difference exists between the probe and the gallery vehicle, a resolution mismatch occurs, which seriously degrades the performance of vehicle re-identification (Re-ID). This problem is known as multi-resolution vehicle Re-ID. An effective strategy is to use image super-resolution to bridge the resolution gap. However, existing methods apply super-resolution to global images rather than to local representations of each image, so much of the generated information is noise arising from the background and illumination variations. In this work, we therefore propose local-enhanced multi-resolution representation learning (LMRL) to address these problems by jointly training a local-enhanced super-resolution (LSR) module and a local-guided contrastive learning (LCL) module. Specifically, we use a parsing network to parse a vehicle into four parts so as to extract a local-enhanced vehicle representation. The LSR module, which consists of two auto-encoders that share parameters, then transforms low-resolution images into high-resolution ones in both the global and local branches. The LCL module learns a discriminative vehicle representation by contrasting local representations of the high-resolution reconstructed image with those of the ground truth. We evaluate our approach on two public datasets that contain vehicle images at a wide range of resolutions, and it shows significant superiority over existing solutions.
{"title":"Local-enhanced Multi-resolution Representation Learning for Vehicle Re-identification","authors":"Jun Zhang, X. Zhong, Jingling Yuan, Shilei Zhao, Rongbo Zhang, Duxiu Feng, Luo Zhong","doi":"10.1145/3469877.3497690","DOIUrl":"https://doi.org/10.1145/3469877.3497690","url":null,"abstract":"In real traffic scenarios, the changes of vehicle resolution that the camera captures tend to be relatively obvious considering the distances to the vehicle, different directions, and height of the camera. When the resolution difference exists between the probe and the gallery vehicle, the resolution mismatch will occur, which will seriously influence the performance of the vehicle re-identification (Re-ID). This problem is also known as multi-resolution vehicle Re-ID. An effective strategy is equivalent to utilize image super-resolution to handle the resolution gap. However, existing methods conduct super-resolution on global images instead of local representation of each image, leading to much more noisy information generated from the background and illumination variations. In our work, a local-enhanced multi-resolution representation learning (LMRL) is therefore proposed to address these problems by combining the training of local-enhanced super-resolution (LSR) module and local-guided contrastive learning (LCL) module. Specifically, we use a parsing network to parse a vehicle into four different parts to extract local-enhanced vehicle representation. And then, the LSR module, which consists of two auto-encoders that share parameters, transforms low-resolution images into high-resolution in both global and local branches. LCL module can learn discriminative vehicle representation by contrasting local representation between the high-resolution reconstructed image and the ground truth. We evaluate our approach on two public datasets that contain vehicle images at a wide range of resolutions, in which our approach shows significant superiority to the existing solution.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"160 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"115995625","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dongliang Shao, Yunhui Shi, Jin Wang, N. Ling, Baocai Yin
Removing undesirable reflections from a single image captured through a glass surface benefits a broad range of image processing and computer vision tasks, but it is an ill-posed and challenging problem. Traditional single-image reflection removal (SIRR) methods are often ineffective because of the limited descriptive ability of handcrafted priors. State-of-the-art learning-based methods, designed as unexplainable black boxes, often suffer from instability. In this paper, we present an explainable approach to SIRR named the model-guided unfolding network (MoG-SIRR), which is unfolded from our proposed reflection removal model with a non-local autoregressive prior and a dereflection prior. To complement the transmission layer and the reflection layer in a single image, we construct a two-stream deep learning framework that integrates reflection removal and non-local regularization into trainable modules. Extensive experiments on public benchmark datasets demonstrate that our method achieves superior performance for single image reflection removal.
{"title":"A Model-Guided Unfolding Network for Single Image Reflection Removal","authors":"Dongliang Shao, Yunhui Shi, Jin Wang, N. Ling, Baocai Yin","doi":"10.1145/3469877.3490607","DOIUrl":"https://doi.org/10.1145/3469877.3490607","url":null,"abstract":"Removing undesirable reflections from a single image captured through a glass surface is of broad application to various image processing and computer vision tasks, but it is an ill-posed and challenging problem. Existing traditional single image reflection removal(SIRR) methods are often less efficient to remove reflection due to the limited description ability of handcrafted priors. State-of-the-art learning based methods often cause instability problems because they are designed as unexplainable black boxes. In this paper, we present an explainable approach for SIRR named model-guided unfolding network(MoG-SIRR), which is unfolded from our proposed reflection removal model with non-local autoregressive prior and dereflection prior. In order to complement the transmission layer and the reflection layer in a single image, we construct a deep learning framework with two streams by integrating reflection removal and non-local regularization into trainable modules. Extensive experiments on public benchmark datasets demonstrate that our method achieves superior performance for single image reflection removal.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"4657 3 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129369162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Alberto Baldrati, M. Bertini, Tiberio Uricchio, A. del Bimbo
Building on recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that understands the image content, integrates the textual description, and provides a combined feature used to perform the conditioned image retrieval. Starting from bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network based on such multimodal features is extremely effective and outperforms more complex state-of-the-art approaches on the popular FashionIQ dataset.
{"title":"Conditioned Image Retrieval for Fashion using Contrastive Learning and CLIP-based Features","authors":"Alberto Baldrati, M. Bertini, Tiberio Uricchio, A. del Bimbo","doi":"10.1145/3469877.3493593","DOIUrl":"https://doi.org/10.1145/3469877.3493593","url":null,"abstract":"Building on the recent advances in multimodal zero-shot representation learning, in this paper we explore the use of features obtained from the recent CLIP model to perform conditioned image retrieval. Starting from a reference image and an additive textual description of what the user wants with respect to the reference image, we learn a Combiner network that is able to understand the image content, integrate the textual description and provide combined feature used to perform the conditioned image retrieval. Starting from the bare CLIP features and a simple baseline, we show that a carefully crafted Combiner network, based on such multimodal features, is extremely effective and outperforms more complex state of the art approaches on the popular FashionIQ dataset.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"18 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130982013","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Yang Shi, Xiushan Nie, Quan Zhou, Li Zou, Yilong Yin
Recent studies have verified that learning compact hash codes can facilitate big-data retrieval, and learning a deep hash function can greatly improve retrieval performance. However, existing deep supervised hashing algorithms treat all samples in the same way, which leads to insufficient learning from difficult samples; the similarity relation therefore cannot be learned accurately, making it difficult to achieve satisfactory performance. In light of this, we propose a deep supervised hashing model called deep adaptive attention triple hashing (DAATH), which weights the similarity prediction scores of positive and negative samples in the form of triples, thus giving different degrees of attention to different samples. Compared with the traditional triple loss, it places greater emphasis on difficult triples and dramatically reduces redundant computation. Extensive experiments show that DAATH consistently outperforms state-of-the-art methods, confirming its effectiveness.
{"title":"Deep Adaptive Attention Triple Hashing","authors":"Yang Shi, Xiushan Nie, Quan Zhou, Li Zou, Yilong Yin","doi":"10.1145/3469877.3495646","DOIUrl":"https://doi.org/10.1145/3469877.3495646","url":null,"abstract":"Recent studies have verified that learning compact hash codes can facilitate big data retrieval processing. In particular, learning the deep hash function can greatly improve the retrieval performance. However, the existing deep supervised hashing algorithm treats all the samples in the same way, which leads to insufficient learning of difficult samples. Therefore, we cannot obtain the accurate learning of the similarity relation, making it difficult to achieve satisfactory performance. In light of this, this work proposes a deep supervised hashing model, called deep adaptive attention triple hashing (DAATH), which weights the similarity prediction scores of positive and negative samples in the form of triples, thus giving different degrees of attention to different samples. Compared with the traditional triple loss, it places a greater emphasis on the difficult triple, dramatically reducing the redundant calculation. Extensive experiments have been conducted to show that DAAH consistently outperforms the state-of-the-arts, confirmed its the effectiveness.","PeriodicalId":210974,"journal":{"name":"ACM Multimedia Asia","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-12-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130413777","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}