A Hyperspectral Approach For Unsupervised Spoof Detection With Intra-Sample Distribution
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506625
Tomoya Kaichi, Yuko Ozasa
Despite the high recognition accuracy of recent deep neural networks, they can be easily deceived by spoofing. Spoofs (e.g., a printed photograph) visually resemble the actual objects quite closely. We therefore propose a method for spoof detection with a hyperspectral image (HSI), which can effectively detect differences in surface materials. In contrast to existing anti-spoofing approaches, the proposed method learns the feature representation for spoof detection without spoof supervision. The informative pixels of an HSI are embedded into a feature space, and the spoof is identified from their distribution. As this is the first attempt at unsupervised spoof detection with an HSI, a new dataset that includes spoofs, named the Hyperspectral Spoof Dataset (HSSD), has been developed. The experimental results indicate that the proposed method performs significantly better than the baselines. The source code and the dataset are available on GitHub.
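As a rough illustration of the intra-sample distribution idea described above, the sketch below embeds per-pixel spectra into a low-dimensional space (a fixed linear projection stands in for the learned, spoof-unsupervised embedding, and the selection of informative pixels is omitted) and scores a sample by how far its pixel statistics lie from those of genuine material. All names, dimensions, and the scoring rule are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch of the intra-sample distribution idea (not the authors' model):
# per-pixel spectra are embedded into a low-dimensional space, and a sample is
# scored by how far its pixel distribution deviates from genuine-material statistics.
import numpy as np

def embed_pixels(hsi, components):
    """Project each pixel's spectrum (H, W, B) onto 'components' (B, d)."""
    h, w, b = hsi.shape
    return hsi.reshape(-1, b) @ components          # (H*W, d)

def spoof_score(hsi, components, genuine_mean, genuine_cov_inv):
    """Mahalanobis distance of the sample's mean embedding from genuine statistics."""
    feats = embed_pixels(hsi, components)
    mu = feats.mean(axis=0)
    diff = mu - genuine_mean
    return float(diff @ genuine_cov_inv @ diff)

# Toy usage with random data (B=31 spectral bands, d=3 components).
rng = np.random.default_rng(0)
components = rng.standard_normal((31, 3))
genuine_mean = np.zeros(3)
genuine_cov_inv = np.eye(3)
sample = rng.standard_normal((64, 64, 31))
print(spoof_score(sample, components, genuine_mean, genuine_cov_inv))
```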
{"title":"A Hyperspectral Approach For Unsupervised Spoof Detection With Intra-Sample Distribution","authors":"Tomoya Kaichi, Yuko Ozasa","doi":"10.1109/ICIP42928.2021.9506625","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506625","url":null,"abstract":"Despite the high recognition accuracy of recent deep neural networks, they can be easily deceived by spoofing. Spoofs (e.g., a printed photograph) visually resemble the actual objects quite closely. Thus, we propose a method for spoof detection with a hyperspectral image (HSI) that can effectively detect differences in surface materials. In contrast to existing anti-spoofing approaches, the proposed method learns the feature representation for spoof detection without spoof supervision. The informative pixels on an HSI are embedded onto the feature space, and we identify the spoof from their distribution. As this is the first attempt at unsupervised spoof detection with an HSI, a new dataset that includes spoofs, named Hyperspectral Spoof Dataset (HSSD), has been developed. The experimental results indicate that the proposed method performs significantly better than the baselines. The source code and the dataset are available on Github1.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"54 4 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129988139","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Image-Level Iris Morph Attack
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506802
Renu Sharma, A. Ross
We investigate the problem of morph attacks in the context of iris biometrics. A morph attack entails the generation of an image that embodies two different identities. This is accomplished by combining, i.e., morphing, two biometric samples pertaining to two different identities. While such attacks are increasingly studied in the context of face recognition, they have not been widely analyzed in iris recognition. In this work, we perform iris morphing at the image level and generate morphed iris images using two available datasets (IITD and WVU multi-modal). We demonstrate the vulnerability of three different iris recognition methods to morph attacks, with a success rate of over 90% at a false match rate of 0.01%. We also analyze the textural similarity required between the component images to create a successful morphed image. Finally, we provide preliminary results on the detection of morphed iris images.
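The core of an image-level morph can be illustrated with simple alpha blending of two aligned iris images; the study's actual alignment and blending pipeline may differ, and the function below is only a hypothetical sketch.

```python
# Simplified image-level morph: pixel-wise alpha blending of two aligned iris
# images. The study's alignment and blending details may differ; this only
# illustrates the attack's basic form.
import numpy as np

def morph_images(iris_a, iris_b, alpha=0.5):
    """Blend two same-sized grayscale iris images (float arrays in [0, 1])."""
    assert iris_a.shape == iris_b.shape
    return alpha * iris_a + (1.0 - alpha) * iris_b

rng = np.random.default_rng(1)
a = rng.random((64, 512))   # e.g., normalized iris strips of equal size
b = rng.random((64, 512))
morph = morph_images(a, b, alpha=0.5)
print(morph.shape, morph.min() >= 0.0, morph.max() <= 1.0)
```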
{"title":"Image-Level Iris Morph Attack","authors":"Renu Sharma, A. Ross","doi":"10.1109/ICIP42928.2021.9506802","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506802","url":null,"abstract":"We investigate the problem of morph attacks in the context of iris biometrics. A morph attack entails the generation of an image that embodies two different identities. This is accomplished by combining, i.e., morphing, two biometric samples pertaining to two different identities. While such an attack is being increasingly studied in the context of face recognition, it has not been widely analyzed in iris recognition. In this work, we perform iris morphing at the image-level and generate morphed iris images using two available datasets (IITD and WVU multi-modal). We demonstrate the vulnerability of three different iris recognition methods to morph attacks with a success rate of over 90% at a false match rate of 0.01%. We also analyze the textural similarity required between the component images to create a successful morphed image. Finally, we provide preliminary results on the detection of morphed iris images.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"42 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133894621","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploiting Facial Symmetry to Expose Deepfakes
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506272
Gen Li, Yun Cao, Xianfeng Zhao
In this paper, we introduce a new approach to detecting synthetic portrait images and videos. Motivated by the observation that the symmetry of a synthetic facial area is easily broken, this approach aims to reveal tampering traces through features learned from symmetrical facial regions. To do so, a two-stream learning framework is designed that uses a hard-parameter-sharing Deep Residual Network as the backbone. The feature extractor maps a pair of symmetrical face patches to an angular distance indicating the difference between their symmetry features. Extensive experiments are carried out to test the effectiveness in detecting synthetic portrait images and videos, and the results show that our approach is effective even on heterogeneous data and re-compressed data that were not used to train the detection model.
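A minimal sketch of the shared-backbone, two-stream idea is given below: the same network embeds a face patch and its horizontal mirror, and the angular distance between the two embeddings serves as a symmetry feature. The resnet18 backbone, the input size, and the absence of any training objective are assumptions for illustration only.

```python
# Sketch of a two-stream setup with a shared ("hard-sharing") backbone: the same
# network embeds a face patch and its mirrored counterpart, and their angular
# distance is used as a symmetry feature. resnet18 is a stand-in; the paper's
# backbone, patch extraction and training objective are not reproduced.
import torch
import torch.nn.functional as F
from torchvision.models import resnet18

backbone = resnet18(weights=None)
backbone.fc = torch.nn.Identity()    # keep the 512-d embedding
backbone.eval()

def angular_distance(patch):
    """patch: (N, 3, H, W); returns angle between patch and its horizontal mirror."""
    mirrored = torch.flip(patch, dims=[3])
    f1 = F.normalize(backbone(patch), dim=1)
    f2 = F.normalize(backbone(mirrored), dim=1)
    cos = (f1 * f2).sum(dim=1).clamp(-1.0, 1.0)
    return torch.acos(cos)            # small angle -> symmetric (likely real)

x = torch.randn(2, 3, 224, 224)
print(angular_distance(x))
```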
{"title":"Exploiting Facial Symmetry to Expose Deepfakes","authors":"Gen Li, Yun Cao, Xianfeng Zhao","doi":"10.1109/ICIP42928.2021.9506272","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506272","url":null,"abstract":"In this paper, we introduce a new approach to detect synthetic portrait images and videos. Motivated by the observation that the symmetry of synthetic facial area would be easily broken, this approach aims to reveal the tampering trace by features learned from symmetrical facial regions. To do so, a two-stream learning framework is designed which uses a hard sharing Deep Residual Networks as the backbone network. The feature extractor maps the pair of symmetrical face patches to an angular distance indicating the difference of symmetry features. Extensive experiments are carried out to test the effectiveness in detecting synthetic portrait images and videos, and corresponding results show that our approach is effective even on heterogeneous data and re-compression data that were not used to train the detection model.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"20 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131797188","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Zoomable Intra Prediction for Multi-Focus Plenoptic 2.0 Video Coding
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506363
Fan Jiang, Xin Jin, Tingting Zhong
Plenoptic 2.0 videos, which record time-varying light fields with focused plenoptic cameras, are promising for immersive visual applications because they capture densely sampled light fields with high spatial resolution in the rendered sub-apertures. In this paper, an intra prediction method is proposed for compressing multi-focus plenoptic 2.0 videos efficiently. Based on an analysis of the imaging principle of multi-focus plenoptic cameras, zooming relationships among the microimages are discovered and exploited by the proposed method. Positions of the prediction candidates and the zooming factors are derived, after which block zooming and tailoring are proposed to generate novel prediction candidates for weighted prediction. Experimental results demonstrate the superior performance of the proposed method relative to HEVC and state-of-the-art methods.
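The block zooming and tailoring step can be sketched as rescaling a previously coded microimage block by a zoom factor, cropping it back to the block size, and combining several such candidates by weighted prediction. The zoom factors, candidate choices, and weights below are illustrative, not the values derived by the proposed codec.

```python
# Sketch of the zoom-and-tailor idea: a reference microimage block is rescaled by a
# zoom factor, center-cropped ("tailored") back to the block size, and combined with
# other candidates by weighted prediction. Factors and weights are illustrative.
import numpy as np
from scipy.ndimage import zoom

def zoom_and_tailor(ref_block, factor):
    """Rescale a 2-D reference block (factor > 1) and center-crop to its original size."""
    scaled = zoom(ref_block, factor, order=1)
    h, w = ref_block.shape
    top = max((scaled.shape[0] - h) // 2, 0)
    left = max((scaled.shape[1] - w) // 2, 0)
    return scaled[top:top + h, left:left + w]

def weighted_prediction(candidates, weights):
    weights = np.asarray(weights, dtype=float)
    weights /= weights.sum()
    return sum(w * c for w, c in zip(weights, candidates))

rng = np.random.default_rng(2)
ref_a, ref_b = rng.random((16, 16)), rng.random((16, 16))
pred = weighted_prediction([zoom_and_tailor(ref_a, 1.25),
                            zoom_and_tailor(ref_b, 1.10)], [0.6, 0.4])
print(pred.shape)
```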
{"title":"Zoomable Intra Prediction for Multi-Focus Plenoptic 2.0 Video Coding","authors":"Fan Jiang, Xin Jin, Tingting Zhong","doi":"10.1109/ICIP42928.2021.9506363","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506363","url":null,"abstract":"Plenoptic 2.0 videos that record time-varying light fields by focused plenoptic cameras are promising to immersive visual applications because of capturing dense sampled light fields with high spatial resolution in the rendered sub-apertures. In this paper, an intra prediction method is proposed for compressing multi-focus plenoptic 2.0 videos efficiently. Based on the imaging principle analysis of multi-focus plenoptic cameras, zooming relationships among the microimages are discovered and exploited by the proposed method. Positions of the prediction candidates and the zooming factors are derived, after which block zooming and tailoring are proposed to generate novel prediction candidates for weighted prediction. Experimental results demonstrated the superior performance of the proposed method relative to HEVC and state-of-the-art methods.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"24 5 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130764535","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cryo-Electron Microscopy Image Denoising Using Multi-Frequency Vector Diffusion Maps
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506435
Yifeng Fan, Zhizhen Zhao
Cryo-electron microscopy (cryo-EM) single particle reconstruction is a general technique for 3D structure determination of macromolecules. However, because the images are acquired at low electron dose, individual particles are extremely hard to visualize owing to their low contrast and high noise level. In this paper, we propose a novel framework for cryo-EM single particle image denoising, which incorporates the recently developed multi-frequency vector diffusion maps [1] to improve the identification and alignment of images with similar viewing directions. In addition, we propose a novel filtering scheme combining graph signal processing and a truncated Fourier-Bessel expansion of the projection images. On both simulated and publicly available real data, we demonstrate that our proposed method is efficient and robust to noise compared with state-of-the-art cryo-EM 2D class averaging algorithms.
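As a drastically simplified stand-in for the class-averaging step, the sketch below denoises a projection by brute-force in-plane alignment and averaging of its most correlated neighbors; the paper instead relies on multi-frequency vector diffusion maps and a graph/Fourier-Bessel filter, neither of which is reproduced here.

```python
# Simplified align-and-average sketch (not the paper's MFVDM-based method): each
# noisy projection is denoised by averaging its most correlated neighbors after
# brute-force in-plane rotational alignment.
import numpy as np
from scipy.ndimage import rotate

def align_to(reference, image, angles=range(0, 360, 10)):
    """Return the in-plane rotation of 'image' best correlated with 'reference'."""
    best, best_score = image, -np.inf
    for a in angles:
        cand = rotate(image, a, reshape=False, order=1)
        score = float((cand * reference).sum())
        if score > best_score:
            best, best_score = cand, score
    return best

def class_average(images, index, n_neighbors=2):
    """Average image 'index' with its most correlated neighbors after alignment."""
    ref = images[index]
    scores = [float((im * ref).sum()) for im in images]
    order = np.argsort(scores)[::-1]
    neighbors = [i for i in order if i != index][:n_neighbors]
    aligned = [ref] + [align_to(ref, images[i]) for i in neighbors]
    return np.mean(aligned, axis=0)

rng = np.random.default_rng(3)
stack = rng.standard_normal((8, 32, 32))   # toy stack of noisy projections
print(class_average(stack, 0).shape)
```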
{"title":"Cryo-Electron Microscopy Image Denoising Using Multi-Frequency Vector Diffusion Maps","authors":"Yifeng Fan, Zhizhen Zhao","doi":"10.1109/ICIP42928.2021.9506435","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506435","url":null,"abstract":"Cryo-electron microscopy (EM) single particle reconstruction is a general technique for 3D structure determination of macromolecules. However, because the images are taken at low electron dose, it is extremely hard to visualize the individual particle with low contrast and high noise level. In this paper, we propose a novel framework for cryo-EM single particle image denoising, which incorporates the recently developed multi-frequency vector diffusion maps [1] for improving the identification and alignment of images with similar viewing directions. In addition, we propose a novel filtering scheme combining graph signal processing and truncated Fourier-Bessel expansion of the projection images. Through both simulated and publicly available real data, we demonstrate that our proposed method is efficient and robust to noise compared with the state-of-the-art cryo-EM 2D class averaging algorithms.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"66 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133590024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Enhanced Back Projection Network Based Stereo Image Super-Resolution Considering Parallax Attention
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506412
Li Ma, Sumei Li
Recent years have witnessed great advances in stereo image super-resolution (SR). However, existing methods consider only horizontal parallax when capturing the stereo correspondence, which is insufficient because vertical parallax inevitably exists in stereo image pairs. To address this problem, we propose an enhanced back projection stereo SR network (EBPSSRnet) to make full use of the complementary information in stereo images for more accurate SR results. Specifically, we propose a relaxed parallax attention module (rePAM) to handle stereo images with both vertical and horizontal parallax. Then, an enhanced back projection block (EBPB) is developed to extract discriminative features for capturing the stereo correspondence and to consolidate the best representation for reconstruction. Extensive experiments show that the proposed method achieves state-of-the-art performance on the Flickr1024, Middlebury, KITTI2012 and KITTI2015 datasets.
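A hedged sketch of what a "relaxed" parallax attention might look like: standard parallax attention matches left and right features only within the same row, and the relaxation below additionally searches over a few small vertical offsets. The offsets, scaling, and missing 1x1 projections are assumptions; this is not the paper's rePAM module.

```python
# Relaxed parallax attention sketch: attention over horizontal positions within each
# row, computed for several small vertical shifts of the right view so slight
# vertical parallax can also be matched. Illustrative only.
import torch
import torch.nn.functional as F

def relaxed_parallax_attention(feat_left, feat_right, vertical_offsets=(-1, 0, 1)):
    """feat_left, feat_right: (B, C, H, W). Returns right features warped onto the left view."""
    b, c, h, w = feat_left.shape
    shifted_feats, scores = [], []
    for dy in vertical_offsets:
        shifted = torch.roll(feat_right, shifts=dy, dims=2)   # small vertical shift (wraps at borders)
        # row-wise attention over horizontal positions: (B, H, W_left, W_right)
        scores.append(torch.einsum('bchw,bchv->bhwv', feat_left, shifted) / c ** 0.5)
        shifted_feats.append(shifted)
    attn = torch.stack(scores, dim=3)                          # (B, H, W, D, W)
    attn = F.softmax(attn.reshape(b, h, w, -1), dim=-1)        # joint softmax over (offset, position)
    attn = attn.reshape(b, h, w, len(vertical_offsets), w)
    warped = torch.zeros_like(feat_left)
    for i, shifted in enumerate(shifted_feats):
        warped = warped + torch.einsum('bhwv,bchv->bchw', attn[..., i, :], shifted)
    return warped

left, right = torch.randn(1, 16, 8, 8), torch.randn(1, 16, 8, 8)
print(relaxed_parallax_attention(left, right).shape)   # torch.Size([1, 16, 8, 8])
```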
{"title":"Enhanced Back Projection Network Based Stereo Image Super-Resolution Considering Parallax Attention","authors":"Li Ma, Sumei Li","doi":"10.1109/ICIP42928.2021.9506412","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506412","url":null,"abstract":"Recent years have witnessed great advances in stereo image super-resolution (SR). However, the existing methods only consider the horizontal parallax when capturing the stereo correspondence, which is insufficient because the vertical parallax inevitably exists in stereo image pairs. To address this problem, we propose an enhanced back projection stereo SR network (EBPSSRnet) to make full use of the complementary information in stereo images for more accurate SR results. Specifically, we propose a relaxed parallax attention module (rePAM) to handle different stereo images with vertical and horizontal parallax. Then, an enhanced back projection block (EBPB) is developed to extract discriminative features for capturing the stereo correspondence and consolidate the best representation for reconstruction. Extensive experiments show that the proposed method achieves state-of-the-art performance on the Flickr1024, Middlebury, KITTI2012 and KITTI2015 datasets.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"106 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133461277","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Action Quality Assessment With Ignoring Scene Context
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506257
Takasuke Nagai, Shoichiro Takeda, Masaaki Matsumura, S. Shimizu, Susumu Yamamoto
We propose an action quality assessment (AQA) method that can assess the quality of a target action while ignoring scene context, a feature unrelated to the target action. Existing AQA methods try to extract spatiotemporal features related to the target action by applying 3D convolution to the video. However, since their models are not explicitly designed to extract features of the target action, they mis-extract scene context and thus cannot assess the target action quality correctly. To overcome this problem, we impose two losses on an existing AQA model: a scene adversarial loss and our newly proposed human-masked regression loss. The scene adversarial loss encourages the model to ignore scene context through adversarial training. The human-masked regression loss does so by making the correlation between the scores output by the AQA model and the referees' scores undefinable when the target action is not visible. These two losses lead the model to assess the target action quality while ignoring scene context. We evaluated our method on a diving dataset commonly used for AQA and found that it outperforms current state-of-the-art methods. This result shows that our method is effective in ignoring scene context while assessing the target action quality.
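One possible reading of making the correlation "undefinable" is to drive the model toward a constant score on human-masked clips, since a zero-variance prediction has an undefined Pearson correlation with the referees' scores. The loss below implements that reading as a variance penalty; it is an illustrative interpretation, not the paper's exact formulation, and the weighting terms are hypothetical.

```python
# Illustrative reading (not the paper's exact formulation) of the human-masked
# regression loss: predictions on clips with the performer masked out are pushed
# toward a constant value; a constant prediction has zero variance, so its Pearson
# correlation with the referees' scores becomes undefined.
import torch

def human_masked_regression_loss(scores_masked):
    """scores_masked: (N,) model scores for clips with the performer masked out."""
    return scores_masked.var(unbiased=False)   # zero variance -> undefined correlation

def total_loss(pred, target, scores_masked, adv_loss, lam1=1.0, lam2=1.0):
    """Regression loss + scene-adversarial term + human-masked term (weights illustrative)."""
    reg = torch.nn.functional.mse_loss(pred, target)
    return reg + lam1 * adv_loss + lam2 * human_masked_regression_loss(scores_masked)

pred = torch.randn(8)
target = torch.randn(8)
masked_scores = torch.randn(8)
print(total_loss(pred, target, masked_scores, adv_loss=torch.tensor(0.3)))
```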
{"title":"Action Quality Assessment With Ignoring Scene Context","authors":"Takasuke Nagai, Shoichiro Takeda, Masaaki Matsumura, S. Shimizu, Susumu Yamamoto","doi":"10.1109/ICIP42928.2021.9506257","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506257","url":null,"abstract":"We propose an action quality assessment (AQA) method that can specifically assess target action quality with ignoring scene context, which is a feature unrelated to the target action. Existing AQA methods have tried to extract spatiotemporal features related to the target action by applying 3D convolution to the video. However, since their models are not explicitly designed to extract the features of the target action, they mis-extract scene context and thus cannot assess the target action quality correctly. To overcome this problem, we impose two losses to an existing AQA model: scene adversarial loss and our newly proposed human-masked regression loss. The scene adversarial loss encourages the model to ignore scene context by adversarial training. Our human-masked regression loss does so by making the correlation between score outputs by an AQA model and human referees undefinable when the target action is not visible. These two losses lead the model to specifically assess the target action quality with ignoring scene context. We evaluated our method on a diving dataset commonly used for AQA and found that it outperformed current state-of-the-art methods. This result shows that our method is effective in ignoring scene context while assessing the target action quality.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"90 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133647024","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Quality Assessment of Screen Content Images Based on Convolutional Neural Network with Dual Pathways
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506707
Yongli Chang, Sumei Li, Anqi Liu
To simulate how binocular vision perceives objects, a dual-pathway convolutional neural network (CNN) for quality assessment of screen content images (SCIs) is proposed. Considering the different sensitivities of retinal photoreceptor cells to RGB colors and the human visual attention mechanism, we employ a convolutional block attention module (CBAM) to weight the RGB channels and the spatial positions within each channel. In addition, 3D convolution considering inter-frame information is used to extract correlation features between the RGB channels. Moreover, because of the important role of the optic chiasm in binocular vision, we design a strategy to simulate it in the proposed network. Furthermore, since multi-scale and multi-level characteristics are indispensable to object perception in the human visual system (HVS), a new multi-scale and multi-level feature fusion (MSMLFF) module is built to obtain perceptual features at different scales and levels. Experimental results show that the proposed method is superior to several mainstream SCI metrics on publicly accessible databases.
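For reference, a minimal CBAM-style block (channel attention followed by spatial attention) is sketched below; the reduction ratio and kernel size are conventional defaults, and the paper's dual-pathway structure, 3D-convolution branch, and MSMLFF module are not reproduced.

```python
# Minimal CBAM-style block: channel attention from pooled descriptors, then spatial
# attention from channel-wise average and max maps. Sizes are illustrative.
import torch
import torch.nn as nn

class CBAM(nn.Module):
    def __init__(self, channels, reduction=4, spatial_kernel=7):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(),
            nn.Linear(channels // reduction, channels))
        self.spatial = nn.Conv2d(2, 1, spatial_kernel, padding=spatial_kernel // 2)

    def forward(self, x):
        # channel attention from global average- and max-pooled descriptors
        avg = self.mlp(x.mean(dim=(2, 3)))
        mx = self.mlp(x.amax(dim=(2, 3)))
        x = x * torch.sigmoid(avg + mx).unsqueeze(-1).unsqueeze(-1)
        # spatial attention from channel-wise average and max maps
        s = torch.cat([x.mean(dim=1, keepdim=True),
                       x.amax(dim=1, keepdim=True)], dim=1)
        return x * torch.sigmoid(self.spatial(s))

feat = torch.randn(2, 16, 32, 32)
print(CBAM(16)(feat).shape)
```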
{"title":"Quality Assessment of Screen Content Images Based on Convolutional Neural Network with Dual Pathways","authors":"Yongli Chang, Sumei Li, Anqi Liu","doi":"10.1109/ICIP42928.2021.9506707","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506707","url":null,"abstract":"To simulate the characteristics of perceiving things from binocular vision, a dual-pathway convolutional neural network (CNN) for quality assessment of screen content images (SCIs) is proposed. Considering the different sensitivity of retinal photoreceptor cells to RGB colors and the human visual attention mechanism, we employ a convolutional block attention module (CBAM) to weight the RGB channels and their spatial position on each channel. And 3D convolution considering inter-frame information is used to extract the correlation features between RGB channels. Moreover, because of the important role of optic chiasm in binocular vision, we design its simulation strategy in the proposed network. Furthermore, since the characteristics of multi-scale and multi-level are indispensable to perception of any objects in human visual system (HVS), a new multi-scale and multi-level feature fusion (MSMLFF) module is built to obtain perceptual features of different scales and levels. Experimental results show that the proposed method is superior to several mainstream SCIs metrics on publicly accessible databases.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"31 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132756992","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Robustness of Time-Resolved Measurement to Unknown and Variable Beam Current in Particle Beam Microscopy
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506340
Luisa Watkins, Sheila W. Seidel, Minxu Peng, Akshay Agarwal, Christopher C. Yu, V. Goyal
Variations in the intensity of the incident beam can cause significant inaccuracies in microscopes that use focused beams of electrons or ions. Existing mitigation methods depend on the artifacts having characteristic spatial structures explained by the raster scan pattern and the temporal correlation of the beam current variations. We show that recently introduced time-resolved measurement methods provide robustness to beam current variations that improves significantly upon existing methods while not depending on the separability of artifact structure from the underlying image content. These advantages are illustrated through Monte Carlo simulations representative of both helium ion microscopy (higher secondary electron yield) and scanning electron microscopy (lower secondary electron yield). Notably, this demonstrates that when the beam current variation is appreciable, time-resolved measurements provide a novel benefit in particle beam microscopy that extends to low secondary electron yields.
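A toy Monte Carlo can illustrate the underlying problem: the conventional secondary-electron (SE) estimate normalizes total SE counts by the nominal dose, so an unknown drift in beam current shows up directly in the per-pixel estimates. The paper's time-resolved estimators are not reproduced below; the drift model and all numbers are illustrative.

```python
# Toy Monte Carlo of why unknown beam current variation biases conventional imaging:
# the estimate divides total SE counts by the *nominal* dose, so current drift
# appears as structured per-pixel error. Numbers are illustrative.
import numpy as np

rng = np.random.default_rng(4)
eta = 3.0                 # true mean SE yield per incident particle (HIM-like)
nominal_dose = 50.0       # expected ions per pixel at nominal beam current
n_pixels = 10_000

# Beam current drifts slowly, here up to +/-30% around nominal.
current_factor = 1.0 + 0.3 * np.sin(np.linspace(0, 4 * np.pi, n_pixels))
ions = rng.poisson(nominal_dose * current_factor)          # actual ions per pixel
se_counts = rng.poisson(eta * ions)                        # SE detections per pixel

eta_conventional = se_counts / nominal_dose                # assumes nominal dose
print("true eta:", eta)
print("conventional estimate mean/std:", eta_conventional.mean(), eta_conventional.std())
# The mean is close to eta only because the drift averages out; per-pixel estimates
# inherit the current variation as structured artifacts in the image.
```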
{"title":"Robustness of Time-Resolved Measurement to Unknown and Variable Beam Current in Particle Beam Microscopy","authors":"Luisa Watkins, Sheila W. Seidel, Minxu Peng, Akshay Agarwal, Christopher C. Yu, V. Goyal","doi":"10.1109/ICIP42928.2021.9506340","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506340","url":null,"abstract":"Variations in the intensity of the incident beam can cause significant inaccuracies in microscopes that use focused beams of electrons or ions. Existing mitigation methods depend on the artifacts having characteristic spatial structures explained by the raster scan pattern and temporal correlation of the beam current variations. We show that recently introduced time-resolved measurement methods create robustness to beam current variations that improve significantly upon existing methods while not depending on separability of artifact structure from underlying image content. These advantages are illustrated through Monte Carlo simulations representative of both helium ion microscopy (higher secondary electron yield) and scanning electron microscopy (lower secondary electron yield). Notably, this demonstrates that when the beam current variation is appreciable, time-resolved measurements provide a novel benefit in particle beam microscopy that extends to low secondary electron yields.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"133176733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Inter-Modality Fusion Based Attention for Zero-Shot Cross-Modal Retrieval
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506182
Bela Chakraborty, Peng Wang, Lei Wang
Zero-shot cross-modal retrieval (ZS-CMR) performs cross-modal retrieval where the test categories have a different scope than the training categories. It borrows its intuition from zero-shot learning, which aims to transfer knowledge inferred for seen classes during training to unseen classes at test time. This mimics the real-world scenario in which new object categories are continuously added to the multimedia data corpus. Unlike existing ZS-CMR approaches, which use generative adversarial networks (GANs) to generate more data, we propose Inter-Modality Fusion based Attention (IMFA) and a framework ZS_INN_FUSE (Zero-Shot cross-modal retrieval using INNer product with image-text FUSEd). It exploits the rich semantics of textual data as guidance to infer additional knowledge during the training phase. This is achieved by generating attention weights through the fusion of the image and text modalities to focus on the important regions in an image. We carefully create a zero-shot split based on the large-scale MS-COCO and Flickr30k datasets for our experiments. The results show that our method improves over the ZS-CMR baseline and a self-attention mechanism, demonstrating the effectiveness of inter-modality fusion in a zero-shot scenario.
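A rough sketch of inter-modality fusion based attention: the sentence embedding is fused with each image-region feature to produce per-region attention weights, and the attended image representation is compared to the text by inner product. The dimensions, layers, and projection below are assumptions, not the exact ZS_INN_FUSE architecture.

```python
# Fusion-based attention sketch: text guides attention over image regions, and the
# attended image embedding is scored against the text by inner product. Illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class FusionAttention(nn.Module):
    def __init__(self, region_dim=512, text_dim=300, hidden=256):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(region_dim + text_dim, hidden), nn.Tanh(),
            nn.Linear(hidden, 1))
        self.proj_img = nn.Linear(region_dim, text_dim)

    def forward(self, regions, text):
        # regions: (B, R, region_dim), text: (B, text_dim)
        fused = torch.cat([regions,
                           text.unsqueeze(1).expand(-1, regions.size(1), -1)], dim=-1)
        weights = F.softmax(self.score(fused).squeeze(-1), dim=1)     # (B, R)
        attended = (weights.unsqueeze(-1) * regions).sum(dim=1)       # (B, region_dim)
        return (self.proj_img(attended) * text).sum(dim=1)            # inner-product score

model = FusionAttention()
print(model(torch.randn(2, 36, 512), torch.randn(2, 300)).shape)
```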
{"title":"Inter-Modality Fusion Based Attention for Zero-Shot Cross-Modal Retrieval","authors":"Bela Chakraborty, Peng Wang, Lei Wang","doi":"10.1109/ICIP42928.2021.9506182","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506182","url":null,"abstract":"Zero-shot cross-modal retrieval (ZS-CMR) performs the task of cross-modal retrieval where the classes of test categories have a different scope than the training categories. It borrows the intuition from zero-shot learning which targets to transfer the knowledge inferred during the training phase for seen classes to the testing phase for unseen classes. It mimics the real-world scenario where new object categories are continuously populating the multi-media data corpus. Unlike existing ZS-CMR approaches which use generative adversarial networks (GANs) to generate more data, we propose Inter-Modality Fusion based Attention (IMFA) and a framework ZS_INN_FUSE (Zero-Shot cross-modal retrieval using INNer product with image-text FUSEd). It exploits the rich semantics of textual data as guidance to infer additional knowledge during the training phase. This is achieved by generating attention weights through the fusion of image and text modalities to focus on the important regions in an image. We carefully create a zero-shot split based on the large-scale MS-COCO and Flickr30k datasets to perform experiments. The results show that our method achieves improvement over the ZS-CMR baseline and self-attention mechanism, demonstrating the effectiveness of inter-modality fusion in a zero-shot scenario.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"17 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"131274902","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}