Joint Anomaly Detection and Inpainting for Microscopy Images Via Deep Self-Supervised Learning
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506454
Ling Huang, Deruo Cheng, Xulei Yang, Tong Lin, Yiqiong Shi, Kaiyi Yang, B. Gwee, B. Wen
While microscopy enables material scientists to view and analyze microstructures, the imaging results often include defects and anomalies with varied shapes and locations. The presence of such anomalies significantly degrades the quality of microscopy images and the subsequent analytical tasks. Compared to classic feature-based methods, recent advancements in deep learning provide a more efficient, accurate, and scalable approach to detecting and removing anomalies in microscopy images. However, most deep inpainting and anomaly detection schemes require a certain level of supervision, i.e., either annotation of the anomalies or a corpus of purely normal data, which is limited in practice for supervision-starving microscopy applications. In this work, we propose a self-supervised deep learning scheme for joint anomaly detection and inpainting of microscopy images. The proposed anomaly detection model can be trained over a mixture of normal and abnormal microscopy images without any labeling. Instead of a two-stage scheme, our multi-task model simultaneously detects abnormal regions and removes the defects via joint training. To benchmark this microscopy application under a real-world setup, we propose a novel dataset of real microscopic images of integrated circuits, dubbed MIIC. The proposed dataset contains tens of thousands of normal microscopic images, of which we labeled hundreds containing various imaging and manufacturing anomalies and defects for testing. Experiments show that the proposed model outperforms various popular and state-of-the-art competing methods for both microscopy image anomaly detection and inpainting.
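To make the multi-task idea concrete, below is a minimal PyTorch sketch of such a joint objective: synthetic corruptions provide self-supervision, one head predicts the anomaly mask and another inpaints the masked region. The architecture, corruption scheme, and loss weighting are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch (not the authors' code): a shared encoder with two heads, one for
# the anomaly mask and one for inpainting, trained with a combined self-supervised loss.
import torch
import torch.nn as nn
import torch.nn.functional as F

class JointDetectInpaint(nn.Module):
    def __init__(self, ch=32):
        super().__init__()
        self.enc = nn.Sequential(
            nn.Conv2d(1, ch, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.ReLU(),
        )
        self.mask_head = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )
        self.inpaint_head = nn.Sequential(
            nn.ConvTranspose2d(ch * 2, ch, 4, stride=2, padding=1), nn.ReLU(),
            nn.ConvTranspose2d(ch, 1, 4, stride=2, padding=1),
        )

    def forward(self, x):
        z = self.enc(x)
        return torch.sigmoid(self.mask_head(z)), self.inpaint_head(z)

def self_supervised_loss(model, batch, lam=1.0, patch=16):
    """Corrupt grayscale images (B,1,H,W with H,W divisible by 4) using random square
    'defects', then train the model to both localize and restore the masked pixels."""
    b, _, h, w = batch.shape
    mask = torch.zeros_like(batch[:, :1])
    ys = torch.randint(0, h - patch, (b,))
    xs = torch.randint(0, w - patch, (b,))
    for i in range(b):
        y, x = int(ys[i]), int(xs[i])
        mask[i, :, y:y + patch, x:x + patch] = 1.0
    corrupted = batch * (1 - mask)
    pred_mask, pred_img = model(corrupted)
    det_loss = F.binary_cross_entropy(pred_mask, mask)    # anomaly-detection task
    inp_loss = F.l1_loss(pred_img * mask, batch * mask)   # inpainting task
    return det_loss + lam * inp_loss
```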
{"title":"Joint Anomaly Detection and Inpainting for Microscopy Images Via Deep Self-Supervised Learning","authors":"Ling Huang, Deruo Cheng, Xulei Yang, Tong Lin, Yiqiong Shi, Kaiyi Yang, B. Gwee, B. Wen","doi":"10.1109/ICIP42928.2021.9506454","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506454","url":null,"abstract":"While microscopy enables material scientists to view and analyze microstructures, the imaging results often include defects and anomalies with varied shapes and locations. The presence of such anomalies significantly degrades the quality of microscopy images and the subsequent analytical tasks. Comparing to classic feature-based methods, recent advancements in deep learning provide a more efficient, accurate, and scalable approach to detect and remove anomalies in microscopy images. However, most of the deep inpainting and anomaly detection schemes require a certain level of supervision, i.e., either annotation of the anomalies, or a corpus of purely normal data, which are limited in practice for supervision-starving microscopy applications. In this work, we propose a self-supervised deep learning scheme for joint anomaly detection and inpainting of microscopy images. The proposed anomaly detection model can be trained over a mixture of normal and abnormal microscopy images without any labeling. Instead of a two-stage scheme, our multi-task model can simultaneously detect abnormal regions and remove the defects via jointly training. To benchmark such microscopy application under the real-world setup, we propose a novel dataset of real microscopic images of integrated circuits, dubbed MIIC. The proposed dataset contains tens of thousands of normal microscopic images, while we labeled hundreds of them containing various imaging and manufacturing anomalies and defects for testing. Experiments show that the proposed model outperforms various popular or state-of-the-art competing methods for both microscopy image anomaly detection and inpainting.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"83 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128614574","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
3d Point Cloud Completion Using Stacked Auto-Encoder For Structure Preservation
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506398
S. Kumari, S. Raman
The 3D point cloud completion problem deals with completing a shape from partial points. The problem finds use in many vision-related applications. Here, structure plays an important role. Most existing approaches either do not consider structural information or consider structure at the decoder only. For maintaining the structure, it is also necessary to maintain the positions of the available 3D points. However, most approaches lack the aspect of maintaining the available structural positions. In this paper, we propose to employ a stacked auto-encoder in conjunction with a shared Multi-Layer Perceptron (MLP). The MLP converts each 3D point into a feature vector, and the stacked auto-encoder helps in maintaining the available structural positions of the input points. Further, it explores the redundancy present in the feature vector. It aids in incorporating coarse-to-fine scale information, which further helps in better shape representation. The embedded feature is finally decoded by a structure-preserving decoder. Both the encoding and decoding operations of our method take care of preserving the structure of the available shape information. The experimental results demonstrate the structure-preserving capability of our network compared to state-of-the-art methods.
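A rough sketch of the described pipeline is given below, assuming a PointNet-style shared MLP (a kernel-size-1 Conv1d applied to every point) followed by a fully connected stacked auto-encoder over the pooled global feature; all layer sizes and the decoder are hypothetical, not the paper's architecture.

```python
# Illustrative sketch under assumed layer sizes: shared MLP -> global feature ->
# stacked auto-encoder -> MLP decoder that predicts the completed point set.
import torch
import torch.nn as nn

class SharedMLPEncoder(nn.Module):
    def __init__(self, feat_dim=256):
        super().__init__()
        # Conv1d with kernel size 1 == an MLP shared across all points
        self.mlp = nn.Sequential(
            nn.Conv1d(3, 64, 1), nn.ReLU(),
            nn.Conv1d(64, 128, 1), nn.ReLU(),
            nn.Conv1d(128, feat_dim, 1),
        )

    def forward(self, pts):                 # pts: (B, 3, N)
        f = self.mlp(pts)                   # (B, feat_dim, N) per-point features
        return torch.max(f, dim=2).values   # (B, feat_dim) global feature

class StackedAutoEncoder(nn.Module):
    """Two stacked encode/decode stages over the global feature."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.enc1 = nn.Linear(feat_dim, 128)
        self.enc2 = nn.Linear(128, 64)
        self.dec2 = nn.Linear(64, 128)
        self.dec1 = nn.Linear(128, feat_dim)

    def forward(self, x):
        h = torch.relu(self.enc2(torch.relu(self.enc1(x))))
        return self.dec1(torch.relu(self.dec2(h)))

class CompletionNet(nn.Module):
    def __init__(self, n_out=2048, feat_dim=256):
        super().__init__()
        self.encoder = SharedMLPEncoder(feat_dim)
        self.sae = StackedAutoEncoder(feat_dim)
        self.decoder = nn.Linear(feat_dim, n_out * 3)
        self.n_out = n_out

    def forward(self, pts):
        feat = self.sae(self.encoder(pts))
        return self.decoder(feat).view(-1, self.n_out, 3)  # completed point cloud
```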
{"title":"3d Point Cloud Completion Using Stacked Auto-Encoder For Structure Preservation","authors":"S. Kumari, S. Raman","doi":"10.1109/ICIP42928.2021.9506398","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506398","url":null,"abstract":"3D point cloud completion problem deals with completing the shape from partial points. The problem finds its application in many vision-related applications. Here, structure plays an important role. Most of the existing approaches either do not consider structural information or consider structure at the decoder only. For maintaining the structure, it is also necessary to maintain the position of the available 3D points. However, most of the approaches lack the aspect of maintaining the available structural position. In this paper, we propose to employ stacked auto-encoder in conjunction a with shared Multi-Layer Perceptron (MLP). MLP converts each 3D point into a feature vector and the stacked auto-encoder helps in maintaining the available structural position of the input points. Further, it explores the redundancy present in the feature vector. It aids to incorporate coarse to fine scale information that further helps in better shape representation. The embedded feature is finally decoded by a structural preserving decoder. Both the encoding and the decoding operations of our method take care of preserving the structure of the available shape information. The experimental results demonstrate the structure preserving capability of our network as compared to the state-of-the-art methods.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"6 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"128740884","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Hierarchical and Multi-Level Cost Aggregation For Stereo Matching
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506215
Wei Guo, Ziyu Zhu, F. Xia, Jiarui Sun, Yong Zhao
Nowadays, convolutional neural networks based on deep learning have greatly improved the performance of stereo matching. To obtain higher disparity estimation accuracy in ill-posed regions, this paper proposes a hierarchical and multi-level model based on a novel cost aggregation module (HMLNet). This effective cost aggregation consists of two main modules: one is the multi-level cost aggregation, which incorporates global context information by fusing information across different levels; the other, called the hourglass+ module, fully utilizes volumes at the same level to better regularize the cost volumes. We also take advantage of disparity refinement with residual learning to boost robustness in challenging situations. We conducted comprehensive experiments on the SceneFlow, KITTI 2012, and KITTI 2015 datasets. The competitive results show that our approach outperforms many other stereo matching algorithms.
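For readers unfamiliar with cost volumes, the sketch below shows the basic operations that such aggregation modules refine: building a concatenation cost volume from left/right feature maps and regularizing it with 3D convolutions followed by a soft-argmin. It is a generic GC-Net-style baseline under assumed shapes, not the HMLNet code.

```python
# Generic cost-volume construction and 3D-conv regularization (illustrative only).
import torch
import torch.nn as nn

def build_cost_volume(feat_l, feat_r, max_disp):
    # feat_l, feat_r: (B, C, H, W) downsampled features; max_disp in feature pixels
    b, c, h, w = feat_l.shape
    cost = feat_l.new_zeros(b, 2 * c, max_disp, h, w)
    for d in range(max_disp):
        if d == 0:
            cost[:, :c, d] = feat_l
            cost[:, c:, d] = feat_r
        else:
            cost[:, :c, d, :, d:] = feat_l[:, :, :, d:]
            cost[:, c:, d, :, d:] = feat_r[:, :, :, :-d]
    return cost  # (B, 2C, D, H, W)

class Aggregate3D(nn.Module):
    def __init__(self, in_ch, ch=32):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv3d(in_ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, ch, 3, padding=1), nn.ReLU(),
            nn.Conv3d(ch, 1, 3, padding=1),
        )

    def forward(self, cost):
        # (B, 2C, D, H, W) -> regularized cost (B, D, H, W) -> soft-argmin disparity map
        out = self.net(cost).squeeze(1)
        prob = torch.softmax(-out, dim=1)
        disp = torch.arange(out.shape[1], device=out.device, dtype=prob.dtype)
        return torch.sum(prob * disp.view(1, -1, 1, 1), dim=1)
```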
{"title":"Hierarchical and Multi-Level Cost Aggregation For Stereo Matching","authors":"Wei Guo, Ziyu Zhu, F. Xia, Jiarui Sun, Yong Zhao","doi":"10.1109/ICIP42928.2021.9506215","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506215","url":null,"abstract":"Nowadays, convolutional neural networks based on deep learning have greatly improved the performance of stereo matching. To obtain higher disparity estimation accuracy in ill-posed regions, this paper proposes a hierarchical and multi-level model based on a novel cost aggregation module (HMLNet). This effective cost aggregation consists of two main modules: one is the multi-level cost aggregation which incorporates global context information by fusing information in different levels, and the other called the hourglass+ module utilizes sufficiently volumes in the same level to regularize cost volumes better. Also, we take advantage of disparity refinement with residual learning to boost robustness to challenging situations. We conducted comprehensive experiments on Sceneflow, KITTI 2012, and KITTI 2015 datasets. The competitive results prove that our approach outperforms many other stereo matching algorithms.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"29 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"129353319","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Part-Based Feature Squeezing To Detect Adversarial Examples in Person Re-Identification Networks
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506511
Yu Zheng, Senem Velipasalar
Although deep neural networks (DNNs) have achieved top performance in various computer vision tasks, such as object detection, image segmentation and person re-identification (ReID), they can easily be deceived by adversarial examples, which are carefully crafted images with perturbations that are imperceptible to human eyes. Such adversarial examples can significantly degrade the performance of existing DNNs. There are also targeted attacks that mislead classifiers into making specific decisions based on the attacker's intentions. In this paper, we propose a new method to effectively detect adversarial examples presented to a person ReID network. The proposed method utilizes part-based feature squeezing to detect the adversarial examples. We apply two types of squeezing to segmented body parts to better detect adversarial examples. We perform extensive experiments over three major datasets with different attacks, and compare the detection performance of the proposed body-part-based approach with a ReID method that is not part-based. Experimental results show that the proposed method can effectively detect the adversarial examples, and has the potential to avoid significant decreases in person ReID performance caused by adversarial examples.
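As background, feature squeezing in its original form compares a model's output on an input with its outputs on "squeezed" copies (reduced bit depth, spatial smoothing) and flags large disagreements. The sketch below shows that generic idea; the part-based variant, the threshold, and the embedding function are assumptions, not the paper's exact pipeline.

```python
# Generic feature-squeezing detector: squeeze the input two ways, re-run the model,
# and flag the sample when the outputs diverge too much.
import numpy as np
from scipy.ndimage import median_filter

def reduce_bit_depth(img, bits=4):
    """img in [0, 1]; quantize to 2**bits gray levels."""
    levels = 2 ** bits - 1
    return np.round(img * levels) / levels

def median_smooth(img, size=2):
    return median_filter(img, size=size)

def is_adversarial(model_fn, img, threshold=1.0):
    """model_fn maps an image to a vector (e.g. a ReID embedding or class scores)."""
    p_orig = model_fn(img)
    p_bit = model_fn(reduce_bit_depth(img))
    p_med = model_fn(median_smooth(img))
    score = max(np.abs(p_orig - p_bit).sum(), np.abs(p_orig - p_med).sum())
    return score > threshold  # large disagreement -> likely adversarial
```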
{"title":"Part-Based Feature Squeezing To Detect Adversarial Examples in Person Re-Identification Networks","authors":"Yu Zheng, Senem Velipasalar","doi":"10.1109/ICIP42928.2021.9506511","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506511","url":null,"abstract":"Although deep neural networks (DNNs) have achieved top performances in different computer vision tasks, such as object detection, image segmentation and person re-identification (ReID), they can easily be deceived by adversarial examples, which are carefully crafted images with perturbations that are imperceptible to human eyes. Such adversarial examples can significantly degrade the performance of existing DNNs. There are also targeted attacks misleading classifiers into making specific decisions based on attackers’ intentions. In this paper, we propose a new method to effectively detect adversarial examples presented to a person ReID network. The proposed method utilizes parts-based feature squeezing to detect the adversarial examples. We apply two types of squeezing to segmented body parts to better detect adversarial examples. We perform extensive experiments over three major datasets with different attacks, and compare the detection performance of the proposed body part-based approach with a ReID method that is not parts-based. Experimental results show that the proposed method can effectively detect the adversarial examples, and has the potential to avoid significant decreases in person ReID performance caused by adversarial examples.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"93 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127137160","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Convolutional Neural Networks for Omnidirectional Image Quality Assessment: Pre-Trained or Re-Trained?
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506192
Abderrezzaq Sendjasni, M. Larabi, F. A. Cheikh
The use of convolutional neural networks (CNNs) for image quality assessment (IQA) has become the focus of many researchers. Various pre-trained models are fine-tuned and used for this task. In this paper, we conduct a benchmark study of seven state-of-the-art pre-trained models for IQA of omnidirectional images. To this end, we first train these models using an omnidirectional database and compare their performance with the pre-trained versions. Then, we compare the use of viewports versus equirectangular (ERP) images as inputs to the models. Finally, for the viewport-based models, we explore the impact of the number of input viewports on the models' performance. Experimental results demonstrate the performance gain of the re-trained CNNs compared to their pre-trained versions. Moreover, the viewport-based approach outperforms the ERP-based one regardless of the number of selected views.
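A minimal fine-tuning recipe of the kind benchmarked here might look like the sketch below: an ImageNet-pre-trained backbone with its classifier replaced by a one-output quality-regression head, trained on (image, MOS) pairs. The torchvision backbone (assumes torchvision >= 0.13), optimizer, and hyperparameters are illustrative assumptions, not the study's setup.

```python
# Fine-tuning sketch: pre-trained backbone + regression head for quality scores.
import torch
import torch.nn as nn
from torchvision import models

def build_iqa_model():
    backbone = models.resnet50(weights=models.ResNet50_Weights.DEFAULT)
    backbone.fc = nn.Linear(backbone.fc.in_features, 1)  # predict one quality score
    return backbone

def fine_tune(model, loader, epochs=10, lr=1e-4):
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(epochs):
        for imgs, mos in loader:           # imgs: (B,3,H,W) viewports or ERP crops; mos: (B,)
            opt.zero_grad()
            pred = model(imgs).squeeze(1)
            loss = loss_fn(pred, mos)
            loss.backward()
            opt.step()
    return model
```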
{"title":"Convolutional Neural Networks for Omnidirectional Image Quality Assessment: Pre-Trained or Re-Trained?","authors":"Abderrezzaq Sendjasni, M. Larabi, F. A. Cheikh","doi":"10.1109/ICIP42928.2021.9506192","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506192","url":null,"abstract":"The use of convolutional neural networks (CNN) for image quality assessment (IQA) becomes many researcher’s focus. Various pre-trained models are fine-tuned and used for this task. In this paper, we conduct a benchmark study of seven state-of-the-art pre-trained models for IQA of omnidirectional images. To this end, we first train these models using an omnidirectional database and compare their performance with the pre-trained versions. Then, we compare the use of viewports versus equirectangular (ERP) images as inputs to the models. Finally, for the viewports-based models, we explore the impact of the input number of viewports on the models’ performance. Experimental results demonstrated the performance gain of the re-trained CNNs compared to their pre-trained versions. Also, the viewports-based approach outperformed the ERP-based one independently of the number of selected views.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"71 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127375168","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Of Linear Video Prediction Models In A Multi-Modal Framework For Anomaly Detection
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506049
Giulia Slavic, Abrham Shiferaw Alemaw, L. Marcenaro, C. Regazzoni
This paper proposes a method for performing future-frame prediction and anomaly detection on video data in a multi-modal framework based on Dynamic Bayesian Networks (DBNs). In particular, odometry data and video data from a moving vehicle are fused. A Markov Jump Particle Filter (MJPF) is learned on odometry data, and its features are used to aid the learning of a Kalman Variational Autoencoder (KVAE) on video data. Consequently, anomaly detection can be performed on video data using the learned model. We evaluate the proposed method using multi-modal data from a vehicle performing different tasks in a closed environment.
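To illustrate the filtering principle underlying such models, the sketch below implements one step of a plain linear Kalman filter and uses the Mahalanobis distance of the innovation as an anomaly score; the MJPF's jump dynamics and the KVAE's learned observation model are omitted, so this is a simplified illustration rather than the paper's method.

```python
# One Kalman predict/update step with an innovation-based anomaly score.
import numpy as np

def kalman_step(x, P, z, A, C, Q, R):
    """x: state (n,), P: state covariance, z: observation (m,),
    A/C: dynamics/observation matrices, Q/R: process/observation noise."""
    # Predict
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Innovation and its covariance
    innov = z - C @ x_pred
    S = C @ P_pred @ C.T + R
    score = float(innov @ np.linalg.inv(S) @ innov)  # Mahalanobis distance; high = anomalous
    # Update
    K = P_pred @ C.T @ np.linalg.inv(S)
    x_new = x_pred + K @ innov
    P_new = (np.eye(len(x)) - K @ C) @ P_pred
    return x_new, P_new, score
```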
{"title":"Learning Of Linear Video Prediction Models In A Multi-Modal Framework For Anomaly Detection","authors":"Giulia Slavic, Abrham Shiferaw Alemaw, L. Marcenaro, C. Regazzoni","doi":"10.1109/ICIP42928.2021.9506049","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506049","url":null,"abstract":"This paper proposes a method for performing future-frame prediction and anomaly detection on video data in a multi-modal framework based on Dynamic Bayesian Networks (DBNs). In particular, odometry data and video data from a moving vehicle are fused. A Markov Jump Particle Filter (MJPF) is learned on odometry data, and its features are used to aid the learning of a Kalman Variational Autoencoder (KVAE) on video data. Consequently, anomaly detection can be performed on video data using the learned model. We evaluate the proposed method using multi-modal data from a vehicle performing different tasks in a closed environment.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"27 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"127543676","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Mobile Registration Number Plate Recognition Using Artificial Intelligence
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506699
Syed Talha Abid Ali, Abdul Hakeem Usama, I. R. Khan, M. Khan, Asif Siddiq
Automatic License Plate Recognition (ALPR) has remained a persistent research topic for years due to its numerous practical applications, especially in Intelligent Transportation Systems (ITS). Many currently available solutions are still not robust in various real-world circumstances and often impose constraints such as fixed backgrounds and constant distances and camera angles. This paper presents an efficient multi-language ALPR system based on machine learning. A Convolutional Neural Network (CNN) is trained and fine-tuned for the recognition stage so that it becomes more dynamic and pliant to diverse backgrounds. For license plate (LP) detection, the newly released YOLOv5 object detection framework is used. Data augmentation techniques such as grayscale conversion and rotation are also used to generate an augmented dataset for training. The proposed methodology achieved a recognition rate of 92.2%, producing better results than the commercially available systems PlateRecognizer (67%) and OpenALPR (77%). Our experiments validate that the proposed methodology can meet the pressing requirement of real-time analysis in Intelligent Transportation Systems.
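The two augmentations mentioned above can be reproduced with a few lines of OpenCV, as sketched below; the rotation angle, border handling, and file names are illustrative choices, not values from the paper.

```python
# Grayscale and rotation augmentation for license-plate images (illustrative).
import cv2

def augment(image, angle=10):
    """Return a grayscale copy and a rotated copy of a BGR license-plate image."""
    gray = cv2.cvtColor(image, cv2.COLOR_BGR2GRAY)
    gray3 = cv2.cvtColor(gray, cv2.COLOR_GRAY2BGR)   # keep 3 channels for the detector
    h, w = image.shape[:2]
    M = cv2.getRotationMatrix2D((w / 2, h / 2), angle, 1.0)
    rotated = cv2.warpAffine(image, M, (w, h), borderMode=cv2.BORDER_REPLICATE)
    return gray3, rotated

# Example usage (hypothetical file): img = cv2.imread("plate.jpg"); gray, rot = augment(img, angle=-7)
```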
{"title":"Mobile Registration Number Plate Recognition Using Artificial Intelligence","authors":"Syed Talha Abid Ali, Abdul Hakeem Usama, I. R. Khan, M. Khan, Asif Siddiq","doi":"10.1109/ICIP42928.2021.9506699","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506699","url":null,"abstract":"Automatic License Plate Recognition (ALPR) for years has remained a persistent topic of research due to numerous practicable applications, especially in the Intelligent Transportation system (ITS). Many currently available solutions are still not robust in various real-world circumstances and often impose constraints like fixed backgrounds and constant distance and camera angles. This paper presents an efficient multi-language repudiate ALPR system based on machine learning. Convolutional Neural Network (CNN) is trained and fine-tuned for the recognition stage to become more dynamic, plaint to diversification of backgrounds. For license plate (LP) detection, a newly released YOLOv5 object detecting framework is used. Data augmentation techniques such as gray scale and rotatation are also used to generate an augmented dataset for the training purpose. This proposed methodology achieved a recognition rate of 92.2%, producing better results than commercially available systems, PlateRecognizer (67%) and OpenALPR (77%). Our experiments validated that the proposed methodology can meet the pressing requirement of real-time analysis in Intelligent Transportation System (ITS).","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"64 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130068316","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Listen To The Pixels
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506019
S. Chowdhury, Subhrajyoti Dasgupta, Sudip Das, U. Bhattacharya
Performing sound source separation and visual object segmentation jointly in naturally occurring videos is a notoriously difficult task, especially in the absence of annotated data. In this study, we leverage the concurrency between the audio and visual modalities in an attempt to solve the joint audio-visual segmentation problem in a self-supervised manner. Human beings interact with the physical world through a few sensory systems, such as vision, hearing, and movement. The usefulness of the interplay of such systems lies in the concept of degeneracy [1], which tells us that cross-modal signals can educate each other without the presence of an external supervisor. In this work, we efficiently exploit the fact that learning from one modality inherently helps to find patterns in the other by introducing a novel audio-visual fusion technique. Also, to the best of our knowledge, we are the first to address the partially occluded sound source segmentation task. Our study shows that the proposed model significantly outperforms existing state-of-the-art methods in both visual and audio source separation tasks.
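One way such a fusion could be wired, purely as an assumption for illustration, is cross-modal attention in which visual locations attend to audio frames so that sound can highlight the pixels of its source; the sketch below shows that pattern with hypothetical dimensions, not the paper's module.

```python
# Hypothetical audio-visual fusion via cross-modal attention (illustrative only).
import torch
import torch.nn as nn

class AudioVisualFusion(nn.Module):
    def __init__(self, v_dim=512, a_dim=128, d=256):
        super().__init__()
        self.q = nn.Linear(v_dim, d)   # queries from visual features
        self.k = nn.Linear(a_dim, d)   # keys from audio features
        self.v = nn.Linear(a_dim, d)   # values from audio features
        self.scale = d ** 0.5

    def forward(self, vis, aud):
        # vis: (B, H*W, v_dim) flattened spatial features; aud: (B, T, a_dim) audio frames
        q, k, v = self.q(vis), self.k(aud), self.v(aud)
        attn = torch.softmax(q @ k.transpose(1, 2) / self.scale, dim=-1)  # (B, H*W, T)
        fused = attn @ v                                                  # (B, H*W, d)
        return torch.cat([q, fused], dim=-1)  # audio-aware per-pixel features
```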
{"title":"Listen To The Pixels","authors":"S. Chowdhury, Subhrajyoti Dasgupta, Sudip Das, U. Bhattacharya","doi":"10.1109/ICIP42928.2021.9506019","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506019","url":null,"abstract":"Performing sound source separation and visual object segmentation jointly in naturally occurring videos is a notoriously difficult task, especially in the absence of annotated data. In this study, we leverage the concurrency between audio and visual modalities in an attempt to solve the joint audio-visual segmentation problem in a self-supervised manner. Human beings interact with the physical world through a few sensory systems such as vision, auditory, movement, etc. The usefulness of the interplay of such systems lies in the concept of degeneracy [1]. It tells us that the cross-modal signals can educate each other without the presence of an external supervisor. In this work, we efficiently exploit this fact that learning from one modality inherently helps to find patterns in others by introducing a novel audio-visual fusion technique. Also, to the best of our knowledge, we are the first to address the partially occluded sound source segmentation task. Our study shows that the proposed model significantly outperforms existing state-of-the-art methods in both visual and audio source separation tasks.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"25 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130667596","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Joint Co-Attention And Co-Reconstruction Representation Learning For One-Shot Object Detection
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506387
Jinghui Chu, Jiawei Feng, Peiguang Jing, Wei Lu
One-shot object detection aims to detect all candidate instances in a target image whose label class is unavailable during training, given only one labeled query image at test time. Nevertheless, insufficient utilization of the single known sample is one significant cause of the performance degradation of current one-shot object detection models. To tackle this problem, we develop joint co-attention and co-reconstruction (CoAR) representation learning for one-shot object detection. The main contributions are as follows. First, we propose a high-order feature fusion operation to exploit the deep co-attention of each target-query pair, which aims to enhance the correlation of the same class. Second, we use a low-rank structure to reconstruct the target-query feature at the channel level, which aims to remove irrelevant noise and enhance the latent similarity between the region proposals in the target image and the query image. Experiments on both the PASCAL VOC and MS COCO datasets demonstrate that our method outperforms previous state-of-the-art algorithms.
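As an illustration of what a low-rank, query-conditioned channel reconstruction could look like, the sketch below gates target-proposal channels through a factorized (rank-r) bottleneck driven by the pooled query feature; it is a guess at the mechanism for didactic purposes, and the exact CoAR formulation may differ.

```python
# Low-rank channel re-weighting of target features conditioned on the query (illustrative).
import torch
import torch.nn as nn

class LowRankChannelRecon(nn.Module):
    def __init__(self, channels=256, rank=32):
        super().__init__()
        # Factorized (low-rank) projection: C -> r -> C
        self.down = nn.Linear(channels, rank, bias=False)
        self.up = nn.Linear(rank, channels, bias=False)

    def forward(self, target_feat, query_feat):
        # target_feat: (B, C, H, W) proposal features; query_feat: (B, C) pooled query feature
        b, c, h, w = target_feat.shape
        gate = torch.sigmoid(self.up(self.down(query_feat)))  # (B, C) channel weights
        return target_feat * gate.view(b, c, 1, 1)            # re-weighted target features
```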
{"title":"Joint Co-Attention And Co-Reconstruction Representation Learning For One-Shot Object Detection","authors":"Jinghui Chu, Jiawei Feng, Peiguang Jing, Wei Lu","doi":"10.1109/ICIP42928.2021.9506387","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506387","url":null,"abstract":"One-shot object detection aims to detect all candidate instances in a target image whose label class is unavailable in training, and only one labeled query image is given in testing. Nevertheless, insufficient utilization of the only known sample is one significant reason causing the performance degradation of current one-shot object detection models. To tackle the problem, we develop joint co-attention and co-reconstruction (CoAR) representation learning for one-shot object detection. The main contributions are described as follows. First, we propose a high-order feature fusion operation to exploit the deep co-attention of each target-query pair, which aims to enhance the correlation of the same class. Second, we use a low-rank structure to reconstruct the target-query feature in channel level, which aims to remove the irrelevant noise and enhance the latent similarity between the region proposals in target image and the query image. Experiments on both PASCAL VOC and MS COCO datasets demonstrate that our method outperforms previous state-of-the-art algorithms.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"1 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"130728066","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Generating Aesthetic Based Critique For Photographs
Pub Date: 2021-09-19 | DOI: 10.1109/ICIP42928.2021.9506385
Yong-Yaw Yeo, John See, Lai-Kuan Wong, Hui-Ngo Goh
The recent surge in deep learning methods across multiple modalities has resulted in an increased interest in image captioning. Most advances in image captioning are still focused on the generation of factual-centric captions, which mainly describe the contents of an image. However, generating captions that provide a meaningful and opinionated critique of photographs is less studied. This paper presents a framework that leverages aesthetic features encoded by an image aesthetic scorer to synthesize human-like textual critiques via a sequence decoder. Experiments on a large-scale dataset show that the proposed method produces promising results on relevant metrics relating to semantic diversity and synonymity, with qualitative observations demonstrating the same. We also suggest the use of Word Mover's Distance as a semantically intuitive and informative metric for this task.
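Word Mover's Distance can be computed with gensim's KeyedVectors, as in the sketch below; the embedding file, tokenization, and example sentences are illustrative, and the authors' exact evaluation protocol is not reproduced.

```python
# Scoring a generated critique against a reference with Word Mover's Distance (lower = closer).
from gensim.models import KeyedVectors

def wmd_score(generated, reference, kv):
    gen_tokens = generated.lower().split()
    ref_tokens = reference.lower().split()
    return kv.wmdistance(gen_tokens, ref_tokens)

# Hypothetical usage with pre-trained word vectors:
# kv = KeyedVectors.load_word2vec_format("GoogleNews-vectors-negative300.bin", binary=True)
# print(wmd_score("soft lighting flatters the subject", "the lighting is pleasingly soft", kv))
```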
{"title":"Generating Aesthetic Based Critique For Photographs","authors":"Yong-Yaw Yeo, John See, Lai-Kuan Wong, Hui-Ngo Goh","doi":"10.1109/ICIP42928.2021.9506385","DOIUrl":"https://doi.org/10.1109/ICIP42928.2021.9506385","url":null,"abstract":"The recent surge in deep learning methods across multiple modalities has resulted in an increased interest in image captioning. Most advances in image captioning are still focused on the generation of factual-centric captions, which mainly describe the contents of an image. However, generating captions to provide a meaningful and opinionated critique of photographs is less studied. This paper presents a framework for leveraging aesthetic features encoded from an image aesthetic scorer, to synthesize human-like textual critique via a sequence decoder. Experiments on a large-scale dataset show that the proposed method is capable of producing promising results on relevant metrics relating to semantic diversity and synonymity, with qualitative observations demonstrating likewise. We also suggest the use of Word Mover’s Distance as a semantically intuitive and informative metric for this task.","PeriodicalId":314429,"journal":{"name":"2021 IEEE International Conference on Image Processing (ICIP)","volume":"144 1","pages":"0"},"PeriodicalIF":0.0,"publicationDate":"2021-09-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"132083270","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}