Feature Mixture on Pre-Trained Model for Few-Shot Learning
Pub Date: 2024-07-02 | DOI: 10.1109/TIP.2024.3411452
Shuo Wang;Jinda Lu;Haiyang Xu;Yanbin Hao;Xiangnan He
Few-shot learning (FSL) aims to recognize novel objects from a limited number of training samples. A robust feature extractor (backbone) can significantly improve the recognition performance of an FSL model. However, training an effective backbone is challenging because 1) designing and validating backbone structures is a time-consuming and expensive process, and 2) a backbone trained on the known (base) categories tends to focus on the textures of the objects it has seen, which describe novel samples poorly. To solve these problems, we propose a feature mixture operation on the pre-trained (fixed) features: 1) We replace a portion of the values of a feature map from a novel category with the content of other feature maps to increase the generalizability and diversity of training samples, which avoids retraining a complex backbone at high computational cost. 2) We use the similarities between features to constrain the mixture operation, which helps the classifier focus on representations of the novel object that are hidden in the features produced by the pre-trained backbone with biased training. Experimental studies on five benchmark datasets in both inductive and transductive settings demonstrate the effectiveness of our feature mixture (FM). Specifically, compared with the baseline on the Mini-ImageNet dataset, it achieves 3.8% and 4.2% accuracy improvements for 1 and 5 training samples, respectively. Additionally, the proposed mixture operation can be used to improve other existing FSL methods based on backbone training.
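To make the mixture operation concrete, here is a minimal sketch of a similarity-constrained mixture on fixed, pre-extracted features. The function name `feature_mixture`, the random-mask scheme, and the `mix_ratio` parameter are illustrative assumptions, not the authors' implementation:

```python
import torch
import torch.nn.functional as F

def feature_mixture(novel_feat, donor_feats, mix_ratio=0.3):
    """Replace a fraction of a novel feature map's values with content
    from the most similar donor map (similarity-constrained mixture).

    novel_feat:  (C, H, W) fixed feature map of a novel-class sample.
    donor_feats: (N, C, H, W) fixed feature maps used as mixture sources.
    """
    # Cosine similarity between the novel map and each candidate donor.
    sims = F.cosine_similarity(novel_feat.flatten().unsqueeze(0),
                               donor_feats.flatten(1), dim=1)    # (N,)
    donor = donor_feats[sims.argmax()]        # most similar donor map
    # Randomly pick the positions whose values are overwritten.
    mask = (torch.rand_like(novel_feat) < mix_ratio).float()
    return (1.0 - mask) * novel_feat + mask * donor

# Example: 5 donor maps and one novel map, all from a frozen backbone.
donors = torch.randn(5, 64, 5, 5)
novel = torch.randn(64, 5, 5)
mixed = feature_mixture(novel, donors)        # (64, 5, 5)
```

Because the backbone stays frozen, only these cheap tensor operations run per training sample, which is the source of the claimed computational savings.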
{"title":"Feature Mixture on Pre-Trained Model for Few-Shot Learning","authors":"Shuo Wang;Jinda Lu;Haiyang Xu;Yanbin Hao;Xiangnan He","doi":"10.1109/TIP.2024.3411452","DOIUrl":"10.1109/TIP.2024.3411452","url":null,"abstract":"Few-shot learning (FSL) aims at recognizing a novel object under limited training samples. A robust feature extractor (backbone) can significantly improve the recognition performance of the FSL model. However, training an effective backbone is a challenging issue since 1) designing and validating structures of backbones are time-consuming and expensive processes, and 2) a backbone trained on the known (base) categories is more inclined to focus on the textures of the objects it learns, which is hard to describe the novel samples. To solve these problems, we propose a feature mixture operation on the pre-trained (fixed) features: 1) We replace a part of the values of the feature map from a novel category with the content of other feature maps to increase the generalizability and diversity of training samples, which avoids retraining a complex backbone with high computational costs. 2) We use the similarities between the features to constrain the mixture operation, which helps the classifier focus on the representations of the novel object where these representations are hidden in the features from the pre-trained backbone with biased training. Experimental studies on five benchmark datasets in both inductive and transductive settings demonstrate the effectiveness of our feature mixture (FM). Specifically, compared with the baseline on the Mini-ImageNet dataset, it achieves 3.8% and 4.2% accuracy improvements for 1 and 5 training samples, respectively. Additionally, the proposed mixture operation can be used to improve other existing FSL methods based on backbone training.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494632","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition
Pub Date: 2024-07-02 | DOI: 10.1109/TIP.2024.3411448
Jie Nie;Xin Wang;Runze Hou;Guohao Li;Hong Chen;Wenwu Zhu
Video question answering (VideoQA) requires the ability to comprehensively understand the visual content of videos. Existing VideoQA models mainly focus on scenarios involving a single event with simple object interactions, leaving event-centric scenarios involving multiple events with dynamically complex object interactions largely unexplored. These conventional VideoQA models are usually based on features extracted from global visual signals, making it difficult to capture object-level and event-level semantics. Although a recent work utilizes a static spatio-temporal graph to explicitly model object interactions in videos, it ignores the dynamic influence of the question on graph construction and fails to exploit the implicit event-level semantic clues in questions. To overcome these limitations, we propose a Self-supervised Dynamic Graph Reasoning (SDGraphR) model for video question answering (VideoQA). Our SDGraphR model learns a question-guided spatio-temporal graph that dynamically encodes intra-frame spatial correlations and inter-frame correspondences between objects in the videos. Furthermore, the proposed SDGraphR model discovers event-level cues in questions to conduct self-supervised learning with an auxiliary event recognition task, which in turn helps to improve its VideoQA performance without using any extra annotations. We carry out extensive experiments to validate the substantial improvements of our proposed SDGraphR model over existing baselines.
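As an illustration of what "question-guided" graph construction might look like, the sketch below gates pairwise object affinities by each node's relevance to the question. The shapes, the sigmoid gating form, and the function name are assumptions for exposition, not the SDGraphR architecture:

```python
import torch

def question_guided_adjacency(obj_feats, q_emb):
    """Dense adjacency over object nodes across frames, with edge weights
    modulated by each node's relevance to the question.

    obj_feats: (T, K, D) features of K objects in each of T frames.
    q_emb:     (D,) pooled question embedding.
    """
    T, K, D = obj_feats.shape
    nodes = obj_feats.reshape(T * K, D)
    # Scalar question-relevance gate per object node.
    gate = torch.sigmoid(nodes @ q_emb)                    # (T*K,)
    # Pairwise affinities, covering intra- and inter-frame pairs alike.
    aff = torch.softmax(nodes @ nodes.t() / D ** 0.5, dim=-1)
    # Question-conditioned adjacency: gated on both endpoints.
    return gate.unsqueeze(1) * aff * gate.unsqueeze(0)     # (T*K, T*K)

adj = question_guided_adjacency(torch.randn(8, 5, 256), torch.randn(256))
```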
{"title":"Dynamic Spatio-Temporal Graph Reasoning for VideoQA With Self-Supervised Event Recognition","authors":"Jie Nie;Xin Wang;Runze Hou;Guohao Li;Hong Chen;Wenwu Zhu","doi":"10.1109/TIP.2024.3411448","DOIUrl":"10.1109/TIP.2024.3411448","url":null,"abstract":"Video question answering (VideoQA) requires the ability of comprehensively understanding visual contents in videos. Existing VideoQA models mainly focus on scenarios involving a single event with simple object interactions and leave event-centric scenarios involving multiple events with dynamically complex object interactions largely unexplored. These conventional VideoQA models are usually based on features extracted from the global visual signals, making it difficult to capture the object-level and event-level semantics. Although there exists a recent work utilizing a static spatio-temporal graph to explicitly model object interactions in videos, it ignores the dynamic impact of questions for graph construction and fails to exploit the implicit event-level semantic clues in questions. To overcome these limitations, we propose a Self-supervised Dynamic Graph Reasoning (SDGraphR) model for video question answering (VideoQA). Our SDGraphR model learns a question-guided spatio-temporal graph that dynamically encodes intra-frame spatial correlations and inter-frame correspondences between objects in the videos. Furthermore, the proposed SDGraphR model discovers event-level cues from questions to conduct self-supervised learning with an auxiliary event recognition task, which in turn helps to improve its VideoQA performances without using any extra annotations. We carry out extensive experiments to validate the substantial improvements of our proposed SDGraphR model over existing baselines.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494587","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multiple Riemannian Kernel Hashing for Large-Scale Image Set Classification and Retrieval
Pub Date: 2024-07-02 | DOI: 10.1109/TIP.2024.3419414
Xiaobo Shen;Wei Wu;Xiaxin Wang;Yuhui Zheng
Conventional image set methods typically learn from small to medium-sized image set datasets. However, when applied to large-scale image set applications such as classification and retrieval, they face two primary challenges: 1) effectively modeling complex image sets, and 2) efficiently performing these tasks. To address these issues, we propose a novel Multiple Riemannian Kernel Hashing (MRKH) method that leverages the power of Riemannian manifolds and hashing for effective and efficient image set representation. MRKH represents each image set on multiple heterogeneous Riemannian manifolds. It introduces a multiple kernel learning framework designed to effectively combine statistics from the multiple manifolds, and constructs kernels by selecting a small set of anchor points, enabling efficient scaling to large-scale applications. In addition, MRKH exploits inter- and intra-modal semantic structure to enhance discrimination. Instead of employing continuous features to represent each image set, MRKH learns a hash code for each image set, thereby achieving efficient computation and storage. We present an iterative algorithm with a theoretical convergence guarantee to optimize MRKH, whose computational complexity is linear in the size of the dataset. Extensive experiments on five image set benchmark datasets, including three large-scale ones, demonstrate that the proposed method outperforms state-of-the-art methods in accuracy and efficiency, particularly in large-scale image set classification and retrieval.
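The anchor-point construction behind the claimed linear complexity can be sketched as follows: each manifold contributes an N-by-m kernel against m anchors rather than a full N-by-N Gram matrix. The RBF form, `gamma`, and the stand-in `weights`/`W` are illustrative assumptions, standing in for the quantities the paper's iterative algorithm would learn:

```python
import numpy as np

def anchor_kernel(X, anchors, gamma=1.0):
    """RBF kernel between N samples and m anchor points: an (N, m) matrix
    instead of the full (N, N) Gram matrix, which is what keeps the cost
    linear in the dataset size."""
    d2 = ((X[:, None, :] - anchors[None, :, :]) ** 2).sum(-1)  # (N, m)
    return np.exp(-gamma * d2)

def mrkh_codes(kernels, weights, W):
    """Combine per-manifold anchor kernels with (learned) weights, project,
    and binarize to obtain hash codes in {-1, +1}."""
    K = sum(w * Km for w, Km in zip(weights, kernels))   # (N, m)
    return np.sign(K @ W)                                # (N, bits)

# Example: two manifold-specific feature matrices for the same 1000 sets.
X1, X2 = np.random.randn(1000, 64), np.random.randn(1000, 32)
K1 = anchor_kernel(X1, X1[:50])        # 50 anchor points per manifold
K2 = anchor_kernel(X2, X2[:50])
codes = mrkh_codes([K1, K2], [0.6, 0.4], np.random.randn(50, 16))
```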
{"title":"Multiple Riemannian Kernel Hashing for Large-Scale Image Set Classification and Retrieval","authors":"Xiaobo Shen;Wei Wu;Xiaxin Wang;Yuhui Zheng","doi":"10.1109/TIP.2024.3419414","DOIUrl":"10.1109/TIP.2024.3419414","url":null,"abstract":"Conventional image set methods typically learn from small to medium-sized image set datasets. However, when applied to large-scale image set applications such as classification and retrieval, they face two primary challenges: 1) effectively modeling complex image sets; and 2) efficiently performing tasks. To address the above issues, we propose a novel Multiple Riemannian Kernel Hashing (MRKH) method that leverages the powerful capabilities of Riemannian manifold and Hashing on effective and efficient image set representation. MRKH considers multiple heterogeneous Riemannian manifolds to represent each image set. It introduces a multiple kernel learning framework designed to effectively combine statistics from multiple manifolds, and constructs kernels by selecting a small set of anchor points, enabling efficient scalability for large-scale applications. In addition, MRKH further exploits inter- and intra-modal semantic structure to enhance discrimination. Instead of employing continuous feature to represent each image set, MRKH suggests learning hash code for each image set, thereby achieving efficient computation and storage. We present an iterative algorithm with theoretical convergence guarantee to optimize MRKH, and the computational complexity is linear with the size of dataset. Extensive experiments on five image set benchmark datasets including three large-scale ones demonstrate the proposed method outperforms state-of-the-arts in accuracy and efficiency particularly in large-scale image set classification and retrieval.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141494633","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning Kernel-Modulated Neural Representation for Efficient Light Field Compression
Pub Date: 2024-07-01 | DOI: 10.1109/TIP.2024.3418670
Jinglei Shi;Yihong Xu;Christine Guillemot
Light fields capture 3D scene information by recording light rays emitted from a scene at various orientations. They offer a more immersive perception compared with classic 2D images, but at the cost of huge data volumes. In this paper, we design a compact neural network representation for the light field compression task. In the same vein as the deep image prior, the neural network takes randomly initialized noise as input and is trained in a supervised manner to best reconstruct the target light field's Sub-Aperture Images (SAIs). The network is composed of two types of complementary kernels: descriptive kernels (descriptors) that store scene description information learned during training, and modulatory kernels (modulators) that control the rendering of different SAIs from the queried perspectives. To further enhance the compactness of the network while retaining high quality in the decoded light field, we propose modulator allocation and apply kernel tensor decomposition, followed by non-uniform quantization and lossless entropy coding. Extensive experiments demonstrate that our method outperforms other state-of-the-art (SOTA) methods by a significant margin on the light field compression task. Moreover, after adapting the descriptors, the modulators learned from one light field can be transferred to new light fields for rendering dense views, showing the potential of the solution for view synthesis.
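One plausible reading of the descriptor/modulator split is a convolution whose shared kernel is rescaled per view, so only a tiny per-view tensor differs across SAIs. The elementwise scaling form, the class name, and the initialization below are our assumptions, not the paper's layer definition:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ModulatedConv(nn.Module):
    """A layer with one shared descriptive kernel (descriptor) and a small
    per-view scaling kernel (modulator); only the modulator changes when a
    different sub-aperture image is rendered."""
    def __init__(self, cin, cout, n_views, k=3):
        super().__init__()
        self.descriptor = nn.Parameter(torch.randn(cout, cin, k, k) * 0.05)
        self.modulators = nn.Parameter(torch.ones(n_views, cout, 1, 1, 1))

    def forward(self, x, view_idx):
        # View-specific kernel = shared descriptor scaled by its modulator.
        w = self.descriptor * self.modulators[view_idx]
        return F.conv2d(x, w, padding=1)

layer = ModulatedConv(16, 16, n_views=49)     # e.g. a 7x7 grid of SAIs
noise = torch.randn(1, 16, 32, 32)            # DIP-style noise input
sai_feat = layer(noise, view_idx=10)          # features for one queried view
```

This split also matches the transfer result: descriptors carry the scene, so re-fitting them while reusing modulators lets one render dense views of a new light field.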
{"title":"Learning Kernel-Modulated Neural Representation for Efficient Light Field Compression","authors":"Jinglei Shi;Yihong Xu;Christine Guillemot","doi":"10.1109/TIP.2024.3418670","DOIUrl":"10.1109/TIP.2024.3418670","url":null,"abstract":"Light fields capture 3D scene information by recording light rays emitted from a scene at various orientations. They offer a more immersive perception, compared with classic 2D images, but at the cost of huge data volumes. In this paper, we design a compact neural network representation for the light field compression task. In the same vein as the deep image prior, the neural network takes randomly initialized noise as input and is trained in a supervised manner in order to best reconstruct the target light field Sub-Aperture Images (SAIs). The network is composed of two types of complementary kernels: descriptive kernels (descriptors) that store scene description information learned during training, and modulatory kernels (modulators) that control the rendering of different SAIs from the queried perspectives. To further enhance compactness of the network meanwhile retain high quality of the decoded light field, we propose modulator allocation and apply kernel tensor decomposition techniques, followed by non-uniform quantization and lossless entropy coding. Extensive experiments demonstrate that our method outperforms other state-of-the-art (SOTA) methods by a significant margin in the light field compression task. Moreover, after adapting descriptors, the modulators learned from one light field can be transferred to new light fields for rendering dense views, showing the potential of the solution for view synthesis.","PeriodicalId":94032,"journal":{"name":"IEEE transactions on image processing : a publication of the IEEE Signal Processing Society","volume":null,"pages":null},"PeriodicalIF":0.0,"publicationDate":"2024-07-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"141478181","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":0,"RegionCategory":"","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-07-01 | DOI: 10.1109/TIP.2024.3418581
Mengcheng Lan;Min Meng;Jun Yu;Jigang Wu
Domain adaptation has shown appealing performance by leveraging knowledge from a source domain with rich annotations. However, for a specific target task, it is cumbersome to collect related, high-quality source domains. In real-world scenarios, large-scale datasets corrupted with noisy labels are easy to collect, stimulating a great demand for automatic recognition in a generalized setting, i.e., weakly-supervised partial domain adaptation (WS-PDA), which transfers a classifier from a large source domain with noisy labels to a small unlabeled target domain. The key issues of WS-PDA are thus: 1) how to sufficiently discover knowledge from the noisily labeled source domain and the unlabeled target domain, and 2) how to successfully adapt that knowledge across domains. In this paper, we propose a simple yet effective domain adaptation approach, termed self-paced transfer classifier learning (SP-TCL), to address these issues; it can be regarded as a well-performing baseline for several generalized domain adaptation tasks. The proposed model is built upon the self-paced learning scheme, seeking a preferable classifier for the target domain. Specifically, SP-TCL learns to discover faithful knowledge via a carefully designed prudent loss function and simultaneously adapts the learned knowledge to the target domain by iteratively excluding source examples from training in a self-paced fashion. Extensive evaluations on several benchmark datasets demonstrate that SP-TCL significantly outperforms state-of-the-art approaches on several generalized domain adaptation tasks. Code is available at https://github.com/mc-lan/SP-TCL.
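The "iteratively excluding source examples" step follows the classic self-paced selection pattern, sketched below with hard thresholding. The paper's prudent loss is not reproduced here; the loop skeleton and its names (`per_example_loss`, `fit`, `growth`) are hypothetical:

```python
import numpy as np

def self_paced_weights(losses, lam):
    """Hard-threshold self-paced selection: keep (weight 1) examples whose
    current loss is below the age parameter lam, exclude (weight 0) the rest."""
    return (losses <= lam).astype(float)

losses = np.array([0.1, 0.5, 2.0, 0.3])
v = self_paced_weights(losses, lam=1.0)   # -> [1., 1., 0., 1.]

# Skeleton of the alternating scheme (names hypothetical):
# for epoch in range(num_epochs):
#     losses = per_example_loss(classifier, source_set)  # noisy source data
#     v = self_paced_weights(losses, lam)                # select easy examples
#     classifier = fit(source_set, sample_weight=v)      # refit on kept ones
#     lam *= growth                                      # admit harder examples
```

Growing `lam` over iterations is what lets the classifier start from confidently labeled source examples and gradually shift its emphasis toward the target domain.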