Traditional image captioning methods only have a local perspective at the dataset level, allowing them to explore dispersed information within individual images. However, the lack of a global perspective prevents them from capturing common characteristics among similar images. To address the limitation, this paper introduces a novel Triple-stream Commonsense Circulating Transformer Network (TCCTN). It incorporates contextual stream into the encoder, combining enhanced channel stream and spatial stream for comprehensive feature learning. The proposed commonsense-aware contextual attention (CCA) module queries commonsense contextual features from the dataset, obtaining global contextual association information by projecting grid features into the contextual space. The pure semantic channel attention (PSCA) module leverages compressed spatial domain for channel pooling, focusing on attention weights of pure channel features to capture inherent semantic features. The region spatial attention (RSA) module enhances spatial concepts in semantic learning by incorporating region position information. Furthermore, leveraging the complementary differences among the three features, TCCTN introduces the mixture of experts strategy to enhance the unique discriminative ability of features and promote their integration in textual feature learning. Extensive experiments on the MS-COCO dataset demonstrate the effectiveness of contextual commonsense stream and the superior performance of TCCTN.
Few-shot learning has achieved great success in many fields, thanks to its requirement of limited number of labeled data. However, most of the state-of-the-art techniques of few-shot learning employ transfer learning, which still requires massive labeled data to train a meta-learning system. To simulate the human learning mechanism, a deep model of few-shot learning is proposed to learn from one, or a few examples. First of all in this paper, we analyze and note that the problem with representative semi-supervised few-shot learning methods is getting stuck in local optimization and the negligence of intra-class compactness problem. To address these issue, we propose a novel semi-supervised few-shot learning method with Convex Kullback–Leibler, hereafter referred to as CKL, in which KL divergence is employed to achieve global optimum solution by optimizing a strictly convex functions to perform clustering; whereas sample selection strategy is employed to achieve intra-class compactness. In training, the CKL is optimized iteratively via deep learning and expectation–maximization algorithm. Intensive experiments have been conducted on three popular benchmark data sets, take miniImagenet data set for example, our proposed CKL achieved 76.83% and 85.78% under 5-way 1-shot and 5-way 5-shot, the experimental results show that this method significantly improves the classification ability of few-shot learning tasks and obtains the start-of-the-art performance.
Depth completion aims at reconstructing a dense depth from sparse depth input, frequently using color images as guidance. The sparse depth map lacks sufficient contexts for reconstructing focal contexts such as the shape of objects. The RGB images contain redundant contexts including details useless for reconstruction, which reduces the efficiency of focal context extraction. The unaligned contextual information from these two modalities poses a challenge to focal context extraction and further fusion, as well as the accuracy of depth completion. To optimize the utilization of multimodal contextual information, we explore a novel framework: Context Aligned Fusion Network (CAFNet). CAFNet comprises two stages: the context-aligned stage and the full-scale stage. In the context-aligned stage, CAFNet downsamples input RGB-D pairs to the scale, at which multimodal contextual information is adequately aligned for feature extraction in two encoders and fusion in CF modules. In the full-scale stage, feature maps with fused multimodal context from the previous stage are upsampled to the original scale and subsequentially fused with full-scale depth features by the GF module utilizing a dynamic masked fusion strategy. Ultimately, accurate dense depth maps are reconstructed, leveraging the GF module’s resultant features. Experiments conducted on indoor and outdoor benchmark datasets show that the CAFNet produces results comparable to state-of-the-art methods while effectively reducing computational costs.
Infrared and visible image fusion is an extensively investigated problem in infrared image processing, aiming to extract useful information from source images. However, the automatic fusion of these images presents a significant challenge due to the large domain difference and ambiguous boundaries. In this article, we propose a novel image fusion approach based on hybrid boundary-aware attention, termed HBANet, which models global dependencies across the image and leverages boundary-wise prior knowledge to supplement local details. Specifically, we design a novel mixed boundary-aware attention module that is capable of leveraging spatial information to the fullest extent and integrating long dependencies across different domains. To preserve the integrity of texture and structural information, we introduced a sophisticated loss function that comprises structure, intensity, and variation losses. Our method has been demonstrated to outperform state-of-the-art methods in terms of both visual and quantitative metrics, in our experiments on public datasets. Furthermore, our approach also exhibits great generalization capability, achieving satisfactory results in CT and MRI image fusion tasks.
Language-vision integration has become an increasingly popular research direction within the computer vision field. In recent years, there has been a growing recognition of the importance of incorporating linguistic information into visual tasks, particularly in domains such as action anticipation. This integration allows anticipation models to leverage textual descriptions to gain deeper contextual understanding, leading to more accurate predictions. In this work, we focus on pedestrian action anticipation, where the objective is the early prediction of pedestrians’ future actions in urban environments. Our method relies on a multi-modal transformer model that encodes past observations and produces predictions at different anticipation times, employing a learned mask technique to filter out redundancy in the observed frames. Instead of relying solely on visual cues extracted from images or videos, we explore the impact of integrating textual information in enriching the input modalities of our pedestrian action anticipation model. We investigate various techniques for generating descriptive captions corresponding to input images, aiming to enhance the anticipation performance. Evaluation results on available public benchmarks demonstrate the effectiveness of our method in improving the prediction performance at different anticipation times compared to previous works. Additionally, incorporating the language modality in our anticipation model proved significant improvement, reaching a 29.5% increase in the F1 score at 1-second anticipation and a 16.66% increase at 4-second anticipation. These results underscore the potential of language-vision integration in advancing pedestrian action anticipation in complex urban environments.
Aiming at the problem of insufficient use of human–object interaction (HOI) information and spatial location information in images, we propose a human–object interaction detection network based on graph structure and improved cascade pyramid. This network is composed of three branches, namely, graph branch, human–object branch and human pose branch. In graph branch, we propose a Graph-based Interactive Feature Generation Algorithm (GIFGA) to address the inadequate utilization of interaction information. GIFGA constructs an initial dense graph model by taking humans and objects as nodes and their interaction relationships as edges. Then, by traversing each node, the graph model is updated to generate the final interaction features. In human pose branch, we propose an Improved Cascade Pyramid Network (ICPN) to tackle the underutilization of spatial location information. ICPN extracts human pose features and maps both the object bounding boxes and extracted human pose maps onto the global feature map to capture the most discriminative interaction-related region features within the global context. Finally, the features from the three branches are fed into a Multi-Layer Perceptron (MLP) for fusion and then classified for recognition. Experimental results demonstrate that our network achieves mAP of 54.93% and 28.69% on the V-COCO and HICO-DET datasets, respectively.
In recent years, multi-modal fusion methods have shown excellent performance in the field of 3D object detection, which select the voxel centers and globally fuse with image features across the scene. However, these approaches exist two issues. First, The distribution of voxel density is highly heterogeneous due to the discrete volumes. Additionally, there are significant differences in the features between images and point clouds. Global fusion does not take into account the correspondence between these two modalities, which leads to the insufficient fusion. In this paper, we propose a new multi-modal fusion method named Voxel-Image Dynamic Fusion (VIDF). Specifically, VIDF-Net is composed of the Voxel Centroid Mapping module (VCM) and the Deformable Attention Fusion module (DAF). The Voxel Centroid Mapping module is used to calculate the centroid of voxel features and map them onto the image plane, which can locate the position of voxel features more effectively. We then use the Deformable Attention Fusion module to dynamically calculates the offset of each voxel centroid from the image position and combine these two modalities. Furthermore, we propose Region Proposal Network with Channel-Spatial Aggregate to combine channel and spatial attention maps for improved multi-scale feature interaction. We conduct extensive experiments on the KITTI dataset to demonstrate the outstanding performance of proposed VIDF network. In particular, significant improvements have been observed in the Hard categories of Cars and Pedestrians, which shows the significant effectiveness of our approach in dealing with complex scenarios.
Accurate extraction of regions of interest (ROI) with variable shapes and scales is one of the primary challenges in medical image segmentation. Current U-based networks mostly aggregate multi-stage encoding outputs as an improved multi-scale skip connection. Although this design has been proven to provide scale diversity and contextual integrity, there remain several intuitive limits: (i) the encoding outputs are resampled to the same size simply, which destruct the fine-grained information. The advantages of utilization of multiple scales are insufficient. (ii) Certain redundant information proportional to the feature dimension size is introduced and causes multi-stage interference. And (iii) the precision of information delivery relies on the up-sampling and down-sampling layers, but guidance on maintaining consistency in feature locations and trends between them is lacking.
To improve these situations, this paper proposed a U-based CNN network named HAD-Net, by assembling a new hyper-scale shifted aggregating module (HSAM) paradigm and progressive reusing attention (PRA) for skip connections, as well as employing a novel pair of dual-branch parameter-free sampling layers, i.e. max-diagonal pooling (MDP) and max-diagonal un-pooling (MDUP). That is, the aggregating scheme additionally combines five subregions with certain offsets in the shallower stage. Since the lower scale-down ratios of subregions enrich scales and fine-grain context. Then, the attention scheme contains a partial-to-global channel attention (PGCA) and a multi-scale reusing spatial attention (MRSA), it builds reusing connections internally and adjusts the focus on more useful dimensions. Finally, MDP and MDUP are explored in pairs to improve texture delivery and feature consistency, enhancing information retention and avoiding positional confusion.
Compared to state-of-the-art networks, HAD-Net has achieved comparable and even better performances with Dice of 90.13%, 81.51%, and 75.43% for each class on BraTS20, 89.59% Dice and 98.56% AUC on Kvasir-SEG, as well as 82.17% Dice and 98.05% AUC on DRIVE.
The scheme of HSAM+PRA+MDP+MDUP has been proven to be a remarkable improvement and leaves room for further research.