Jointly stochastic fully symmetric interpolatory rules and local approximation for scalable Gaussian process regression
Pub Date: 2024-10-31 | DOI: 10.1016/j.patcog.2024.111125
Large-scale Gaussian process regression (GPR) faces three core challenges that constrain its effectiveness: first, the O(n³) time complexity of inverting the covariance matrix of n training points becomes an insurmountable bottleneck on large-scale datasets; second, although traditional local approximation methods are widely used, they often suffer from inconsistent predictions; third, many aggregation strategies lack discrimination when evaluating the importance of experts (i.e., local models), which degrades overall prediction accuracy. To address these challenges, this article proposes a method that integrates third-degree stochastic fully symmetric interpolatory rules (TDSFSI), local approximation, and Tsallis mutual information (TDSFSIRLA). Specifically, TDSFSIRLA first introduces an efficient third-degree stochastic fully symmetric interpolatory rule that accurately approximates the Gaussian kernel by generating adaptive-dimensional feature maps. This significantly reduces the number of required orthogonal nodes and lowers computational cost while maintaining high approximation accuracy, providing a solid foundation for processing large-scale datasets. To overcome the inconsistency of local approximation, the paper adopts the Generalized Robust Bayesian Committee Machine (GRBCM) as the aggregation framework for local experts; its inherent consistency and robustness improve the stability and reliability of the overall prediction. To address the uneven distribution of expert weights, the article introduces Tsallis mutual information as the weight-allocation metric: by capturing information complexity, it assigns each local expert a weight matching its contribution, mitigating the prediction bias caused by uneven weights and further improving accuracy. Experiments on multiple synthetic datasets and seven representative real datasets show that TDSFSIRLA significantly reduces time complexity while delivering excellent prediction accuracy, confirming its advantages for large-scale Gaussian process regression.
{"title":"Jointly stochastic fully symmetric interpolatory rules and local approximation for scalable Gaussian process regression","authors":"","doi":"10.1016/j.patcog.2024.111125","DOIUrl":"10.1016/j.patcog.2024.111125","url":null,"abstract":"<div><div>When exploring the broad application prospects of large-scale Gaussian process regression (GPR), three core challenges significantly constrain its full effectiveness: firstly, the <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>n</mi></mrow><mrow><mn>3</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> time complexity of computing the inverse covariance matrix of <span><math><mi>n</mi></math></span> training points becomes an insurmountable performance bottleneck when processing large-scale datasets; Secondly, although traditional local approximation methods are widely used, they are often limited by the inconsistency of prediction results; The third issue is that many aggregation strategies lack discrimination when evaluating the importance of experts (i.e. local models), resulting in a loss of overall prediction accuracy. In response to the above challenges, this article innovatively proposes a comprehensive method that integrates third-degree stochastic fully symmetric interpolatory rules (TDSFSI), local approximation, and Tsallis mutual information (TDSFSIRLA), aiming to fundamentally break through existing limitations. Specifically, TDSFSIRLA first introduces an efficient third-degree stochastic fully symmetric interpolatory rules, which achieves accurate approximation of Gaussian kernel functions by generating adaptive dimensional feature maps. This innovation not only significantly reduces the number of required orthogonal nodes and effectively lowers computational costs, but also maintains extremely high approximation accuracy, providing a solid theoretical foundation for processing large-scale datasets. Furthermore, in order to overcome the inconsistency of local approximation methods, this paper adopts the Generalized Robust Bayesian Committee Machine (GRBCM) as the aggregation framework for local experts. GRBCM ensures the harmonious unity of the prediction results of each local model through its inherent consistency and robustness, significantly improving the stability and reliability of the overall prediction. More importantly, in response to the issue of uneven distribution of expert weights, this article creatively introduces Tsallis mutual information as a metric for weight allocation. Tsallis mutual information, with its sensitive ability to capture information complexity, assigns weights to different local experts that match their contribution, effectively solving the problem of prediction bias caused by uneven weight distribution and further improving prediction accuracy. In the experimental verification phase, this article conducted comprehensive testing on multiple synthetic datasets and seven representative real datasets. 
The results show that the TDSFSIRLA method not only achieves significant reduction in time complexity, but also demonstrates excellent performance in prediction accuracy, fully verifying its significant advantages and broad application prospects in the field of large-scale Gaussi","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594028","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
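The TDSFSI quadrature rule itself is not specified in the abstract. As a rough illustration of the general idea (approximating the Gaussian kernel with an explicit feature map so that GP regression avoids the O(n³) covariance inverse), the sketch below uses random Fourier features as a stand-in for the paper's interpolatory rule; all names and parameters are illustrative assumptions.

```python
import numpy as np

def gaussian_kernel_feature_map(X, num_features=200, lengthscale=1.0, seed=0):
    """Random Fourier features as a stand-in for the paper's TDSFSI-based map:
    k(x, x') = exp(-||x - x'||^2 / (2 l^2)) ~ phi(x) @ phi(x')."""
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    W = rng.normal(0.0, 1.0 / lengthscale, size=(d, num_features))
    b = rng.uniform(0.0, 2.0 * np.pi, size=num_features)
    return np.sqrt(2.0 / num_features) * np.cos(X @ W + b)

def gp_mean_with_features(X_train, y_train, X_test, noise_var=0.01, **kw):
    """GP posterior mean computed in feature space: O(n D^2 + D^3) instead of O(n^3)."""
    Z = gaussian_kernel_feature_map(X_train, **kw)        # (n, D)
    Z_star = gaussian_kernel_feature_map(X_test, **kw)    # (m, D), same random features
    D = Z.shape[1]
    A = Z.T @ Z + noise_var * np.eye(D)                   # (D, D) system instead of (n, n)
    w = np.linalg.solve(A, Z.T @ y_train)
    return Z_star @ w

X = np.random.rand(2000, 3)
y = np.sin(X.sum(axis=1)) + 0.1 * np.random.randn(2000)
print(gp_mean_with_features(X, y, X[:5]))                 # approximate posterior mean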
Apply prior feature integration to sparse object detectors
Pub Date: 2024-10-31 | DOI: 10.1016/j.patcog.2024.111103
Using noisy boxes as queries for sparse object detection has become a hot research topic in recent years. Sparse R-CNN achieves one-to-one prediction from noisy boxes to object boxes, while DiffusionDet reformulates the prediction process of Sparse R-CNN as multiple diffusion processes. Notably, Sparse R-CNN and its improved versions all rely on an FPN to extract features for RoI Align, yet each target matches only one feature map in the FPN, which is inefficient and resource-consuming. Moreover, these sparse detectors crop regions from noisy boxes for prediction, so the boxes fail to capture global features. In this work, we rethink the sparse object detection paradigm, propose two improvements, and produce a new object detector called Prior Sparse R-CNN. First, we replace the original FPN neck with a neck that outputs only one feature map to improve efficiency, and we design an aggregated encoder after the neck that addresses the object scale problem through dilated residual blocks and feature aggregation. Second, we introduce prior knowledge for the noisy boxes to enhance their understanding of global representations: a Region Generation Network (RGN) generates global object information and fuses it with the features of the noisy boxes as prior knowledge. Prior Sparse R-CNN reaches a state-of-the-art 47.0 AP on the COCO 2017 validation set, surpassing DiffusionDet by 1.5 AP with a ResNet-50 backbone, while requiring only 3/5 of the training time.
{"title":"Apply prior feature integration to sparse object detectors","authors":"","doi":"10.1016/j.patcog.2024.111103","DOIUrl":"10.1016/j.patcog.2024.111103","url":null,"abstract":"<div><div>Noisy boxes as queries for sparse object detection has become a hot topic of research in recent years. Sparse R-CNN achieves one-to-one prediction from noisy boxes to object boxes, while DiffusionDet transforms the prediction process of Sparse R-CNN into multiple diffusion processes. Especially, algorithms such as Sparse R-CNN and its improved versions all rely on FPN to extract features for ROI Aligning. But the target only matching one feature map in FPN, which is inefficient and resource-consuming. otherwise, these methods like sparse object detection crop regions from noisy boxes for prediction, resulting in boxes failing to capture global features. In this work, we rethink the detection paradigm of sparse object detection and propose two improvements and produce a new object detector, called Prior Sparse R-CNN. Firstly, we replace the original FPN neck with a neck that only outputs one feature map to improve efficiency. Then, we design aggregated encoder after neck to solve the object scale problem through dilated residual blocks and feature aggregation. Another improvement is that we introduce prior knowledge for noisy boxes to enhance their understanding of global representations. Region Generation network (RGN) is designed by us to generate global object information and fuse it with the features of noisy boxes as prior knowledge. Prior Sparse R-CNN reaches the state-of-the-art 47.0 AP on COCO 2017 validation set, surpassing DiffusionDet by 1.5 AP with ResNet-50 backbone. Additionally, our training epoch requires only 3/5 of the time.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593962","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition
Pub Date: 2024-10-31 | DOI: 10.1016/j.patcog.2024.111106
The current successful paradigm for skeleton-based action recognition combines Graph Convolutional Networks (GCNs), which model spatial correlations, with Temporal Convolutional Networks (TCNs), which extract motion features. Such GCN-TCN approaches usually rely on local graph convolution operations, which limits their ability to capture complicated correlations among distant joints and to represent long-range dependencies. Although the self-attention that originated in Transformers shows great potential for modeling correlations among global joints, Transformer-based methods are usually computationally expensive and ignore the physical connectivity structure of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) to simultaneously learn local and global representations in the spatial-temporal dimension. Our approach consists of three components. The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations through parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling, constituting a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of skeleton action sequence frames, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that our LG-SGNet performs very competitively compared with state-of-the-art methods. Our project website is available at https://github.com/DingYyue/LG-SGNet.
{"title":"Local and global self-attention enhanced graph convolutional network for skeleton-based action recognition","authors":"","doi":"10.1016/j.patcog.2024.111106","DOIUrl":"10.1016/j.patcog.2024.111106","url":null,"abstract":"<div><div>The current successful paradigm for skeleton-based action recognition is the combination of Graph Convolutional Networks (GCNs) modeling spatial correlations, and Temporal Convolution Networks (TCNs), extracting motion features. Such GCN-TCN-based approaches usually rely on local graph convolution operations, which limits their ability to capture complicated correlations among distant joints, as well as represent long-range dependencies. Although the self-attention originated from Transformers shows great potential in correlation modeling of global joints, the Transformer-based methods are usually computationally expensive and ignore the physical connectivity structure of the human skeleton. To address these issues, we propose a novel Local-Global Self-Attention Enhanced Graph Convolutional Network (LG-SGNet) to simultaneously learn both local and global representations in the spatial–temporal dimension. Our approach consists of three components: The Local-Global Graph Convolutional Network (LG-GCN) module extracts local and global spatial feature representations by parallel channel-specific global and local spatial modeling. The Local-Global Temporal Convolutional Network (LG-TCN) module performs a joint-wise global temporal modeling using multi-head self-attention in parallel with local temporal modeling. This constitutes a new multi-branch temporal convolution structure that effectively captures both long-range dependencies and subtle temporal structures. Finally, the Dynamic Frame Weighting Module (DFWM) adjusts the weights of skeleton action sequence frames, allowing the model to adaptively focus on the features of representative frames for more efficient action recognition. Extensive experiments demonstrate that our LG-SGNet performs very competitively compared to the state-of-the-art methods. Our project website is available at <span><span>https://github.com/DingYyue/LG-SGNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-31","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593961","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Explainability-based knowledge distillation
Pub Date: 2024-10-30 | DOI: 10.1016/j.patcog.2024.111095
Knowledge distillation (KD) is a popular approach to deep model acceleration. Based on the knowledge distilled, we categorize KD methods as label-related and structure-related: the former distills very abstract (high-level) knowledge, e.g., logits, while the latter uses spatial (low- or medium-level feature) knowledge. However, existing KD methods are usually not explainable, i.e., we do not know what knowledge is transferred during distillation. In this work, we propose a new KD method, Explainability-based Knowledge Distillation (Exp-KD). Specifically, we propose to use the class activation map (CAM) as explainable knowledge, which can effectively capture both label- and structure-related information during distillation. We conduct extensive experiments, including image classification tasks on the CIFAR-10, CIFAR-100, and ImageNet datasets and explainability tests on ImageNet and ImageNet-Segmentation. The results show the great effectiveness and explainability of Exp-KD compared with the state-of-the-art. Code is available at https://github.com/Blenderama/Exp-KD.
{"title":"Explainability-based knowledge distillation","authors":"","doi":"10.1016/j.patcog.2024.111095","DOIUrl":"10.1016/j.patcog.2024.111095","url":null,"abstract":"<div><div>Knowledge distillation (KD) is a popular approach for deep model acceleration. Based on the knowledge distilled, we categorize KD methods as label-related and structure-related. The former distills the very abstract (high-level) knowledge, e.g., logits; and the latter uses the spatial (low- or medium-level feature) knowledge. However, existing KD methods are usually not explainable, i.e., we do not know what knowledge is transferred during distillation. In this work, we propose a new KD method, Explainability-based Knowledge Distillation (Exp-KD). Specifically, we propose to use class activation map (CAM) as the explainable knowledge which can effectively capture both label- and structure-related information during the distillation. We conduct extensive experiments, including image classification tasks on CIFAR-10, CIFAR-100 and ImageNet datasets, and explainability tests on ImageNet and ImageNet-Segmentation. The results show the great effectiveness and explainability of Exp-KD compared with the state-of-the-art. Code is available at <span><span>https://github.com/Blenderama/Exp-KD</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142593957","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Multi-task OCTA image segmentation with innovative dimension compression
Pub Date: 2024-10-30 | DOI: 10.1016/j.patcog.2024.111123
Optical Coherence Tomography Angiography (OCTA) plays a crucial role in the early detection and continuous monitoring of ocular diseases, which relies on accurate multi-tissue segmentation of retinal images. Existing OCTA segmentation methods typically focus on single-task designs that do not fully utilize the information in the volume data of these images. To bridge this gap, our study introduces H2C-Net, a novel network architecture engineered for simultaneous and precise segmentation of multiple retinal structures, including capillaries, arteries, veins, and the foveal avascular zone (FAZ). At its core, H2C-Net consists of a plug-and-play Height-Channel Module (H2C) and an Enhanced U-shaped Network (GPC-Net). The H2C module converts the height information of the OCTA volume data into channel information through a squeeze operation, realizing lossless dimensionality reduction from 3D to 2D, and provides "soft layering" information through unidirectional pooling. Meanwhile, to guide the network to focus on channels during training, the U-Net is enhanced with group normalization, a channel attention mechanism, and the Parametric Rectified Linear Unit (PReLU), which reduces the dependence on batch size and strengthens the network's ability to extract salient features. Extensive experiments on two subsets of the publicly available OCTA-500 dataset show that H2C-Net outperforms existing state-of-the-art methods. It achieves average Intersection over Union (IoU) scores of 82.84% and 88.48%, improvements of 0.81% and 1.59%, respectively, and average Dice scores of 90.40% and 93.76%, exceeding previous benchmarks by 0.42% and 0.94%. The proposed H2C-Net thus provides an efficient and accurate multi-task segmentation solution for ophthalmic diagnostics. The code is publicly available at: https://github.com/IAAI-SIT/H2C-Net.
{"title":"Multi-task OCTA image segmentation with innovative dimension compression","authors":"","doi":"10.1016/j.patcog.2024.111123","DOIUrl":"10.1016/j.patcog.2024.111123","url":null,"abstract":"<div><div>Optical Coherence Tomography Angiography (OCTA) plays a crucial role in the early detection and continuous monitoring of ocular diseases, which relies on accurate multi-tissue segmentation of retinal images. Existing OCTA segmentation methods typically focus on single-task designs that do not fully utilize the information of volume data in these images. To bridge this gap, our study introduces H2C-Net, a novel network architecture engineered for simultaneous and precise segmentation of various retinal structures, including capillaries, arteries, veins, and the fovea avascular zone (FAZ). At its core, H2C-Net consists of a plug-and-play Height-Channel Module (H2C) and an Enhanced U-shaped Network (GPC-Net). The H2C module cleverly converts the height information of the OCTA volume data into channel information through the Squeeze operation, realizes the lossless dimensionality reduction from 3D to 2D, and provides the \"Soft layering\" information by unidirectional pooling. Meanwhile, in order to guide the network to focus on channels for training, U-Net is enhanced with group normalization, channel attention mechanism, and Parametric Rectified Linear Unit (PReLU), which reduces the dependence on batch size and enhances the network's ability to extract salient features. Extensive experiments on two subsets of the publicly available OCTA-500 dataset have shown that H2C-Net outperforms existing state-of-the-art methods. It achieves average Intersection over Union (IoU) scores of 82.84 % and 88.48 %, marking improvements of 0.81 % and 1.59 %, respectively. Similarly, the average Dice scores are elevated to 90.40 % and 93.76 %, exceeding previous benchmarks by 0.42 % and 0.94 %. The proposed H2C-Net exhibits excellent performance in OCTA image segmentation, providing an efficient and accurate multi-task segmentation solution in ophthalmic diagnostics. The code is publicly available at: <span><span>https://github.com/IAAI-SIT/H2C-Net</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-30","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594027","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-modal independent matching network for image-text retrieval
Pub Date: 2024-10-29 | DOI: 10.1016/j.patcog.2024.111096
Image-text retrieval serves as a bridge connecting vision and language. Mainstream cross-modal matching methods perform cross-modal interactions effectively and achieve high performance, but they are inefficient; modality-independent matching methods are far more efficient but lag in performance. Achieving a balance between matching efficiency and performance therefore remains a challenge in image-text retrieval. In this paper, we propose a new Cross-modal Independent Matching Network (CIMN) for image-text retrieval. Specifically, we first use the proposed Feature Relationship Reasoning (FRR) to infer neighborhood and potential relations of modal features. Then, we introduce Graph Pooling (GP), based on graph convolutional networks, to perform modal global semantic aggregation. Finally, we introduce the Gravitation Loss (GL), which incorporates sample mass into the learning process; this loss corrects the matching relationships between and within modalities and avoids the equal treatment of all samples in the traditional triplet loss. Extensive experiments on the Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed method. It achieves a good balance between matching efficiency and performance, surpasses other independent matching methods in performance, and obtains retrieval accuracy comparable to some mainstream cross-matching methods with an order of magnitude lower inference time.
{"title":"Cross-modal independent matching network for image-text retrieval","authors":"","doi":"10.1016/j.patcog.2024.111096","DOIUrl":"10.1016/j.patcog.2024.111096","url":null,"abstract":"<div><div>Image-text retrieval serves as a bridge connecting vision and language. Mainstream modal cross matching methods can effectively perform cross-modal interactions with high theoretical performance. However, there is a deficiency in efficiency. Modal independent matching methods exhibit superior efficiency but lack in performance. Therefore, achieving a balance between matching efficiency and performance becomes a challenge in the field of image-text retrieval. In this paper, we propose a new Cross-modal Independent Matching Network (CIMN) for image-text retrieval. Specifically, we first use the proposed Feature Relationship Reasoning (FRR) to infer neighborhood and potential relations of modal features. Then, we introduce Graph Pooling (GP) based on graph convolutional networks to perform modal global semantic aggregation. Finally, we introduce the Gravitation Loss (GL) by incorporating sample mass into the learning process. This loss can correct the matching relationship between and within each modality, avoiding the problem of equal treatment of all samples in the traditional triplet loss. Extensive experiments on Flickr30K and MSCOCO datasets demonstrate the superiority of the proposed method. It achieves a good balance between matching efficiency and performance, surpasses other similar independent matching methods in performance, and can obtain retrieval accuracy comparable to some mainstream cross matching methods with an order of magnitude lower inference time.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-29","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594029","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fully exploring object relation interaction and hidden state attention for video captioning
Pub Date: 2024-10-28 | DOI: 10.1016/j.patcog.2024.111138
Video Captioning (VC) is the challenging task of automatically generating natural language sentences that describe video content. Since a video often contains multiple objects, it is crucial to identify them and model the relationships between them. Previous models usually adopt Graph Convolutional Networks (GCNs) to infer relational information via object nodes, but such relational reasoning suffers from uncertainty and over-smoothing. To tackle these issues, we propose a Knowledge Graph based Video Captioning Network (KG-VCN) that fully explores object relation interaction, hidden states, and attention enhancement. In the encoding stage, we present a Graph and Convolution Hybrid Encoder (GCHE), which uses an object detector to find visual objects with bounding boxes for the Knowledge Graph (KG) and a Convolutional Neural Network (CNN). To model intrinsic relations between detected objects, we propose a knowledge graph based Object Relation Graph Interaction (ORGI) module. In ORGI, we design triplets (head, relation, tail) to efficiently mine object relations and create a global node that enables adequate information flow among all graph nodes, avoiding possibly missed relations. To produce accurate and rich captions, we propose a hidden State and Attention Enhanced Decoder (SAED) that integrates hidden states and dynamically updated attention features. Our SAED accepts both relational and visual features, adopts a Long Short-Term Memory (LSTM) network to produce hidden states, and dynamically updates attention features. Unlike existing methods, we concatenate state and attention features to predict the next word sequentially. Experiments on three well-known datasets (MSVD, MSR-VTT, and VaTeX) show that our model achieves impressive results, significantly outperforming existing state-of-the-art models.
{"title":"Fully exploring object relation interaction and hidden state attention for video captioning","authors":"","doi":"10.1016/j.patcog.2024.111138","DOIUrl":"10.1016/j.patcog.2024.111138","url":null,"abstract":"<div><div>Video Captioning (VC) is a challenging task of automatically generating natural language sentences for describing video contents. As a video often contains multiple objects, it is comprehensively crucial to identify multiple objects and model relationships between them. Previous models usually adopt Graph Convolutional Networks (GCN) to infer relational information via object nodes, but there exist uncertainty and over-smoothing issues of relational reasoning. To tackle these issues, we propose a Knowledge Graph based Video Captioning Network (KG-VCN) by fully exploring object relation interaction, hidden state and attention enhancement. In encoding stages, we present a Graph and Convolution Hybrid Encoder (GCHE), which uses an object detector to find visual objects with bounding boxes for Knowledge Graph (KG) and Convolutional Neural Network (CNN). To model intrinsic relations between detected objects, we propose a knowledge graph based Object Relation Graph Interaction (ORGI) module. In ORGI, we design triplets (<em>head, relation, tail</em>) to efficiently mine object relations, and create a global node to enable adequate information flow among all graph nodes for avoiding possibly missed relations. To produce accurate and rich captions, we propose a hidden State and Attention Enhanced Decoder (SAED) by integrating hidden states and dynamically updated attention features. Our SAED accepts both relational and visual features, adopts Long Short-Term Memory (LSTM) to produce hidden states, and dynamically update attention features. Unlike existing methods, we concatenate state and attention features to predict next word sequentially. To demonstrate the effectiveness of our model, we conduct experiments on three well-known datasets (MSVD, MSR-VTT, VaTeX), and our model achieves impressive results significantly outperforming existing state-of-the-art models.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572367","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A Newton interpolation network for smoke semantic segmentation
Pub Date: 2024-10-28 | DOI: 10.1016/j.patcog.2024.111119
Smoke exhibits large variations in visual appearance, which are very adverse to visual segmentation. Furthermore, its semi-transparency often produces highly complicated mixtures of smoke and background. These factors make labelling and segmenting smoke regions very difficult. To improve the accuracy of smoke segmentation, we propose a Newton Interpolation Network (NINet) for visual smoke semantic segmentation. Instead of simply concatenating or point-wise adding multi-scale encoded feature maps for information fusion or re-use, we design a Newton Interpolation Module (NIM) that extracts structured information by analyzing feature values at the same position across encoded feature maps of different scales. Features interpolated by our NIM contain long-range dependencies and semantic structures across levels, whereas traditional fusion of multi-scale feature maps cannot model the intrinsic structures embedded in these maps. To obtain multi-scale structured information, we repeatedly apply the proposed NIM at different levels of the decoding stage. In addition, we use more encoded feature maps to construct a higher-order Newton interpolation polynomial for extracting higher-order information. Extensive experiments validate that our method significantly outperforms existing state-of-the-art algorithms on virtual and real smoke datasets, and ablation studies validate the effectiveness of our NIMs.
{"title":"A newton interpolation network for smoke semantic segmentation","authors":"","doi":"10.1016/j.patcog.2024.111119","DOIUrl":"10.1016/j.patcog.2024.111119","url":null,"abstract":"<div><div>Smoke has large variances of visual appearances that are very adverse to visual segmentation. Furthermore, its semi-transparency often produces highly complicated mixtures of smoke and backgrounds. These factors lead to great difficulties in labelling and segmenting smoke regions. To improve accuracy of smoke segmentation, we propose a Newton Interpolation Network (NINet) for visual smoke semantic segmentation. Unlike simply concatenating or point-wisely adding multi-scale encoded feature maps for information fusion or re-usage, we design a Newton Interpolation Module (NIM) to extract structured information by analyzing the feature values in the same position but from encoded feature maps with different scales. Interpolated features by our NIM contain long-range dependency and semantic structures across different levels, but traditional fusion of multi-scale feature maps cannot model intrinsic structures embedded in these maps. To obtain multi-scale structured information, we repeatedly use the proposed NIM at different levels of the decoding stages. In addition, we use more encoded feature maps to construct a higher order Newton interpolation polynomial for extracting higher order information. Extensive experiments validate that our method significantly outperforms existing state-of-the-art algorithms on virtual and real smoke datasets, and ablation experiments also validate the effectiveness of our NIMs.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142572368","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Exploring sample relationship for few-shot classification
Pub Date: 2024-10-28 | DOI: 10.1016/j.patcog.2024.111089
Few-shot classification (FSC) is a challenging problem that aims to identify novel classes with limited samples. Most existing methods employ vanilla transfer learning or episodic meta-training to learn a feature extractor, and then measure the similarity between the query image and the few support examples of novel classes. However, these approaches learn feature representations from individual images only, overlooking the interrelationships among images. This neglect hinders the attainment of more discriminative feature representations and thus limits the potential improvement of few-shot classification performance. To address this issue, we propose a Sample Relationship Exploration (SRE) module comprising Sample-level Attention (SA), Explicit Guidance (EG), and Channel-wise Adaptive Fusion (CAF) components to learn discriminative category-related features. Specifically, we first employ the SA component to explore similarity relationships among samples and obtain aggregated features of similar samples. To enhance the robustness of these features, the EG component explicitly guides the learning of sample relationships by providing an ideal affinity map among samples. Finally, the CAF component performs weighted fusion of the original and aggregated features, yielding category-related embeddings. The proposed method is a plug-and-play module that can be embedded into both transfer learning and meta-learning based few-shot classification frameworks. Extensive experiments on benchmark datasets show that the proposed module effectively improves performance over baseline models and performs competitively against state-of-the-art algorithms. The source code is available at https://github.com/Chenguoz/SRE.
{"title":"Exploring sample relationship for few-shot classification","authors":"","doi":"10.1016/j.patcog.2024.111089","DOIUrl":"10.1016/j.patcog.2024.111089","url":null,"abstract":"<div><div>Few-shot classification (FSC) is a challenging problem, which aims to identify novel classes with limited samples. Most existing methods employ vanilla transfer learning or episodic meta-training to learn a feature extractor, and then measure the similarity between the query image and the few support examples of novel classes. However, these approaches merely learn feature representations from individual images, overlooking the exploration of the interrelationships among images. This neglect can hinder the attainment of more discriminative feature representations, thus limiting the potential improvement of few-shot classification performance. To address this issue, we propose a Sample Relationship Exploration (SRE) module comprising the Sample-level Attention (SA), Explicit Guidance (EG) and Channel-wise Adaptive Fusion (CAF) components, to learn discriminative category-related features. Specifically, we first employ the SA component to explore the similarity relationships among samples and obtain aggregated features of similar samples. Furthermore, to enhance the robustness of these features, we introduce the EG component to explicitly guide the learning of sample relationships by providing an ideal affinity map among samples. Finally, the CAF component is adopted to perform weighted fusion of the original features and the aggregated features, yielding category-related embeddings. The proposed method is a plug-and-play module which can be embedded into both transfer learning and meta-learning based few-shot classification frameworks. Extensive experiments on benchmark datasets show that the proposed module can effectively improve the performance over baseline models, and also perform competitively against the state-of-the-art algorithms. The source code is available at <span><span>https://github.com/Chenguoz/SRE</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142594025","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SeaTrack: Rethinking Observation-Centric SORT for Robust Nearshore Multiple Object Tracking
Pub Date: 2024-10-28 | DOI: 10.1016/j.patcog.2024.111091
Nearshore Multiple Object Tracking (NMOT) aims to locate and associate nearshore objects. Current approaches rely on Automatic Identification Systems (AIS) and radar to accomplish this task. However, video signals can describe the visual appearance of nearshore objects without prior information such as identity, location, or motion, and sea clutter does not affect the capture of living objects by visual sensors. Recognizing this, we analyze three key long-term challenges of vision-based NMOT and propose a tracking pipeline that relies solely on motion information. Maritime objects are highly susceptible to being obscured or submerged by waves, resulting in fragmented tracklets. We first introduce guiding modulation to address the long-term occlusion and interaction of maritime objects. We then model confidence, altitude, and angular momentum to mitigate the effects of motion blur, ringing, and overshoot artifacts on observations in unstable imaging environments. Additionally, we design a motion fusion mechanism that combines long-term macro tracklets with short-term fine-grained tracklets; this correction mechanism reduces the estimation variance of the Kalman Filter (KF) and alleviates the substantial nonlinear motion of maritime objects. We call this pipeline SeaTrack. It remains simple, online, and real-time, and demonstrates excellent performance and scalability in benchmark evaluations.
{"title":"SeaTrack: Rethinking Observation-Centric SORT for Robust Nearshore Multiple Object Tracking","authors":"","doi":"10.1016/j.patcog.2024.111091","DOIUrl":"10.1016/j.patcog.2024.111091","url":null,"abstract":"<div><div>Nearshore Multiple Object Tracking (NMOT) aims to locate and associate nearshore objects. Current approaches utilize Automatic Identification Systems (AIS) and radar to accomplish this task. However, video signals can describe the visual appearance of nearshore objects without prior information such as identity, location, or motion. In addition, sea clutter will not affect the capture of living objects by visual sensors. Recognizing this, we analyzed three key long-term challenges of the vision-based NMOT and proposed a tracking pipeline that relies solely on motion information. Maritime objects are highly susceptible to being obscured or submerged by waves, resulting in fragmented tracklets. We first introduced guiding modulation to address the long-term occlusion and interaction of maritime objects. Subsequently, we modeled confidence, altitude, and angular momentum to mitigate the effects of motion blur, ringing, and overshoot artifacts to observations in unstable imaging environments. Additionally, we designed a motion fusion mechanism that combines long-term macro tracklets with short-term fine-grained tracklets. This correction mechanism helps reduce the estimation variance of the Kalman Filter (KF) to alleviate the substantial nonlinear motion of maritime objects. We call this pipeline SeaTrack, which remains simple, online, and real-time, demonstrating excellent performance and scalability in benchmark evaluations.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":null,"pages":null},"PeriodicalIF":7.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142561362","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}