Mineral segmentation using electron microscope images and spectral sampling through multimodal graph neural networks
Pub Date : 2025-04-22 | DOI: 10.1016/j.patrec.2025.04.012
Samuel Repka, Bořek Reich, Fedor Zolotarev, Tuomas Eerola, Pavel Zemčík
We propose a novel Graph Neural Network-based method for mineral segmentation built on data fusion of multimodal Scanning Electron Microscope (SEM) images. In most cases, Backscattered Electron (BSE) images obtained using SEM do not contain sufficient information for mineral segmentation. Therefore, imaging is often complemented with point-wise Energy-Dispersive X-ray Spectroscopy (EDS) spectral measurements, which provide highly accurate information about the chemical composition but are time-consuming to acquire. This motivates the use of sparse spectral data in conjunction with BSE images for mineral segmentation. The unstructured nature of the spectral data makes most traditional image fusion techniques unsuitable for BSE-EDS fusion. We propose using graph neural networks to fuse the two modalities and segment the mineral phases simultaneously. Our results demonstrate that providing EDS data for as few as 1% of BSE pixels produces accurate segmentation, enabling rapid analysis of mineral samples. The proposed data fusion pipeline is versatile and can be adapted to other domains that involve image data and point-wise measurements.
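A minimal sketch of fusing dense image features with sparse point-wise measurements via message passing, in the spirit of the abstract; the graph construction (pixels as nodes, EDS spectra attached to ~1% of them), layer shapes, and variable names are illustrative assumptions, not the paper's architecture:

```python
import torch
import torch.nn as nn

class FusionGNNLayer(nn.Module):
    """One round of message passing over a pixel graph carrying sparse EDS features."""
    def __init__(self, dim):
        super().__init__()
        self.msg = nn.Linear(2 * dim, dim)
        self.upd = nn.GRUCell(dim, dim)

    def forward(self, h, edge_index):
        src, dst = edge_index                                   # (2, num_edges)
        m = torch.relu(self.msg(torch.cat([h[src], h[dst]], dim=-1)))
        agg = torch.zeros_like(h).index_add_(0, dst, m)         # sum messages per node
        return self.upd(agg, h)                                 # gated node update

# Node features: BSE descriptor everywhere, EDS spectrum only where measured.
num_px, bse_dim, eds_dim, dim = 10_000, 32, 128, 64
bse = torch.randn(num_px, bse_dim)
eds = torch.zeros(num_px, eds_dim)                  # zeros where no spectrum was taken
measured = torch.rand(num_px) < 0.01                # ~1% of pixels carry EDS data
eds[measured] = torch.randn(int(measured.sum()), eds_dim)
h = nn.Linear(bse_dim + eds_dim, dim)(torch.cat([bse, eds], dim=1))
edge_index = torch.randint(0, num_px, (2, 40_000))  # stand-in for 4-neighbour grid edges
h = FusionGNNLayer(dim)(h, edge_index)              # EDS information diffuses to BSE-only nodes
```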
{"title":"Mineral segmentation using electron microscope images and spectral sampling through multimodal graph neural networks","authors":"Samuel Repka , Bořek Reich , Fedor Zolotarev , Tuomas Eerola , Pavel Zemčík","doi":"10.1016/j.patrec.2025.04.012","DOIUrl":"10.1016/j.patrec.2025.04.012","url":null,"abstract":"<div><div>We propose a novel Graph Neural Network-based method for segmentation based on data fusion of multimodal Scanning Electron Microscope (SEM) images. In most cases, Backscattered Electron (BSE) images obtained using SEM do not contain sufficient information for mineral segmentation. Therefore, imaging is often complemented with point-wise Energy-Dispersive X-ray Spectroscopy (EDS) spectral measurements that provide highly accurate information about the chemical composition but that are time-consuming to acquire. This motivates the use of sparse spectral data in conjunction with BSE images for mineral segmentation. The unstructured nature of the spectral data makes most traditional image fusion techniques unsuitable for BSE-EDS fusion. We propose using graph neural networks to fuse the two modalities and segment the mineral phases simultaneously. Our results demonstrate that providing EDS data for as few as 1% of BSE pixels produces accurate segmentation, enabling rapid analysis of mineral samples. The proposed data fusion pipeline is versatile and can be adapted to other domains that involve image data and point-wise measurements.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 79-85"},"PeriodicalIF":3.9,"publicationDate":"2025-04-22","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143869163","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On the representation of sparse stochastic matrices with state embedding
Pub Date : 2025-04-21 | DOI: 10.1016/j.patrec.2025.04.011
Jugurta Montalvão, Gabriel Bastos, Rodrigo Sousa, Ataíde Gualberto
Embeddings are adjusted so that points represent states and observations in Markov models, with conditional probabilities approximately encoded as exponentials of (negative) distances, jointly scaled by a density factor. It is shown that the goodness of this approximation can be controlled, mainly by choosing the embedding dimension as a function of the entropies associated with the corresponding Markov model. Therefore, for sparse (low-entropy) models, representation as state embeddings can save memory and allows fully geometric versions of probabilistic algorithms, such as the Viterbi algorithm, taken as an example in this work. Besides, evidence is also gathered in favor of potentially useful properties that emerge from the geometric representation of Markov models, such as analogies, superstates (aggregation), and semantic fields.
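A minimal sketch of the encoding described above, assuming Euclidean distances, a scalar density factor tau, and row-wise normalization; the geometric Viterbi simply swaps negative log-probabilities for scaled distances. All names and the exact normalization are illustrative, not the paper's notation:

```python
import numpy as np

def transitions_from_embeddings(E, tau=1.0):
    """Recover a stochastic matrix from state embeddings.

    E   : (n_states, dim) array of state embeddings.
    tau : density factor scaling the distances.
    Returns P with P[i, j] proportional to exp(-d(e_i, e_j) / tau), rows normalized.
    """
    d = np.linalg.norm(E[:, None, :] - E[None, :, :], axis=-1)  # pairwise distances
    P = np.exp(-d / tau)
    return P / P.sum(axis=1, keepdims=True)

def geometric_viterbi(E_states, E_obs, obs_seq, tau=1.0):
    """Viterbi decoding where -log probabilities are replaced by scaled distances."""
    trans_cost = np.linalg.norm(E_states[:, None] - E_states[None, :], axis=-1) / tau
    emit_cost = np.linalg.norm(E_states[:, None] - E_obs[None, :], axis=-1) / tau
    cost = emit_cost[:, obs_seq[0]].copy()      # initial per-state cost
    back = []
    for o in obs_seq[1:]:
        total = cost[:, None] + trans_cost + emit_cost[None, :, o]
        back.append(total.argmin(axis=0))       # best predecessor per state
        cost = total.min(axis=0)
    path = [int(cost.argmin())]
    for b in reversed(back):
        path.append(int(b[path[-1]]))
    return path[::-1]
```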
{"title":"On the representation of sparse stochastic matrices with state embedding","authors":"Jugurta Montalvão, Gabriel Bastos, Rodrigo Sousa, Ataíde Gualberto","doi":"10.1016/j.patrec.2025.04.011","DOIUrl":"10.1016/j.patrec.2025.04.011","url":null,"abstract":"<div><div>Embeddings are adjusted to allow points representing states and observations in Markov models, where conditional probabilities are approximately encoded as the exponential of (negative) distances, jointly scaled by a density factor. It is shown that the goodness of this approximation can be managed, mainly if the embedding dimension is chosen in function of entropies associated to the corresponding Markov model. Therefore, for sparse (low entropy) models, their representation as state embeddings can save memory and allow fully geometric versions of probabilistic algorithms, as the Viterbi, taken as an example in this work. Besides, evidences are also gathered in favor of potentially useful properties that emerge from the geometric representation of Markov models, such as analogies, superstates (aggregation) and semantic fields.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 71-78"},"PeriodicalIF":3.9,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143869162","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Energy-based pseudo-label refining for source-free domain adaptation
Pub Date : 2025-04-21 | DOI: 10.1016/j.patrec.2025.04.004
Xinru Meng, Han Sun, Jiamei Liu, Ningzhong Liu, Huiyu Zhou
Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence scores, leading to negative transfer due to significant label noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores, and global and class-wise energy thresholds are computed to selectively filter them. Furthermore, a contrastive learning strategy is introduced to handle difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, where it consistently outperforms state-of-the-art methods.
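A minimal sketch of energy-based filtering, assuming the standard free-energy score E(x) = -T * logsumexp(f(x)/T) over classifier logits; the rule of keeping samples that pass both a global and a per-class energy quantile is an illustrative reading of the abstract, not the paper's exact procedure:

```python
import torch

def energy_scores(logits, T=1.0):
    # Free-energy score: lower energy means the model is more confident.
    return -T * torch.logsumexp(logits / T, dim=1)

def filter_pseudo_labels(logits, q=0.5, T=1.0):
    """Keep pseudo-labels whose energy passes a global AND a class-wise threshold."""
    energy = energy_scores(logits, T)
    pseudo = logits.argmax(dim=1)
    keep = energy <= energy.quantile(q)            # global threshold
    for c in pseudo.unique():
        mask = pseudo == c
        keep[mask] &= energy[mask] <= energy[mask].quantile(q)  # class threshold
    return pseudo, keep

logits = torch.randn(256, 31)                      # e.g. Office-31 has 31 classes
labels, keep = filter_pseudo_labels(logits)        # train only on labels[keep]
```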
{"title":"Energy-based pseudo-label refining for source-free domain adaptation","authors":"Xinru Meng , Han Sun , Jiamei Liu , Ningzhong Liu , Huiyu Zhou","doi":"10.1016/j.patrec.2025.04.004","DOIUrl":"10.1016/j.patrec.2025.04.004","url":null,"abstract":"<div><div>Source-free domain adaptation (SFDA), which involves adapting models without access to source data, is both demanding and challenging. Existing SFDA techniques typically rely on pseudo-labels generated from confidence levels, leading to negative transfer due to significant noise. To tackle this problem, Energy-Based Pseudo-Label Refining (EBPR) is proposed for SFDA. Pseudo-labels are created for all sample clusters according to their energy scores. Global and class energy thresholds are computed to selectively filter pseudo-labels. Furthermore, a contrastive learning strategy is introduced to filter difficult samples, aligning them with their augmented versions to learn more discriminative features. Our method is validated on the Office-31, Office-Home, and VisDA-C datasets, consistently finding that our model outperformed state-of-the-art methods.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 50-55"},"PeriodicalIF":3.9,"publicationDate":"2025-04-21","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143863614","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Fake News Detection using Hashtag Context
Pub Date : 2025-04-20 | DOI: 10.1016/j.patrec.2025.04.008
Sujit Kumar, Shifali Agrahari, Priyank Soni, Aayush Sachdeva, Sanasam Ranbir Singh
The proliferation of social media platforms has resulted in an exponential increase in user-generated content, facilitating the rapid and widespread dissemination of information. However, this ease of sharing has also paved the way for the spread of false or misleading information, commonly known as fake news, which can have harmful effects on society. Existing studies rely on the content of source posts, social interaction networks, and external evidence to verify the authenticity of posts. However, these studies fail to detect fake news in the following cases. (i) Sparsity and the limited number of words in social media posts heavily affect the performance of content-based methods. (ii) Social interaction-based methods require a large social interaction network for a given source post, which is often unavailable. (iii) Social media discussions sometimes precede or surpass mainstream media reporting and information from external sources such as knowledge bases and Wikipedia; in such circumstances, external information that would help verify the authenticity of a post is not readily available. To address these limitations, this study proposes Hashtag Context-aware Fake News Detection (HCFND). HCFND leverages posts published under the hashtags mentioned in the source post, together with relevant posts retrieved via the named entities it mentions, as external information from communities interested in similar topics. Extracting this external information from posts under relevant hashtags and profiles mentioned in source tweets enables HCFND to cross-reference the content of the source post with data from communities sharing similar interests, thereby facilitating authenticity verification. We evaluate the proposed model on three publicly available benchmark datasets. The results indicate that it outperforms existing state-of-the-art methods.
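A minimal sketch of the context-gathering step, assuming a generic `search_posts` retrieval function (hypothetical; the paper's actual collection pipeline is not specified here):

```python
import re

HASHTAG = re.compile(r"#(\w+)")
MENTION = re.compile(r"@(\w+)")

def hashtag_context(source_post, search_posts, k=20):
    """Collect community posts that share hashtags or mentioned profiles
    with the source post, to be cross-referenced against its content."""
    context = []
    for tag in HASHTAG.findall(source_post):
        context.extend(search_posts(f"#{tag}", limit=k))      # posts under the hashtag
    for user in MENTION.findall(source_post):
        context.extend(search_posts(f"from:{user}", limit=k)) # posts from mentioned profiles
    return context
```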
{"title":"Fake News Detection using Hashtag Context","authors":"Sujit Kumar, Shifali Agrahari, Priyank Soni, Aayush Sachdeva, Sanasam Ranbir Singh","doi":"10.1016/j.patrec.2025.04.008","DOIUrl":"10.1016/j.patrec.2025.04.008","url":null,"abstract":"<div><div>The proliferation of social media platforms has resulted in an exponential increase in user-generated content, facilitating the rapid and widespread dissemination of information. However, this ease of sharing content has also paved the way for the spread of false or misleading information, commonly known as fake news, which can have harmful effects on society. Existing studies in the literature rely on content in source posts, social interaction networks, and external evidence to verify the authenticity of the posts. However, studies in the literature fails to detect fake news in the following case. (i) Sparsity and limited words in social media posts heavily affect the performance of content-based methods. (ii) Social interaction-based methods require a huge social interaction network for a given source post, which is easily unavailable for every social media post. (iii) Social media discussions sometimes precede or surpass mainstream media reporting and information from external sources such as Knowledge Base and Wikipedia. Consequently, in such circumstances, getting external information that will help verify the authenticity of social media posts is not readily available. To address the above-mentioned limitations, this study proposes <em>Hashtag Context-aware Fake News Detection</em> (HCFND). Our proposed model, HCFND, leverages information posted under the hashtags mentioned in the source post and relevant posts extracted from named entities mentioned in the source post as external sources of information from the community with interest in similar topics. The extraction of external information from posts under relevant hashtags and profiles mentioned in source tweets enables the HCFND to cross-reference the content of the source post with data from communities sharing similar interests, thereby facilitating the verification of the authenticity of social media posts. We evaluate the performances of the proposed model on three publicly available benchmark datasets. The results indicate that our proposed model outperforms existing state-of-the-art methods in the literature.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 43-49"},"PeriodicalIF":3.9,"publicationDate":"2025-04-20","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143858948","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
From visual features to key concepts: A Dynamic and Static Concept-driven approach for video captioning
Pub Date : 2025-04-19 | DOI: 10.1016/j.patrec.2025.04.007
Xin Ren, Yufeng Han, Bing Wei, Xue-song Tang, Kuangrong Hao
In video captioning, accurately identifying and summarizing key concepts while ignoring irrelevant details remains a significant challenge. Mainstream approaches often suffer from the inclusion of semantically irrelevant features, leading to inaccuracies and hallucinations in the generated captions. This study develops a novel framework, the Dynamic and Static Concept-driven video captioning model (DiSCo), to enhance the accuracy and coherence of video captions by effectively leveraging pre-trained models and addressing semantic irrelevance. DiSCo builds upon the conventional encoder–decoder architecture by incorporating a Semantic Feature Extractor (SFE) and a Static-Dynamic Concept Detector (S-DCD). The SFE filters out semantically irrelevant features extracted by the visual model, while the S-DCD identifies critical concepts to guide the large language model (LLM) in generating captions. Both the visual model and the LLM are pre-trained with frozen parameters; only the SFE and S-DCD are trained, optimizing the feature extraction and concept detection processes. Comprehensive experiments on the MSVD and MSR-VTT datasets show that DiSCo significantly outperforms existing methods, achieving notable improvements in the quality and relevance of the generated captions, and demonstrating a robust way to combine semantic feature extraction with concept-driven guidance.
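A minimal sketch of the training setup implied above (frozen visual encoder and LLM, trainable adapter modules); the gating form of the SFE, the stand-in backbones, and all dimensions are illustrative assumptions:

```python
import torch
import torch.nn as nn

class SFE(nn.Module):
    """Illustrative Semantic Feature Extractor: gates visual tokens."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(dim, dim), nn.Sigmoid())

    def forward(self, feats):               # feats: (batch, tokens, dim)
        return feats * self.gate(feats)     # down-weight irrelevant tokens

# Stand-ins for the frozen pre-trained backbones (real models would go here).
visual_model = nn.Linear(768, 768)
llm = nn.Linear(768, 768)
for p in list(visual_model.parameters()) + list(llm.parameters()):
    p.requires_grad = False                 # backbones stay frozen

sfe = SFE(768)                              # only SFE (and S-DCD) receive gradients
optimizer = torch.optim.AdamW(sfe.parameters(), lr=1e-4)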
{"title":"From visual features to key concepts: A Dynamic and Static Concept-driven approach for video captioning","authors":"Xin Ren, Yufeng Han, Bing Wei, Xue-song Tang, Kuangrong Hao","doi":"10.1016/j.patrec.2025.04.007","DOIUrl":"10.1016/j.patrec.2025.04.007","url":null,"abstract":"<div><div>In video captioning, accurately identifying and summarizing key concepts while ignoring irrelevant details remains a significant challenge. Mainstream approaches often suffer from the inclusion of semantically irrelevant features, leading to inaccuracies and hallucinations in the generated captions. This study aims to develop a novel framework, <strong>D</strong>ynam<strong>i</strong>c and <strong>S</strong>tatic <strong>Co</strong>ncept-driven video captioning model(DiSCo), to enhance the accuracy and coherence of video captions by effectively leveraging pre-trained models and addressing the issue of semantic irrelevance. DiSCo builds upon the conventional encoder–decoder architecture by incorporating a Semantic Feature Extractor (SFE) and a Static-Dynamic Concept Detector (S-DCD). The SFE filters out semantically irrelevant features extracted by the visual model, while the S-DCD identifies critical concepts to guide the large language model (LLM) in generating captions. Both the visual model and the LLM are pre-trained and their parameters are frozen; only the SFE and S-DCD are trained to optimize the feature extraction and concept detection processes. Comprehensive experiments conducted on the MSVD and MSR-VTT datasets show that DiSCo significantly outperforms existing methods, achieving notable improvements in the quality and relevance of the generated captions. The proposed DiSCo framework demonstrates a robust solution for enhancing the accuracy and coherence of video captions by effectively integrating semantic feature extraction and concept-driven guidance.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 64-70"},"PeriodicalIF":3.9,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143863618","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Panoptic segmentation-based semantic embedding matching model for scene graph generation
Pub Date : 2025-04-19 | DOI: 10.1016/j.patrec.2025.04.005
Ming Zhao, Jing Zhang
Scene Graph Generation aims to construct a structured representation of entities and their relationships in an image. Traditional methods use object detection for entity localization but struggle with relationship modeling in complex scenes. Most approaches also face challenges in predicate classification due to inter-class similarity and intra-class variability. Additionally, when multiple entities are present in an image, the contextual information between them is crucial. To address these challenges, this paper proposes a Panoptic Segmentation-based Semantic Embedding Matching Network, which optimizes the entire process from entity localization to entity-pair and predicate prediction. Specifically, we use a panoptic segmentation module to locate all entities (including foreground and background), providing comprehensive support for predicate prediction in complex scenes. Simultaneously, a semantic embedding module is introduced to fuse the visual and semantic features of entities and predicates, respectively, constructing a similarity-based matching mechanism. Furthermore, we incorporate a graph attention network before the semantic embedding of entities, effectively capturing contextual information among multiple entities and dynamically adjusting the semantic embedding module. Experiments on the PSG dataset validate the proposed method's effectiveness. The results show that our model outperforms existing methods in relationship detection and generation in complex scenes.
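A minimal sketch of similarity-based matching between fused entity-pair features and predicate embeddings, assuming cosine similarity, a temperature scale, and illustrative tensor shapes (the paper's exact matching head may differ):

```python
import torch
import torch.nn.functional as F

def match_predicates(pair_feats, predicate_embeds, temperature=0.07):
    """pair_feats:       (num_pairs, dim) fused visual+semantic entity-pair features
       predicate_embeds: (num_predicates, dim) semantic embeddings of predicate classes
       Returns per-pair predicate logits from cosine similarity."""
    pair_feats = F.normalize(pair_feats, dim=-1)
    predicate_embeds = F.normalize(predicate_embeds, dim=-1)
    return pair_feats @ predicate_embeds.T / temperature

logits = match_predicates(torch.randn(12, 256), torch.randn(56, 256))  # PSG has 56 predicates
pred = logits.argmax(dim=-1)                       # most similar predicate per entity pair
```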
{"title":"Panoptic segmentation-based semantic embedding matching model for scene graph generation","authors":"Ming Zhao, Jing Zhang","doi":"10.1016/j.patrec.2025.04.005","DOIUrl":"10.1016/j.patrec.2025.04.005","url":null,"abstract":"<div><div>Scene Graph Generation aims to construct a structured representation of entities and their relationships in an image. Traditional methods use object detection for entity localization but struggle with relationship modeling in complex scenes. Most approaches also face challenges in predicate classification due to inter-class similarity and intra-class variability. Additionally, when multiple entities are present in an image, the contextual information between them are crucial. To address these challenges, this paper proposes a Panoptic Segmentation-based Semantic Embedding Matching Network, which optimizes the entire process from entity localization to entity-pair and predicate prediction. Specifically, we use a panoptic segmentation module to locate all entities (including the foreground and background), providing comprehensive support for predicate prediction in complex scenes. Simultaneously, a semantic embedding module is introduced to fuse the visual and semantic features of entities and predicates respectively, constructing a similarity-based matching mechanism. Furthermore, we incorporate a graph attention network before the semantic embedding of entities, effectively capturing contextual information among multiple entities and dynamically adjusting the semantic embedding module. Experiments on the PSG dataset validate the proposed method’s effectiveness. The results show that our model outperforms existing methods in relationship detection and generation in complex scenes.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 56-63"},"PeriodicalIF":3.9,"publicationDate":"2025-04-19","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143863617","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Information enhancement graph representation learning
Pub Date : 2025-04-17 | DOI: 10.1016/j.patrec.2025.04.006
Jince Wang, Jian Peng, Feihu Huang, Sirui Liao, Pengxiang Zhan, Peiyu Yi
Graph representation learning is a fundamental research focus in complex networks. Graph neural networks design effective filters and perform well in downstream tasks. From first principles, the fundamental goal of graph representation learning is to obtain neighbor information that decreases the uncertainty of target nodes. Based on partial information decomposition (PID), this paper finds that existing node aggregation strategies do not obtain sufficient information gain from neighbors. Furthermore, a graph contains a huge number of nodes, making mutual information decomposition challenging. Thus, this paper defines Partial Information Decomposition on Graph (PIDG) as a coarse-grained PID, designs a gate to learn representations for the information gains from neighbor nodes, and builds an Information Enhancement (IE) module that enhances nodes' representation capabilities by combining various forms of information from neighboring nodes. This work achieves information enhancement for the nodes in a graph and is verified on real-world datasets.
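A minimal sketch of a gate that weighs the information gain contributed by each neighbor before aggregation; the gating form and update rule are illustrative assumptions, not the paper's exact PIDG formulation:

```python
import torch
import torch.nn as nn

class GatedInfoAggregation(nn.Module):
    """Aggregate neighbor features scaled by a learned information-gain gate."""
    def __init__(self, dim):
        super().__init__()
        self.gate = nn.Sequential(nn.Linear(2 * dim, 1), nn.Sigmoid())
        self.proj = nn.Linear(dim, dim)

    def forward(self, h, edge_index):
        # h: (num_nodes, dim); edge_index: (2, num_edges) with rows (src, dst)
        src, dst = edge_index
        g = self.gate(torch.cat([h[dst], h[src]], dim=-1))   # gain of src for dst
        msg = g * self.proj(h[src])                          # gated messages
        out = torch.zeros_like(h)
        out.index_add_(0, dst, msg)                          # sum messages into targets
        return h + out                                       # enhanced node representation
```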
{"title":"Information enhancement graph representation learning","authors":"Jince Wang , Jian Peng , Feihu Huang , Sirui Liao , Pengxiang Zhan , Peiyu Yi","doi":"10.1016/j.patrec.2025.04.006","DOIUrl":"10.1016/j.patrec.2025.04.006","url":null,"abstract":"<div><div>Graph representation learning is an important and fundamental research concentration in complex networks. Graph neural networks design excellent filters and perform positively in downstream tasks. From first principles, the fundamental goal of graph representation learning is to obtain neighbor information to decrease the uncertainty of target nodes. Based on the partial information decomposition (PID), this paper finds that the existing node aggregation strategy does not obtain sufficient information gain from neighbors. Furthermore, the graph contains a huge number of nodes, making mutual information decomposition challenging. Thus, this paper defines Partial Information Decomposition on Graph (PIDG) as a coarse-grained PID, designs a gate to learn the representations for information gains from neighbor nodes, and builds Information Enhancement (IE) module, which enhances nodes’ representation capabilities by combining various forms of information from neighboring nodes. This work achieves information enhancement about the nodes in a graph and is verified on authentic datasets.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 36-42"},"PeriodicalIF":3.9,"publicationDate":"2025-04-17","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143851981","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Voxel and deep learning based depth complementation for transparent objects
Pub Date : 2025-04-15 | DOI: 10.1016/j.patrec.2025.04.003
Jiaqi Li, Shuhuan Wen, Di Lu, Linxiang Li, Hong Zhang
For the problem of missing depth values for transparent objects in the depth channel captured by an RGB-D camera, a voxel-based deep learning depth-completion algorithm for transparent objects is proposed. We map the image to 3D voxel space, calculate the effective point cloud from the input depth map, and obtain the occupied voxels by a boundary test method. Combined with the camera ray direction, the occupied voxels are filtered to those that intersect the camera rays. Using the image features contained in the RGB image and the valid points in the intersecting voxels calculated from the point cloud, a multi-layer perceptron predicts the missing depth of the object, and the depth values are optimized under a surface normal consistency constraint. The proposed algorithm achieves improvements of 12.55%, 0.6%, and 1.63% over ClearGrasp in the metrics δ1.05, δ1.10, and δ1.25, respectively.
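A minimal sketch of the back-projection and voxel-occupancy step, assuming a pinhole intrinsic matrix K, zero-encoded missing depth, and an axis-aligned voxel grid (all sizes illustrative):

```python
import numpy as np

def depth_to_points(depth, K):
    """Back-project a depth map (H, W) in meters to an (N, 3) point cloud."""
    H, W = depth.shape
    u, v = np.meshgrid(np.arange(W), np.arange(H))
    z = depth.ravel()
    valid = z > 0                                   # missing depth is encoded as 0
    x = (u.ravel() - K[0, 2]) * z / K[0, 0]
    y = (v.ravel() - K[1, 2]) * z / K[1, 1]
    return np.stack([x, y, z], axis=1)[valid]

def occupied_voxels(points, origin, voxel_size, grid_shape):
    """Mark voxels containing at least one valid point."""
    idx = np.floor((points - origin) / voxel_size).astype(int)
    inside = np.all((idx >= 0) & (idx < grid_shape), axis=1)   # boundary test
    occ = np.zeros(grid_shape, dtype=bool)
    occ[tuple(idx[inside].T)] = True
    return occ
```

Ray filtering would then keep only occupied voxels traversed by camera rays through the pixels with missing depth, which is where the MLP predicts the absent values.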
{"title":"Voxel and deep learning based depth complementation for transparent objects","authors":"Jiaqi Li , Shuhuan Wen , Di Lu , Linxiang Li , Hong Zhang","doi":"10.1016/j.patrec.2025.04.003","DOIUrl":"10.1016/j.patrec.2025.04.003","url":null,"abstract":"<div><div>For the problem of missing depth values of transparent objects in depth-channel captured by RGB-D camera, a voxel-based deep learning depth-completion algorithm for transparent objects is proposed. We mapped the image to the 3D voxel space, calculated the effective point cloud according to the input depth map, and obtained the occupied voxels by the boundary test method. Combined with the camera ray direction, the occupied voxels are filtered for the voxels that intersect the camera ray. Using the image features contained in the RGB image and the valid points in the intersecting voxels calculated from the point cloud image, the multi-layer perception is applied to predict the missing channel of the object, and under the constraint of surface normal consistency, the depth value is optimized. The proposed algorithm achieves improvements of 12.55%, 0.6%, and 1.63% over ClearGrasp in the metrics <span><math><msub><mrow><mi>δ</mi></mrow><mrow><mn>1</mn><mo>.</mo><mn>05</mn></mrow></msub></math></span>, <span><math><msub><mrow><mi>δ</mi></mrow><mrow><mn>1</mn><mo>.</mo><mn>10</mn></mrow></msub></math></span>, and <span><math><msub><mrow><mi>δ</mi></mrow><mrow><mn>1</mn><mo>.</mo><mn>25</mn></mrow></msub></math></span>, respectively.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 14-20"},"PeriodicalIF":3.9,"publicationDate":"2025-04-15","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143843762","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pedestrian detection based on vision-language semantics with global adaptive adjustment
Pub Date : 2025-04-12 | DOI: 10.1016/j.patrec.2025.03.030
Yijing Guo, Fuhang Li, Yi Qiu, Pengyu Xu, Kunhua Li
Pedestrian detection is a primary task of automated driving and intelligent video surveillance systems. Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision (VLPD) greatly improves the detection accuracy of single-stage pedestrian detectors. Meanwhile, to maintain inference speed, VLPD adopts ResNet-50 as its backbone network, which poses a significant limitation for single-stage detectors that must predict categories and regress bounding boxes directly on feature maps. To tap the representational potential of CNNs, we propose a novel simplified architectural unit, the Channel and Spatial Global Pooling Attention (GPA) module, which integrates channel-activation and spatial-weight attention maps through parallel computation to achieve adaptive refinement of backbone output feature maps. Furthermore, we optimize the module structure of VLPD's self-supervised prototype semantic contrast method, significantly enhancing the detector's ability to discriminate and detect pedestrians in complex urban street environments. With only a 0.2 FPS decrease in inference speed, the miss rates on the Heavy Occlusion and Reasonable subsets of the CityPersons dataset are reduced by 2.41% and 0.72%, respectively, achieving state-of-the-art (SOTA) performance for single-stage detectors on this dataset. On the Heavy Occlusion and All subsets of the Caltech dataset, the miss rates decrease by 2.90% and 0.80%, respectively. Without using additional data, this method rivals the detection accuracy of two-stage detectors.
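A minimal sketch of a parallel channel + spatial global-pooling attention unit in the spirit of the GPA module described above; the exact layer composition is an assumption (the paper's design may differ):

```python
import torch
import torch.nn as nn

class GPA(nn.Module):
    """Parallel channel and spatial attention from global pooling (illustrative)."""
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.channel_fc = nn.Sequential(
            nn.Linear(channels, channels // reduction), nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels), nn.Sigmoid())
        self.spatial_conv = nn.Sequential(
            nn.Conv2d(2, 1, kernel_size=7, padding=3), nn.Sigmoid())

    def forward(self, x):                          # x: (B, C, H, W)
        b, c, _, _ = x.shape
        ca = self.channel_fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)   # channel weights
        pooled = torch.cat([x.mean(dim=1, keepdim=True),
                            x.amax(dim=1, keepdim=True)], dim=1)    # (B, 2, H, W)
        sa = self.spatial_conv(pooled)                              # spatial weights
        return x * ca * sa                         # parallel refinement of the feature map
```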
{"title":"Pedestrian detection based on vision-language semantics with global adaptive adjustment","authors":"Yijing Guo , Fuhang Li , Yi Qiu , Pengyu Xu , Kunhua Li","doi":"10.1016/j.patrec.2025.03.030","DOIUrl":"10.1016/j.patrec.2025.03.030","url":null,"abstract":"<div><div>Pedestrian detection is the primary task of automated driving and intelligent video surveillance systems. Context-Aware Pedestrian Detection via Vision-Language Semantic Self-Supervision (VLPD) greatly improves the detection accuracy of single-stage pedestrian detectors. Meanwhile, to maintain reasoning speed, VLPD adopts ResNet-50 as its backbone network, which undoubtedly poses a significant limitation for single-stage detectors that require direct category prediction and bounding box regression on feature maps. To tap into the potential of CNNs in representation capability, we propose a novel simplified architectural unit, the Channel and Spatial <strong>G</strong>lobal <strong>P</strong>ooling <strong>A</strong>ttention Module (GPA), which integrates activation channels and spatial weights attention maps through parallel computation to achieve adaptive feature refinement of backbone output feature maps. Furthermore, we optimize the module structure of the VLPD self-supervised prototype semantic contrast method, significantly enhancing the detector’s ability to discriminate and detect pedestrians in complex urban street environments. With only a 0.2FPS decrease in reasoning speed, the miss rates on the Heavy Occlusion subsets and Reasonable subsets of the Citypersons dataset are reduced by 2.41% and 0.72%, respectively, achieving state-of-the-art (SOTA) performance for single-stage detectors on this dataset. On the Heavy Occlusion subset and the All subset of the Caltech dataset, the performance decreased by 2.90% and 0.80%, respectively. Without using additional data, this method can rival the detection accuracy of two-stage detectors.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 8-13"},"PeriodicalIF":3.9,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143835059","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Kinship verification via Frequency Feature Decoupling and Fusion
Pub Date : 2025-04-12 | DOI: 10.1016/j.patrec.2025.04.002
Shuofeng Sun, Yaohan Yang, Haibin Yan
In this paper, we propose a new Frequency Feature Decoupling and Fusion Network (FDFN) for robust kinship verification. Our approach begins with a multi-scale fusion module designed to acquire features with enhanced discriminative power, which are then decoupled into high-frequency and low-frequency components. High-frequency features focus on the local details of the face, while low-frequency features emphasize the overall structural information. Furthermore, we introduce a hybrid spatial attention module to refine the high-frequency features, allowing the model to concentrate on more important facial regions. At the same time, a hybrid channel attention module is employed to optimize the low-frequency features, enabling the model to attend to the more significant feature channels within the overall structure. Finally, a fusion module combines the refined high- and low-frequency features to produce the final image representation. Our method effectively resolves the conflict between local details and global structure, optimizing each aspect separately to obtain more discriminative facial features. Experimental results on the FIW and KinFace datasets demonstrate that our approach outperforms baseline methods, establishing a robust foundation for kinship verification tasks and advancing fine-grained image analysis in computer vision.
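A minimal sketch of one way to decouple a feature map into low- and high-frequency components (low = local average, high = residual) and refine each with its own attention; this particular decomposition and the attention forms are illustrative assumptions, not necessarily FDFN's:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class FrequencyDecouple(nn.Module):
    """Split features into low-frequency (structure) and high-frequency (detail)."""
    def __init__(self, channels):
        super().__init__()
        self.spatial_att = nn.Sequential(nn.Conv2d(channels, 1, 1), nn.Sigmoid())
        self.channel_att = nn.Sequential(
            nn.AdaptiveAvgPool2d(1), nn.Conv2d(channels, channels, 1), nn.Sigmoid())
        self.fuse = nn.Conv2d(2 * channels, channels, 1)

    def forward(self, x):                               # x: (B, C, H, W)
        low = F.avg_pool2d(x, 3, stride=1, padding=1)   # smoothed = low frequency
        high = x - low                                  # residual = high frequency
        high = high * self.spatial_att(high)            # focus on facial detail regions
        low = low * self.channel_att(low)               # focus on structural channels
        return self.fuse(torch.cat([high, low], dim=1))
```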
{"title":"Kinship verification via Frequency Feature Decoupling and Fusion","authors":"Shuofeng Sun , Yaohan Yang , Haibin Yan","doi":"10.1016/j.patrec.2025.04.002","DOIUrl":"10.1016/j.patrec.2025.04.002","url":null,"abstract":"<div><div>In this paper, we propose a new Frequency Feature Decoupling and Fusion Network (FDFN) method for robust kinship verification. Our approach begins with a multi-scale fusion module designed to acquire features with enhanced discriminative power, which are then decoupled into high-frequency and low-frequency components. High-frequency features focus on the local details of the face, while low-frequency features emphasize the overall structural information. Furthermore, we introduce a hybrid spatial attention module to refine the high-frequency features, allowing the model to concentrate on more important facial regions. At the same time, the hybrid channel attention module is employed to optimize the low-frequency features, enabling the model to pay attention to the more significant feature channels within the overall structure. Finally, a fusion module then combines the refined high and low-frequency features to produce the final image representation. Our method effectively resolves the conflict between local details and global structure, optimizing each aspect separately to obtain more discriminative facial features. Experimental results on the FIW and Kinface datasets demonstrate that our approach achieves superior performance compared to baseline methods, establishing a robust foundation for kinship verification tasks and advancing the state of fine-grained image analysis in computer vision.</div></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"193 ","pages":"Pages 1-7"},"PeriodicalIF":3.9,"publicationDate":"2025-04-12","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143835189","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}