Neural network-based framework for wide visibility dehazing with synthetic benchmarks
Lin Li, Ru Lei, Kun Zhang, Lingchen Sun, Rustam Stolkin
Pub Date: 2026-01-07, DOI: 10.1016/j.patcog.2026.113056 (Pattern Recognition, Vol. 175, Article 113056)
Haze caused by atmospheric scattering significantly degrades image visibility and the performance of computer vision systems, especially in long-range applications. Existing synthetic haze datasets are usually limited to short visibility ranges and fail to adequately model wavelength-dependent scattering effects, leading to suboptimal evaluation of dehazing algorithms. In this study, we propose a physically motivated synthesis method that combines the atmospheric scattering model with channel-specific extinction coefficients for the RGB channels and depth information ranging from 0 to 10 km. This approach enables the construction of the Wide Visibility Synthetic Haze (WVSH) dataset, which spans visibility distances from 50 m to 2 km. Based on WVSH, we design WVDehazeNet, a convolutional neural network that effectively leverages multi-scale spatial features and wavelength-dependent haze priors. Extensive experiments on both WVSH and real-world hazy images demonstrate that WVDehazeNet achieves competitive or superior performance compared with eight state-of-the-art methods in both quantitative and qualitative evaluations. The WVSH dataset and WVDehazeNet provide valuable benchmarks and references for long-range image dehazing research, helping to advance the field.
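As a concrete illustration of the synthesis described above, the sketch below applies the standard atmospheric scattering model I(x) = J(x)·t(x) + A·(1 − t(x)) with per-channel transmission t_c(x) = exp(−β_c·d(x)), where d is metric depth. The per-channel scaling factors, the airlight value, and the Koschmieder-style relation β ≈ 3.912/V between extinction and visibility V are illustrative assumptions, not the exact WVSH settings.

# Minimal sketch of physically based haze synthesis under the standard
# atmospheric scattering model; values below are illustrative, not WVSH's.
import numpy as np

def synthesize_haze(clean_rgb, depth_m, visibility_m=500.0,
                    channel_scale=(0.95, 1.00, 1.05), airlight=0.92):
    """clean_rgb: HxWx3 float image in [0, 1]; depth_m: HxW depth in meters."""
    beta = 3.912 / visibility_m                      # base extinction from visibility (Koschmieder relation)
    t = np.stack([np.exp(-beta * s * depth_m)        # wavelength-dependent transmission per channel
                  for s in channel_scale], axis=-1)  # (R, G, B): shorter wavelengths scatter slightly more
    hazy = clean_rgb * t + airlight * (1.0 - t)      # scattering model: attenuation + airlight
    return np.clip(hazy, 0.0, 1.0)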
FST: Improving adversarial robustness via feature similarity-based targeted adversarial training
Yibo Xu, Dawei Zhou, Decheng Liu, Nannan Wang, Xinbo Gao
Pub Date: 2026-01-07, DOI: 10.1016/j.patcog.2025.113010 (Pattern Recognition, Vol. 175, Article 113010)
Deep learning models have been found to be vulnerable to adversarial noise. Adversarial training is a major defense strategy to mitigate the interference caused by adversarial noise. However, the correlations between different categories in the model's deep feature space have not been fully considered in adversarial training. Our multi-perspective investigations indicate that adversarial noise can disrupt these correlations, resulting in undesirably close inter-class feature distances and large intra-class feature distances, thus degrading accuracy. To solve this problem, we propose Feature Similarity-based Targeted adversarial training (FST), which guides the model to learn an appropriate feature distribution among categories under adversarial conditions so that it can make rational decisions. Specifically, we first design a Feature Obfuscation Attack to obfuscate the natural state of feature similarity among categories, which is then leveraged to generate specific adversarial training examples. Next, we construct target feature similarity matrices as supervision information to prompt the model to learn clean deep features for adversarial data and thereby achieve accurate classification. The target matrix is initialized from the features learned from natural examples by a naturally pre-trained model. To further enhance the feature similarity between examples of the same category, we directly assign the highest similarity value to the same-category region of the target matrix. Experimental results on popular datasets show the superior performance of our method, and ablation studies demonstrate the effectiveness of the designed modules.
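A minimal, hedged sketch of the target feature similarity matrix idea described above: class-prototype features from a naturally pre-trained model define the target inter-class similarities, the same-category region receives the highest similarity value, and the features of adversarial examples are pushed toward that target. The cosine-similarity choice, the MSE alignment loss, and all names are illustrative assumptions, not the exact FST formulation.

# Sketch only: target similarity matrix from a clean pre-trained model and an
# alignment loss for adversarial features; not the paper's implementation.
import torch
import torch.nn.functional as F

def target_similarity_matrix(class_mean_feats):
    """class_mean_feats: (num_classes, dim) prototypes from a naturally pre-trained model."""
    f = F.normalize(class_mean_feats, dim=1)
    target = f @ f.t()                        # cosine similarity between class prototypes
    target.fill_diagonal_(1.0)                # highest similarity for the same-category region
    return target

def similarity_alignment_loss(adv_feats, labels, target):
    """adv_feats: (batch, dim) features of adversarial examples; labels: (batch,)."""
    f = F.normalize(adv_feats, dim=1)
    sim = f @ f.t()                           # pairwise similarity within the current batch
    tgt = target[labels][:, labels]           # expected similarity for each label pair
    return F.mse_loss(sim, tgt)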
HiViTrack: Hierarchical vision transformer with efficient target-prompt update for visual object tracking
Yang Fang, Yujie Hu, Bailian Xie, Yujie Wang, Zongyi Xu, Weisheng Li, Xinbo Gao
Pub Date: 2026-01-07, DOI: 10.1016/j.patcog.2025.112992 (Pattern Recognition, Vol. 175, Article 112992)
Transformer tracking methods have become the mainstream tracking paradigm due to their excellent ability to capture global context and long-range dependencies. Among them, plain Transformer trackers directly divide the image into 16 × 16 patches to shorten the token sequence and reduce computational complexity; this is efficient, but performance is limited by single-scale feature learning and relational modeling. In contrast, hierarchical Transformer trackers learn both low-level details and high-level semantics hierarchically and show stronger tracking performance, but they typically introduce complicated and asymmetric attention operations. To this end, this paper proposes a simple yet powerful hierarchical Transformer tracking framework, HiViTrack, which enjoys both the efficiency of plain models and the strong representation capabilities of hierarchical models. Specifically, HiViTrack consists mainly of the following modules. First, a two-stage shallow spatial detail retention (SSDR) module efficiently captures shallow spatial details to facilitate accurate target localization. Next, a two-stage deep semantic mutual integration (DSMI) module is designed to simultaneously modulate and integrate high-level semantics to enhance discrimination ability and model robustness. Then, the proposed target-prompt update (TPU) mechanism first applies template scoring attention to rank the historical templates, followed by target-prompt attention to generate a target-aware token, before feeding the enriched features into the prediction head. Experimental results on six datasets demonstrate that the proposed HiViTrack achieves state-of-the-art (SOTA) performance while maintaining real-time efficiency, establishing a strong baseline for hierarchical Transformer tracking. Code will be available at https://github.com/huyj2001-ship-it/HiViTrack.
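The sketch below gives one plausible, simplified reading of the TPU mechanism described above: historical template tokens are scored against the current search representation, the top-ranked templates are kept, and target-prompt attention over them produces a target-aware token. All names and the exact attention form are assumptions rather than the paper's implementation.

# Hedged sketch of template scoring followed by target-prompt attention.
import torch
import torch.nn.functional as F

def target_prompt_update(search_token, template_tokens, keep=3):
    """search_token: (1, dim); template_tokens: (num_templates, dim)."""
    scores = (template_tokens @ search_token.t()).squeeze(1)          # template scoring
    top = template_tokens[scores.topk(min(keep, len(scores))).indices]
    attn = F.softmax(top @ search_token.t() / search_token.shape[1] ** 0.5, dim=0)
    target_aware_token = (attn * top).sum(dim=0, keepdim=True)        # target-prompt attention
    return target_aware_token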
A Bayesian deep prior-based quaternion matrix completion for color image inpainting
Jin-Ping Zou, Huan Ren, Hongjia Chen, Xu-Yun Xu, Xiang Wang
Pub Date: 2026-01-07, DOI: 10.1016/j.patcog.2026.113054 (Pattern Recognition, Vol. 175, Article 113054)
Color image inpainting, which aims to reconstruct missing regions from the available information, plays an important role in computer vision. Existing quaternion-based deep inpainting methods often struggle to restore both global structure and natural textures, especially when only a single corrupted image is available for training. To address these challenges, we propose BQAE-TV, a novel model that integrates a quaternion fully connected network to capture global features while incorporating total variation regularization to optimize quaternion matrix completion, producing structurally coherent and visually natural images. Furthermore, a Bayesian inference mechanism is employed to regularize the deep image prior and mitigate overfitting. Experiments demonstrate that BQAE-TV outperforms both traditional and state-of-the-art methods in terms of visual quality and quantitative metrics, validating its effectiveness and robustness.
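For reference, the total variation regularizer mentioned above can be written, in its anisotropic form, as the mean absolute difference between neighboring pixels; the channel-wise version below is an illustrative stand-in for the quaternion formulation used in BQAE-TV.

# Anisotropic total variation as a smoothness regularizer (channel-wise sketch).
import torch

def total_variation(img):
    """img: (C, H, W) reconstructed image tensor."""
    dh = (img[:, 1:, :] - img[:, :-1, :]).abs().mean()   # vertical neighbor differences
    dw = (img[:, :, 1:] - img[:, :, :-1]).abs().mean()   # horizontal neighbor differences
    return dh + dw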
3D temporal-spatial convolutional LSTM network for assessing drug addiction treatment
Haiping Ma, Jiyuan Huang, Chenxu Shen, Jin Liu, Qingming Liu
Pub Date: 2026-01-06, DOI: 10.1016/j.patcog.2026.113059 (Pattern Recognition, Vol. 175, Article 113059)
Drug addiction (DA) is a chronic and relapsing brain disorder with limited effective treatments. The combined use of repetitive transcranial magnetic stimulation and electroencephalography (rTMS-EEG) presents a highly promising approach for DA treatment. This paper proposes an effective 3D temporal-spatial convolutional long short-term memory (LSTM) network for DA assessment using rTMS-EEG signals. First, the multi-channel EEG time series recorded after rTMS treatment are converted into multiple topomaps with non-uniform sample times to enhance the spatial features of rTMS-EEG signals. These topomaps are then sequentially fed into a convolutional module to extract spatial features of brain activity under DA conditions. Next, considering the temporal correlation of rTMS-EEG signals, an LSTM module is introduced to adaptively capture significant sequential time information. Further, a contrastive loss function is defined to reinforce the temporal-spatial features, thereby enhancing DA assessment. Finally, to evaluate the performance of the proposed network, the first rTMS-EEG dataset for DA treatment is constructed. The results of extensive experiments indicate that the α and β rhythms are likely to be major brain physiological markers of DA disorder, and that rTMS is a safe and effective treatment for DA. Meanwhile, the proposed network achieves assessment accuracies of 85% and 83% for sham/pre-DA subjects and pre/post-DA subjects, respectively, outperforming several existing approaches.
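A hedged sketch of the overall pipeline described above: a small CNN extracts spatial features from each EEG topomap and an LSTM models the temporal sequence before classification. The layer sizes, topomap resolution, and two-class head are placeholders rather than the paper's configuration, and the contrastive loss is omitted.

# Sketch only: per-topomap CNN features followed by an LSTM over the sequence.
import torch
import torch.nn as nn

class TopomapConvLSTM(nn.Module):
    def __init__(self, num_classes=2, hidden=64):
        super().__init__()
        self.cnn = nn.Sequential(                     # spatial features per topomap
            nn.Conv2d(1, 16, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(16, 32, 3, padding=1), nn.ReLU(), nn.AdaptiveAvgPool2d(4),
            nn.Flatten())
        self.lstm = nn.LSTM(32 * 4 * 4, hidden, batch_first=True)
        self.head = nn.Linear(hidden, num_classes)

    def forward(self, topomaps):
        """topomaps: (batch, time, 1, H, W) sequence of scalp maps."""
        b, t = topomaps.shape[:2]
        feats = self.cnn(topomaps.flatten(0, 1)).view(b, t, -1)
        out, _ = self.lstm(feats)                     # temporal modeling over the sequence
        return self.head(out[:, -1])                  # classify from the last time step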
Pairwise joint symmetric uncertainty based on macro-neighborhood entropy for heterogeneous feature selection
Zhilin Zhu, Jianhua Dai
Pub Date: 2026-01-06, DOI: 10.1016/j.patcog.2026.113051 (Pattern Recognition, Vol. 175, Article 113051)
High-dimensional heterogeneous data often contain redundant and irrelevant features, hindering pattern recognition and data mining. Feature selection enhances data quality and model generalization by eliminating redundant features. Although information entropy is effective for symbolic data, heterogeneous datasets with both symbolic and numerical features pose new challenges. The neighborhood rough set (NRS) model provides a solution, but existing NRS-based methods suffer from non-monotonicity in their entropy and mutual information measures and from insufficient redundancy handling. To address these problems, we propose a macro-neighborhood entropy framework with monotonic measures and a Pairwise Joint Symmetric Uncertainty (PJSU) method that jointly evaluates decision relevance and feature redundancy. Experiments conducted on 15 benchmark datasets using the Naive Bayes (NB) and CART classifiers demonstrate that PJSU achieves the best performance, with accuracies of 84.61% on NB and 83.00% on CART. These results represent improvements of 14.38% and 4.89%, respectively, compared with the original datasets. Meanwhile, the average dimensionality is effectively reduced from 5390.8 to 5.67 and 6.27 for the two classifiers, respectively. These results demonstrate the effectiveness of the proposed method for heterogeneous feature selection.
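For context, the classical symmetric uncertainty that PJSU builds on is SU(X, Y) = 2·I(X; Y) / (H(X) + H(Y)); the sketch below computes it for two discrete variables. The macro-neighborhood extension to numerical features and the pairwise joint evaluation are not reproduced here.

# Symmetric uncertainty for two discrete variables (classical definition).
import numpy as np

def entropy(x):
    _, counts = np.unique(x, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def joint_entropy(x, y):
    pairs = np.stack([x, y], axis=1)
    _, counts = np.unique(pairs, axis=0, return_counts=True)
    p = counts / counts.sum()
    return -(p * np.log2(p)).sum()

def symmetric_uncertainty(x, y):
    hx, hy = entropy(x), entropy(y)
    mi = hx + hy - joint_entropy(x, y)      # I(X; Y) = H(X) + H(Y) - H(X, Y)
    return 0.0 if hx + hy == 0 else 2.0 * mi / (hx + hy)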
FG-MoE: Heterogeneous mixture of experts model for fine-grained visual classification
Songming Yang, Jing Wen, Bin Fang
Pub Date: 2026-01-05, DOI: 10.1016/j.patcog.2026.113050 (Pattern Recognition, Vol. 175, Article 113050)
Fine-grained visual classification (FGVC) is a challenging task due to subtle inter-class differences and significant intra-class variations. Most existing approaches struggle to simultaneously capture multi-level discriminative features and effectively integrate complementary visual information. To address these challenges, we propose the Fine-Grained Mixture of Experts (FG-MoE), a novel heterogeneous mixture-of-experts model for fine-grained visual classification. Our approach introduces a specialized multi-scale pyramid module that aggregates multi-scale information and enhances feature representation through spatial and channel attention mechanisms. Inspired by neuroscientific insights into the visual processing mechanisms of the human brain, FG-MoE employs five specialized experts that focus on different visual cues: global structures, regional semantics, local details, textures, and part-level interactions. A spatial-aware gating mechanism dynamically selects appropriate expert combinations for each input image. We further design a novel multi-stage training strategy and employ balance constraints along with diversity and orthogonality regularization to ensure balanced learning and promote diverse expert specialization. The final classification leverages fused features from all selected experts. Extensive experiments on three widely used FGVC datasets demonstrate that FG-MoE achieves substantial performance improvements over backbone models and establishes state-of-the-art results across all these benchmarks, validating the effectiveness and robustness of our approach.
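The sketch below shows a generic mixture-of-experts layer with a learned gate that routes each input to a subset of experts and fuses their weighted outputs, which is the basic mechanism FG-MoE specializes; the five visual-cue experts, the spatial-aware gating, and the balance/diversity regularizers are abstracted away, and all names here are illustrative.

# Generic top-k mixture-of-experts sketch (not the FG-MoE architecture).
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleMoE(nn.Module):
    def __init__(self, dim, num_experts=5, top_k=2):
        super().__init__()
        self.experts = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))
             for _ in range(num_experts)])
        self.gate = nn.Linear(dim, num_experts)
        self.top_k = top_k

    def forward(self, x):
        """x: (batch, dim) pooled image features."""
        logits = self.gate(x)
        weights, idx = logits.topk(self.top_k, dim=-1)        # choose experts per sample
        weights = F.softmax(weights, dim=-1)
        out = torch.zeros_like(x)
        for slot in range(self.top_k):
            for e, expert in enumerate(self.experts):
                mask = idx[:, slot] == e                       # samples routed to expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(1) * expert(x[mask])
        return out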
FocalGaussian: Improving text-driven 3D human generation with body part focus
Yifan Yang, Zeshuai Deng, Dong Liu, Zixiong Huang, Kai Zhou, Hailin Luo, Qing Du, Mingkui Tan
Pub Date: 2026-01-05, DOI: 10.1016/j.patcog.2025.112923 (Pattern Recognition, Vol. 175, Article 112923)
Text-driven 3D human generation significantly reduces manual labor for professionals and enables non-professionals to create 3D assets, facilitating applications across various fields such as digital games, advertising, and films. Conventional methods usually follow the paradigm of optimizing 3D representations such as neural radiance fields and 3D Gaussian Splatting via Score Distillation Sampling (SDS) with a diffusion model. However, existing methods struggle to generate delicate and 3D-consistent human body parts, primarily because they neglect stable topology control and precise local view control. Our key idea is to focus on the critical components of the human body parts to impose precise control while optimizing the 3D model. Following this, we propose FocalGaussian. Specifically, to generate delicate body parts, we propose a focal depth loss that recovers delicate human body parts by aligning the depth of local body parts in the 3D human model and SMPL-X at local and global scales. Moreover, to achieve 3D-consistent local body parts, we propose a focal view-dependent SDS that emphasizes key body-part features and provides finer control over local geometry. Extensive experiments demonstrate the superiority of our FocalGaussian across a variety of prompts. Critically, our generated 3D humans accurately capture complex features of human body parts, particularly the hands. For more results, please see our project page.
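A hedged sketch of a depth-alignment objective in the spirit of the focal depth loss described above: the rendered depth of the generated human is compared with the SMPL-X depth both globally and inside body-part regions such as the hands. The box format, the L1 penalty, and the weighting are illustrative assumptions, not the paper's exact loss.

# Sketch only: global plus body-part depth alignment against an SMPL-X depth map.
import torch
import torch.nn.functional as F

def focal_depth_loss(rendered_depth, smplx_depth, part_boxes, local_weight=2.0):
    """rendered_depth, smplx_depth: (H, W); part_boxes: list of (y0, y1, x0, x1) regions."""
    loss = F.l1_loss(rendered_depth, smplx_depth)              # global-scale alignment
    for y0, y1, x0, x1 in part_boxes:                          # local scale, e.g. hands, face
        loss = loss + local_weight * F.l1_loss(
            rendered_depth[y0:y1, x0:x1], smplx_depth[y0:y1, x0:x1])
    return loss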
HybridCount: Multi-scale transformer with knowledge distillation for object counting
Jayanthan K S, Domnic S
Pub Date: 2026-01-05, DOI: 10.1016/j.patcog.2026.113043 (Pattern Recognition, Vol. 175, Article 113043)
This work introduces a novel architecture that integrates a multi-scale Vision Transformer (ViT) encoder with a graph attention network decoder to model contextual relationships in visual scenes. Our approach achieves real-time, parameter-efficient object counting through an innovative knowledge distillation framework that integrates density estimation maps with regression-based counting mechanisms. The distillation process optimizes performance through a three-component loss function: an encoder loss, a decoder loss, and our proposed Dual-Domain Density-Regression Loss (DD-R Loss). This novel loss formulation simultaneously supervises both the spatial density distribution and direct count regression, providing complementary learning signals for robust object quantification. A key contribution is our scale-aware token embedding technique and cross-attention fusion across varying receptive fields within the ViT architecture, enabling precise counting in cluttered visual environments. Experiments are conducted on four crowd-counting datasets and two vehicle-counting datasets. Our detailed experimental evaluation shows that the proposed method delivers outcomes comparable to state-of-the-art (SOTA) methods in terms of counting accuracy and density estimation precision. The detailed comparisons presented in our results and discussion sections highlight the significant strengths and advantages of our methodology within the challenging domain of visual object counting. Our framework bridges the gap between the representational power of transformer-based models and graph network architectures. The efficiency of our approach enables real-time performance comparable to other CNN-based approaches. This combination delivers a comprehensive solution for object counting tasks that performs effectively even in resource-constrained environments.
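A hedged sketch of a dual-domain objective in the spirit of the DD-R Loss: one term supervises the predicted density map in the spatial domain, and another supervises the scalar count obtained by integrating it. The specific losses, the weighting, and the separate encoder/decoder distillation terms of HybridCount are not reproduced here.

# Sketch only: density-map supervision plus count regression from its integral.
import torch
import torch.nn.functional as F

def dd_r_loss(pred_density, gt_density, lambda_count=0.1):
    """pred_density, gt_density: (batch, 1, H, W) density maps."""
    density_term = F.mse_loss(pred_density, gt_density)          # spatial density distribution
    pred_count = pred_density.sum(dim=(1, 2, 3))                 # integrate map into a count
    gt_count = gt_density.sum(dim=(1, 2, 3))
    count_term = F.l1_loss(pred_count, gt_count)                 # direct count regression
    return density_term + lambda_count * count_term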
LTSTrack: Visual tracking with long-term temporal sequence
Zhaochuan Zeng, Shilei Wang, Yidong Song, Zhenhua Wang, Jifeng Ning
Pub Date: 2026-01-04, DOI: 10.1016/j.patcog.2026.113052 (Pattern Recognition, Vol. 175, Article 113052)
The utilization of temporal sequences is crucial for tracking in complex scenarios, particularly when addressing challenges such as occlusion and deformation. However, existing methods are often constrained by limitations such as the use of unrefined raw images or computationally expensive temporal fusion modules, both of which restrict the scale of temporal sequences that can be utilized. This study proposes a novel appearance compression strategy and a temporal feature fusion module, which together significantly enhance the tracker’s ability to utilize long-term temporal sequences. Based on these designs, we propose a tracker that can leverage a Long-term Temporal Sequence containing historical context across 300 frames, which we name LTSTrack. First, we present a simple yet effective appearance compression strategy to extract target appearance features from each frame and compress them into compact summary tokens, which constitute a long-term temporal sequence. Then, the Mamba block is introduced to efficiently fuse the long-term temporal sequence, generating a fusion token containing the historical representation of the target. Finally, this fusion token is used to enhance the search-region features, thereby achieving more accurate tracking. Extensive experiments demonstrate that the proposed method achieves significant performance improvements across the GOT-10K, TrackingNet, TNL2K, LaSOT, UAV123 and LaSOT-ext datasets. Notably, it achieves remarkable scores of 75.1% AO on GOT-10K and 84.6% AUC on TrackingNet, substantially outperforming previous state-of-the-art methods.
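The sketch below illustrates the long-term temporal sequence idea under stated assumptions: each frame's target appearance is compressed into a single summary token, a rolling buffer keeps up to 300 of them, and a fusion step produces one token for enhancing the search features. Plain scaled dot-product attention stands in for the Mamba block, and the mean-pooling compression is only a placeholder for the paper's appearance compression strategy.

# Sketch only: rolling summary-token buffer with attention-based fusion
# (attention replaces the Mamba block; pooling replaces the learned compressor).
from collections import deque
import torch
import torch.nn.functional as F

class TemporalTokenBuffer:
    def __init__(self, max_len=300):
        self.tokens = deque(maxlen=max_len)      # long-term temporal sequence

    def push(self, target_feats):
        """target_feats: (num_patches, dim) appearance features of the tracked target."""
        self.tokens.append(target_feats.mean(dim=0))   # compress frame to one summary token

    def fuse(self, query):
        """query: (1, dim) current search representation -> (1, dim) fusion token.
        Assumes at least one frame has been pushed."""
        seq = torch.stack(list(self.tokens))            # (T, dim) stacked summary tokens
        attn = F.softmax(query @ seq.t() / seq.shape[1] ** 0.5, dim=-1)
        return attn @ seq                               # fusion token with historical context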