Partial multi-label feature selection with feature noise
Pub Date: 2025-01-13 | DOI: 10.1016/j.patcog.2024.111310
You Wu , Peipei Li , Yizhang Zou
As the dimensionality of multi-label data continues to increase, feature selection has become increasingly prevalent in multi-label learning, serving as an efficient and interpretable means of dimensionality reduction. However, existing multi-label feature selection algorithms often assume the data to be noise-free, an assumption that rarely holds in real-world applications where feature and label noise are frequently encountered. Therefore, we propose a novel partial multi-label feature selection algorithm that aims to effectively select an optimal subset of features in environments plagued by feature noise and partial multi-label annotations. Specifically, we first propose a robust label enhancement model to diminish noise interference and enrich the semantic information of labels. Subsequently, sparse reconstruction is utilized to learn instance relevance information, which is then applied under the smoothness assumption to obtain more accurate label distributions. Additionally, we employ the ℓ2,1-norm to eliminate irrelevant features and constrain the model complexity. Finally, the above processing is optimized end-to-end within a unified objective function. Experimental results demonstrate that our algorithm outperforms several state-of-the-art feature selection methods across 15 datasets.
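For reference, a generic ℓ2,1-regularized feature selection objective of the kind this abstract alludes to can be written as follows; this is a standard illustrative form, not necessarily the paper's exact unified objective:

\min_{W}\ \lVert XW - Y \rVert_F^2 + \lambda \lVert W \rVert_{2,1},
\qquad
\lVert W \rVert_{2,1} = \sum_{i=1}^{d} \sqrt{\textstyle\sum_{j=1}^{c} W_{ij}^2},

where X \in \mathbb{R}^{n\times d} holds the (noisy) features, Y \in \mathbb{R}^{n\times c} the enhanced label distributions, and rows of W driven toward zero by the row-sparse ℓ2,1 penalty mark features that can be discarded.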
{"title":"Partial multi-label feature selection with feature noise","authors":"You Wu , Peipei Li , Yizhang Zou","doi":"10.1016/j.patcog.2024.111310","DOIUrl":"10.1016/j.patcog.2024.111310","url":null,"abstract":"<div><div>As the dimensionality of multi-label data continues to increase, feature selection has become increasingly prevalent in multi-label learning, serving as an efficient and interpretable means of dimensionality reduction. However, existing multi-label feature selection algorithms often assume data to be noise-free, which cannot hold in real-world applications where feature and label noise are frequently encountered. Therefore, we propose a novel partial multi-label feature selection algorithm, which aims to effectively select an optimal subset of features in the environment plagued by feature noise and partial multi-label. Specifically, we first propose a robust label enhancement model to diminish noise interference and enrich the semantic information of labels. Subsequently, a sparse reconstruction is utilized to learn the instance relevance information and then applied to the smoothness assumption to obtain more accurate label distributions. Additionally, we employ the <span><math><msub><mrow><mi>ℓ</mi></mrow><mrow><mn>2</mn><mo>,</mo><mn>1</mn></mrow></msub></math></span>-norm to eliminate irrelevant features and constrain the model complexity. Finally, the above processing is optimized end-to-end within a unified objective function. Experimental results demonstrate that our algorithm outperforms several state-of-the-art feature selection methods across 15 datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111310"},"PeriodicalIF":7.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143150551","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Learning better contrastive view from radiologist’s gaze
Pub Date: 2025-01-13 | DOI: 10.1016/j.patcog.2025.111350
Sheng Wang , Zihao Zhao , Zixu Zhuang , Xi Ouyang , Lichi Zhang , Zheren Li , Chong Ma , Tianming Liu , Dinggang Shen , Qian Wang
Recent advancements in self-supervised contrastive learning have shown significant benefits from utilizing a Siamese network architecture, which focuses on reducing the distances between similar (positive) pairs of data. These methods often employ random data augmentations on input images, with the expectation that these augmented views of the same image will be recognized as similar and thus positively paired. However, this approach of random augmentation may not fully consider the semantics of the image, potentially reducing the quality of the augmented images for contrastive learning. This challenge is particularly pronounced in the domain of medical imaging, where disease-related anomalies can be subtle and easily corrupted. In this study, we first show that for commonly used X-ray images, traditional augmentation techniques employed in contrastive pre-training can negatively impact the performance of subsequent diagnostic or classification tasks. To address this, we introduce a novel augmentation method, i.e., FocusContrast, to learn from radiologists’ gaze during diagnosis and generate contrastive views with guidance from radiologists’ visual attention. Specifically, we track the eye movements of radiologists to understand their visual attention while diagnosing X-ray images. This understanding allows a saliency prediction model to predict where a radiologist might focus when presented with a new image, guiding the attention-aware augmentation that maintains crucial details related to diseases. As a plug-and-play module, FocusContrast can enhance the performance of contrastive learning frameworks like SimCLR, MoCo, and BYOL. Our results show consistent improvements on datasets of knee X-rays and digital mammography, demonstrating the effectiveness of incorporating radiological expertise into the augmentation process for contrastive learning in medical imaging.
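As a rough illustration of gaze-guided augmentation of the kind FocusContrast performs, the sketch below rejection-samples crops that retain most of a predicted saliency mass; the function name and thresholds are hypothetical assumptions and this is not the authors' implementation:

import numpy as np

def saliency_guided_crop(image, saliency, crop_frac=0.6, keep_thresh=0.7):
    # image: (H, W, C) array; saliency: (H, W) non-negative map predicted from gaze.
    # Rejection-sample crop windows that retain at least keep_thresh of the saliency mass.
    H, W = saliency.shape
    ch, cw = int(H * crop_frac), int(W * crop_frac)
    total = saliency.sum() + 1e-8
    for _ in range(50):
        y = np.random.randint(0, H - ch + 1)
        x = np.random.randint(0, W - cw + 1)
        if saliency[y:y + ch, x:x + cw].sum() / total >= keep_thresh:
            return image[y:y + ch, x:x + cw]
    # Fallback: window centered on the saliency peak.
    py, px = np.unravel_index(saliency.argmax(), saliency.shape)
    y = int(np.clip(py - ch // 2, 0, H - ch))
    x = int(np.clip(px - cw // 2, 0, W - cw))
    return image[y:y + ch, x:x + cw]

Two such crops of the same radiograph would then form the positive pair fed to a SimCLR-, MoCo-, or BYOL-style framework.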
{"title":"Learning better contrastive view from radiologist’s gaze","authors":"Sheng Wang , Zihao Zhao , Zixu Zhuang , Xi Ouyang , Lichi Zhang , Zheren Li , Chong Ma , Tianming Liu , Dinggang Shen , Qian Wang","doi":"10.1016/j.patcog.2025.111350","DOIUrl":"10.1016/j.patcog.2025.111350","url":null,"abstract":"<div><div>Recent advancements in self-supervised contrastive learning have shown significant benefits from utilizing a Siamese network architecture, which focuses on reducing the distances between similar (positive) pairs of data. These methods often employ random data augmentations on input images, with the expectation that these augmented views of the same image will be recognized as similar and thus, positively paired. However, this approach of random augmentation may not fully consider the semantics of the image, potentially leading to a reduction in the quality of the augmented images for contrastive learning. This challenge is particularly pronounced in the domain of medical imaging, where disease-related anomalies can be subtle and easily corrupted. In this study, we initially show that for commonly used X-ray images, traditional augmentation techniques employed in contrastive pre-training can negatively impact the performance of subsequent diagnostic or classification tasks. To address this, we introduce a novel augmentation method, i.e., FocusContrast, to learn from radiologists’ gaze during diagnosis and generate contrastive views with guidance from radiologists’ visual attention. Specifically, we track the eye movements of radiologists to understand their visual attention while diagnosing X-ray images. This understanding allows the saliency prediction model to predict where a radiologist might focus when presented with a new image, guiding the attention-aware augmentation that maintains crucial details related to diseases. As a plug-and-play and module, FocusContrast can enhance the performance of contrastive learning frameworks like SimCLR, MoCo, and BYOL. Our results show consistent improvements on datasets of knee X-rays and digital mammography, demonstrating the effectiveness of incorporating radiological expertise into the augmentation process for contrastive learning in medical imaging.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111350"},"PeriodicalIF":7.5,"publicationDate":"2025-01-13","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149788","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Vision–language pre-training for graph-based handwritten mathematical expression recognition
Pub Date: 2025-01-10 | DOI: 10.1016/j.patcog.2025.111346
Hong-Yu Guo , Chuang Wang , Fei Yin , Xiao-Hui Li , Cheng-Lin Liu
Vision–language pre-training models have shown promise in improving various downstream tasks. However, handwritten mathematical expression recognition (HMER), as a typical structured learning problem, can hardly benefit from existing pre-training methods due to the presence of multiple symbols and complicated structural relationships, as well as the scarcity of paired data. To overcome these problems, we propose a Vision-Language Pre-training paradigm for Graph-based HMER (VLPG), utilizing unpaired mathematical expression images and LaTeX labels. Our HMER model is built upon a graph parsing method with superior explainability, which is enhanced by the proposed graph-structure aware transformer decoder. Based on this framework, a symbol localization pretext task and a language modeling task are employed for vision–language pre-training. First, we make use of unlabeled mathematical symbol images to pre-train the visual feature extractor through the localization pretext task, improving symbol localization and discrimination ability. Second, the structure understanding module is pre-trained on LaTeX corpora through a language modeling task, which promotes the model’s context comprehension ability. The pre-trained model is fine-tuned and aligned on the downstream HMER task using benchmark datasets. Experiments on public datasets demonstrate that the pre-training paradigm significantly improves mathematical expression recognition performance. Our VLPG achieves state-of-the-art performance on standard CROHME datasets and comparable performance on the HME100K dataset, highlighting the effectiveness and superiority of the proposed model. We release our code at https://github.com/guohy17/VLPG.
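The structure-understanding pre-training on LaTeX corpora is described as a language modeling task; a minimal masked-token sketch of such an objective is shown below, assuming a generic token encoder (the encoder interface here is a hypothetical stand-in, not VLPG's module):

import torch
import torch.nn as nn

def latex_mlm_loss(token_ids, encoder, mask_id, mask_prob=0.15):
    # token_ids: (B, T) LongTensor of LaTeX tokens; encoder maps token ids to
    # per-position vocabulary logits of shape (B, T, V). Hypothetical interface.
    mask = torch.rand(token_ids.shape, device=token_ids.device) < mask_prob
    inputs = token_ids.masked_fill(mask, mask_id)      # replace sampled tokens with [MASK]
    logits = encoder(inputs)
    return nn.functional.cross_entropy(logits[mask], token_ids[mask])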
{"title":"Vision–language pre-training for graph-based handwritten mathematical expression recognition","authors":"Hong-Yu Guo , Chuang Wang , Fei Yin , Xiao-Hui Li , Cheng-Lin Liu","doi":"10.1016/j.patcog.2025.111346","DOIUrl":"10.1016/j.patcog.2025.111346","url":null,"abstract":"<div><div>Vision–language pre-training models have shown promise in improving various downstream tasks. However, handwritten mathematical expression recognition (HMER), as a typical structured learning problem, can hardly benefit from existing pre-training methods due to the presence of multiple symbols and complicated structural relationships, as well as the scarcity of paired data. To overcome these problems, we propose a <strong>V</strong>ision-<strong>L</strong>anguage <strong>P</strong>re-training paradigm for <strong>G</strong>raph-based HMER (VLPG), utilizing unpaired mathematical expression images and LaTeX labels. Our HMER model is built upon a graph parsing method with superior explainability, which is enhanced by the proposed graph-structure aware transformer decoder. Based on this framework, the symbol localization pretext task and language modeling task are employed for vision–language pre-training. First, we make use of unlabeled mathematical symbol images to pre-train the visual feature extractor through the localization pretext task, improving the symbol localization and discrimination ability. Second, the structure understanding module is pre-trained using LaTeX corpora through language modeling task, which promotes the model’s context comprehension ability. The pre-trained model is fine-tuned and aligned on the downstream HMER task using benchmark datasets. Experiments on public datasets demonstrate that the pre-training paradigm significantly improves the mathematical expression recognition performance. Our VLPG achieves state-of-the-art performance on standard CROHME datasets and comparable performance on the HME100K dataset, highlighting the effectiveness and superiority of the proposed model. We released our codes at <span><span>https://github.com/guohy17/VLPG</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111346"},"PeriodicalIF":7.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149484","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Deep intelligent technique for person Re-identification system in surveillance images
Pub Date: 2025-01-10 | DOI: 10.1016/j.patcog.2025.111349
Ms. R. Mallika Alias Pandeeswari , Dr. G. Rajakumar
Person re-identification aims to re-identify a particular person captured by different surveillance cameras. However, it remains a challenging problem in surveillance systems: the considerable variation in light conditions, body poses, viewing angles, illumination, and occlusion makes it difficult to re-identify persons. Recently, the field has been significantly advanced by deep intelligence frameworks, but these still face limitations such as insufficient features and poor accuracy. Therefore, a novel Horned Lizard Googlenet Forecasting System (HLGFS) is developed in this research to achieve better person re-identification. The novelty of the work lies in integrating Horned Lizard optimization with GoogleNet for fine-tuned and efficient forecasting to re-identify the person. Initially, the surveillance images are preprocessed to filter low-level noise features. The relevant features are then extracted based on the Horned Lizard optimization function. Subsequently, by analyzing the extracted features, the identity of the person is recovered through matching and ranking. Moreover, the similarity percentage between the query and identified images is measured through structural similarity. The designed model is tested on the CUHK03, Market1501, and DukeMTMC re-id datasets on the Python platform. Finally, the forecasting efficiency of the approach is validated and compared with existing techniques. HLGFS achieves 97.8 % accuracy and 97.6 % mAP on the CUHK03 dataset, 97.68 % accuracy and 98.87 % mAP on the Market1501 dataset, and 96.65 % accuracy and 96.65 % mAP on the DukeMTMC re-id dataset.
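For the matching-and-ranking step mentioned above, a generic embedding-based ranking routine is sketched below; it assumes features have already been extracted by some backbone and is not the HLGFS pipeline itself:

import numpy as np

def rank_gallery(query_feat, gallery_feats, top_k=10):
    # query_feat: (D,) embedding of the query person; gallery_feats: (N, D) embeddings
    # from other cameras. Returns indices and cosine scores of the best matches, highest first.
    q = query_feat / (np.linalg.norm(query_feat) + 1e-12)
    g = gallery_feats / (np.linalg.norm(gallery_feats, axis=1, keepdims=True) + 1e-12)
    scores = g @ q
    order = np.argsort(-scores)[:top_k]
    return order, scores[order]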
{"title":"Deep intelligent technique for person Re-identification system in surveillance images","authors":"Ms. R. Mallika Alias Pandeeswari , Dr. G. Rajakumar","doi":"10.1016/j.patcog.2025.111349","DOIUrl":"10.1016/j.patcog.2025.111349","url":null,"abstract":"<div><div>Person re-identification is the system that aims to attain the re-identity of a particular person captured by different surveillance cameras. However, it is still a challenging problem in the surveillance system. The more considerable variation of light conditions, body poses, angles illumination, and occlusion makes it difficult for the system to re-identify the persons. Recently, the study has been significantly improved by the use of deep intelligence frameworks. However, it faces some limitations, such as insufficient features and poor accuracy. Therefore, a novel Horned Lizard Googlenet Forecasting System (HLGFS) is developed in this research to better result in person re-identification. The novelty of the research lies in integrating Horned Lizard optimization with GoogleNet for fine-tuned and efficient forecasting to re-identify the person. Initially, the surveillance images were preprocessed to filter the low-level noise features. Further, the relevant features were extracted based on the Horned Lizard optimization function. Subsequently, by analyzing the extracted features, the re-identity of the person is identified and received by matching and ranking. Moreover, the similarity percentage of the query and identified images was measured through structure similarity. The process of the designed model is tested using the CUHK03, Market1501, and DukeMTMC re-id dataset in the Python platform. Finally, the forecasting efficiency of the approach is validated and related to existing techniques. The accuracy of HLGFS is 97.8 %, and the mAP is 97.6 % for the CUHK03 dataset, with 97.68 % accuracy, and 98.87 % mAP for the Market1501 dataset and for the DukeMTMC re-id dataset, the model achieved 96.65 % accuracy and 96.65 % mAP.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111349"},"PeriodicalIF":7.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149491","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Edge-enhanced semi-supervised vertical convolutional neural network for tubular structure segmentation: Application to medical images
Pub Date: 2025-01-10 | DOI: 10.1016/j.patcog.2024.111302
Junyong Zhao , Liang Sun , Zhi Sun , Yanling Fu , Wei Shao , Xin Zhou , Haipeng Si , Daoqiang Zhang
Accurate segmentation of tubular structures in the human body is crucial for disease diagnosis and preoperative planning in clinical practice. However, achieving precise segmentation of tubular structures in medical images is challenging due to their highly intricate, furcated, and slender nature. This complexity also makes it difficult to obtain the substantial amount of labeled data necessary for training deep learning models. To address these challenges, we propose ESVC-Net, a novel Edge-enhanced Semi-supervised Vertical Convolutional neural network designed to produce accurate tubular structure segmentation. Unlike traditional convolution approaches at a single scale, we propose a cross-scale vertical convolution module, enabling the encoder to learn abundant multi-scale features for furcated and slender structures. To enhance discriminability around the boundary, we introduce an edge spatially adaptive enhancement module, which integrates edge features learned from an auxiliary edge detection task into the segmentation process. Furthermore, we employ a semi-supervised learning method, leveraging a significant amount of unlabeled data to enhance segmentation performance. We validate the effectiveness of ESVC-Net on two types of tubular structures: the lumbosacral plexus using MR images and the airway using CT images. Experimental results show the superiority of ESVC-Net over state-of-the-art methods.
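The semi-supervised part of the method leverages unlabeled scans; a common way to do this, shown purely as an illustrative sketch rather than ESVC-Net's actual losses, is to add a consistency term on unlabeled predictions to the supervised segmentation loss:

import torch
import torch.nn.functional as F

def semi_supervised_seg_loss(logits_labeled, labels, logits_student, logits_teacher,
                             cons_weight=0.1):
    # Supervised cross-entropy on labeled scans plus a mean-teacher style
    # consistency term on unlabeled scans. Illustrative sketch only.
    sup = F.cross_entropy(logits_labeled, labels)
    cons = F.mse_loss(torch.softmax(logits_student, dim=1),
                      torch.softmax(logits_teacher, dim=1).detach())
    return sup + cons_weight * cons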
{"title":"Edge-enhanced semi-supervised vertical convolutional neural network for tubular structure segmentation: Application to medical images","authors":"Junyong Zhao , Liang Sun , Zhi Sun , Yanling Fu , Wei Shao , Xin Zhou , Haipeng Si , Daoqiang Zhang","doi":"10.1016/j.patcog.2024.111302","DOIUrl":"10.1016/j.patcog.2024.111302","url":null,"abstract":"<div><div>Accurate segmentation of tubular structures in the human body is crucial for disease diagnosis and preoperative planning in clinical practice. However, achieving precision in segmenting tubular structures in medical images proves challenging due to their highly intricate, furcated, and slender nature. This complexity also challenges obtaining a substantial amount of labeled data necessary for training deep learning models. To address these challenges, we propose ESVC-Net, a novel Edge-enhanced Semi-supervised Vertical Convolutional neural network designed to produce accurate tubular structure segmentation. Unlike traditional convolution approaches at a single scale, we propose a cross-scale vertical convolution module, enabling the learning of abundant multi-scale features for furcated and slender structures in the encoder. To enhance discriminability around the boundary, we introduce an edge spatially adaptive enhancement module. This module integrates edge features learned from the auxiliary edge detection task into the segmentation process. Furthermore, we employ a semi-supervised learning method, leveraging a significant amount of unlabeled data to enhance segmentation performance. We validate the effectiveness of ESVC-Net on two types of tubular structures: the lumbosacral plexus using MR images and the airway using CT images. Experimental results show that the superiority of ESVC-Net over state-of-the-art methods.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111302"},"PeriodicalIF":7.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149864","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Cross-modal knowledge transfer for 3D point clouds via graph offset prediction
Pub Date: 2025-01-10 | DOI: 10.1016/j.patcog.2025.111351
Huang Zhang , Long Yu , Guoqi Wang , Shengwei Tian , Zaiyang Yu , Weijun Li , Xin Ning
A point cloud is an important representation of three-dimensional (3D) objects and plays an important role in computer vision. However, the inherent sparseness and disorder of point clouds do not provide a representation as stable as 2D image pixels. A graph convolutional neural network (GCNN) can generate local neighborhood descriptions of the 3D modality by constructing a graph, but it is difficult for it to capture relationships between distant points. This study proposes a hierarchical encoder based on graph offset convolution, which aggregates long- and short-distance relationships within local neighborhoods to extend the graph semantic information contained in the adjacency matrix. Furthermore, to address the difficulty of aligning point clouds with image features while avoiding the limitations of text annotations, we introduce a joint point–view pre-training strategy. This strategy learns a unified representation of the two modalities, improving the network’s comprehension of the limited 3D data. Finally, a cross-modal alignment is used to map point and view information to the same feature space, thereby constraining the training states of the two modalities. The proposed method is validated on both standard and zero-shot classification tasks, showing excellent performance. The proposed 3D backbone network achieves 93.6% overall accuracy on the ModelNet40 dataset, with the pre-training strategy improving model performance by 1.3%.
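The cross-modal alignment that maps point and view information to the same feature space is commonly realized with a symmetric contrastive objective; the sketch below shows such a CLIP-style alignment loss as an assumption about the general recipe, not the paper's exact formulation:

import torch
import torch.nn.functional as F

def point_view_alignment_loss(point_emb, view_emb, temperature=0.07):
    # point_emb, view_emb: (B, D) embeddings of the same B objects from the point
    # and view branches. Symmetric InfoNCE pulls matching pairs together in one space.
    p = F.normalize(point_emb, dim=1)
    v = F.normalize(view_emb, dim=1)
    logits = p @ v.t() / temperature
    targets = torch.arange(p.size(0), device=p.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))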
{"title":"Cross-modal knowledge transfer for 3D point clouds via graph offset prediction","authors":"Huang Zhang , Long Yu , Guoqi Wang , Shengwei Tian , Zaiyang Yu , Weijun Li , Xin Ning","doi":"10.1016/j.patcog.2025.111351","DOIUrl":"10.1016/j.patcog.2025.111351","url":null,"abstract":"<div><div>A Point cloud is an important representation of three-dimensional (3D) objects, playing an important role in computer vision. However, the inherent sparseness and disorder of point clouds do not provide a stable representation comparable to 2D image pixels. Graph convolutional neural network (GCNN) can generate local neighborhood descriptions of 3D modalities by constructing a graph but it is difficult to capture relationships between distant points. This study proposes a hierarchical encoder based on graph offset convolution, which aggregates the long short distance relationships within local neighborhoods to extend the graph semantic information contained in the adjacency matrix. Furthermore, to address the difficulty of aligning point clouds with image features while avoiding the limitations of text annotations, we introduce a joint point-view pre-training strategy. This strategy learns a unified representation of the two modalities, improving the network’s comprehension of the limited 3D data. Finally, a cross-modal alignment is used to map point and view information to the same feature space, thereby constraining the training states of the two modalities. The proposed method is validated on both standard and zero-shot classification tasks, showing excellent performance. The proposed 3D backbone network achieves 93.6% overall accuracy on the ModelNet40 dataset, with our pre-training strategy that improves the performance of the model by 1.3%.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111351"},"PeriodicalIF":7.5,"publicationDate":"2025-01-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149866","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
ACQC-LJP: Apollonius circle-based quantum clustering using Lennard-Jones potential
Pub Date: 2025-01-09 | DOI: 10.1016/j.patcog.2025.111342
Nasim Abdolmaleki , Leyli Mohammad Khanli , Mahdi Hashemzadeh , Shahin Pourbahrami
Quantum Clustering (QC) is widely regarded as a powerful method in unsupervised learning problems. This method forms a potential function using a wave function as a superposition of Gaussian probability functions centered at data points. Clusters are then identified by locating the minima of the potential function. However, QC is highly sensitive to the kernel bandwidth parameter in the Schrödinger equation, which controls the shape of the Gaussian kernel and affects the potential function's minima. This paper proposes an Apollonius Circle-based Quantum Clustering (ACQC) method using Lennard-Jones Potential (LJP), entitled ACQC-LJP, to address this limitation. ACQC-LJP introduces a novel approach to clustering by leveraging LJP to screen dense points and constructing Apollonius circle-based neighborhood groups, enabling the extraction of adaptive kernel bandwidths to effectively resolve the kernel bandwidth issue. Experimental results on real-world and synthetic datasets demonstrate that ACQC-LJP improves cluster detection accuracy by 50% compared to the original QC and by 10% compared to the ACQC method. Furthermore, the computational cost is reduced by more than 90% through localized calculations. ACQC-LJP outperforms state-of-the-art methods on diverse datasets, including those with small sample sizes, high feature variability, and imbalanced class distributions. These findings highlight the method's robustness and effectiveness across various challenging scenarios, marking it as a significant advancement in unsupervised learning. All the implementation source codes of ACQC-LJP are available at https://github.com/NAbdolmaleki/ACQC-LJP.
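For orientation, the standard quantum-clustering construction that ACQC-LJP builds on (a Gaussian Parzen wave function and the potential recovered from the Schrödinger equation) and the 12-6 Lennard-Jones potential have the following textbook forms; how the paper combines them for bandwidth adaptation is its own contribution:

\psi(\mathbf{x}) = \sum_{i=1}^{n} \exp\!\left(-\frac{\lVert \mathbf{x}-\mathbf{x}_i\rVert^{2}}{2\sigma^{2}}\right),
\qquad
V(\mathbf{x}) = E + \frac{\sigma^{2}}{2}\,\frac{\nabla^{2}\psi(\mathbf{x})}{\psi(\mathbf{x})},
\qquad
V_{\mathrm{LJ}}(r) = 4\varepsilon\left[\left(\frac{s}{r}\right)^{12} - \left(\frac{s}{r}\right)^{6}\right],

where \sigma is the kernel bandwidth (the parameter ACQC-LJP adapts per Apollonius-circle neighborhood), cluster centers are taken at the minima of V, and \varepsilon and s set the depth and range of the Lennard-Jones well.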
{"title":"ACQC-LJP: Apollonius circle-based quantum clustering using Lennard-Jones potential","authors":"Nasim Abdolmaleki , Leyli Mohammad Khanli , Mahdi Hashemzadeh , Shahin Pourbahrami","doi":"10.1016/j.patcog.2025.111342","DOIUrl":"10.1016/j.patcog.2025.111342","url":null,"abstract":"<div><div>Quantum Clustering (QC) is widely regarded as a powerful method in unsupervised learning problems. This method forms a potential function using a wave function as a superposition of Gaussian probability functions centered at data points. Clusters are then identified by locating the minima of the potential function. However, QC is highly sensitive to the kernel bandwidth parameter in the Schrödinger equation, which controls the shape of the Gaussian kernel, and affects the potential function's minima. This paper proposes an Apollonius Circle-based Quantum Clustering (ACQC) method using Lennard-Jones Potential (LJP), entitled ACQC-LJP, to address this limitation. ACQC-LJP introduces a novel approach to clustering by leveraging LJP to screen dense points and constructing Apollonius circle-based neighborhood groups, enabling the extraction of adaptive kernel bandwidths to effectively resolve the kernel bandwidth issue. Experimental results on real-world and synthetic datasets demonstrate that ACQC-LJP improves cluster detection accuracy by 50% compared to the original QC and by 10% compared to the ACQC method. Furthermore, the computational cost is reduced by more than 90% through localized calculations. ACQC-LJP outperforms state-of-the-art methods on diverse datasets, including those with small sample sizes, high feature variability, and imbalanced class distributions. These findings highlight the method's robustness and effectiveness across various challenging scenarios, marking it as a significant advancement in unsupervised learning. All the implementation source codes of ACQC-LJP are available at <span><span>https://github.com/NAbdolmaleki/ACQC-LJP</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"161 ","pages":"Article 111342"},"PeriodicalIF":7.5,"publicationDate":"2025-01-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143146456","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
HAN: An efficient hierarchical self-attention network for skeleton-based gesture recognition
Pub Date: 2025-01-07 | DOI: 10.1016/j.patcog.2025.111343
Jianbo Liu , Ying Wang , Shiming Xiang , Chunhong Pan
Previous methods for skeleton-based gesture recognition mostly arrange the skeleton sequence into a pseudo-image or spatial–temporal graph and apply a deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) for feature extraction. Although these methods achieve superior results, computing efficiency remains a serious issue. In this paper, we concentrate on designing an extremely lightweight model for skeleton-based gesture recognition using pure self-attention modules. With dynamic attention weights, a self-attention module is able to aggregate the features of the most informative joints using a shallow network. Considering the hierarchical structure of hand joints and inspired by the idea of divide-and-conquer, we propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition. The hierarchical design further reduces computation cost and allows the network to explicitly extract finger-level spatial–temporal features, which further improves performance. Specifically, the joint self-attention module is used to capture spatial features of fingers, and the finger self-attention module is designed to aggregate features of the whole hand. In terms of temporal features, the temporal self-attention module is utilized to capture the temporal dynamics of the fingers and the entire hand. Finally, these features are fused by the fusion self-attention module for gesture classification. Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.
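To make the joint/finger hierarchy concrete, the toy module below attends over the joints of each finger and then over per-finger summaries; the dimensions, grouping, and pooling are illustrative assumptions, not HAN's actual configuration:

import torch
import torch.nn as nn

class FingerLevelAttention(nn.Module):
    # Two-stage hierarchy in the spirit of the joint/finger attention described above:
    # attend over the joints of each finger, then over the finger summaries.
    def __init__(self, dim=64, heads=4, joints_per_finger=4, fingers=5):
        super().__init__()
        self.joint_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.finger_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.jpf, self.f = joints_per_finger, fingers

    def forward(self, x):                      # x: (B, fingers * joints_per_finger, dim)
        B, _, D = x.shape
        per_finger = x.view(B * self.f, self.jpf, D)
        j, _ = self.joint_attn(per_finger, per_finger, per_finger)
        finger_tokens = j.mean(dim=1).view(B, self.f, D)   # one summary token per finger
        h, _ = self.finger_attn(finger_tokens, finger_tokens, finger_tokens)
        return h.mean(dim=1)                   # hand-level feature of shape (B, dim)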
{"title":"HAN: An efficient hierarchical self-attention network for skeleton-based gesture recognition","authors":"Jianbo Liu , Ying Wang , Shiming Xiang , Chunhong Pan","doi":"10.1016/j.patcog.2025.111343","DOIUrl":"10.1016/j.patcog.2025.111343","url":null,"abstract":"<div><div>Previous methods for skeleton-based gesture recognition mostly arrange the skeleton sequence into a pseudo image or spatial–temporal graph and apply a deep Convolutional Neural Network (CNN) or Graph Convolutional Network (GCN) for feature extraction. Although achieving superior results, the computing efficiency still remains a serious issue. In this paper, we concentrate on designing an extremely lightweight model for skeleton-based gesture recognition using pure self-attention module. With dynamic attention weights, self-attention module is able to aggregate the features of the most informative joints using a shallow network. Considering the hierarchical structure of hand joints and inspired by the idea of divide-and-conquer, we propose an efficient hierarchical self-attention network (HAN) for skeleton-based gesture recognition. The hierarchical design can further reduce computation cost and allow the network to explicitly extract finger-level spatial temporal features, which further improves the performance of the model. Specifically, the joint self-attention module is used to capture spatial features of fingers, the finger self-attention module is designed to aggregate features of the whole hand. In terms of temporal features, the temporal self-attention module is utilized to capture the temporal dynamics of the fingers and the entire hand. Finally, these features are fused by the fusion self-attention module for gesture classification. Experiments show that our method achieves competitive results on three gesture recognition datasets with much lower computational complexity.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111343"},"PeriodicalIF":7.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143149488","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
On learning discriminative embeddings for optimized top-k matching
Pub Date: 2025-01-07 | DOI: 10.1016/j.patcog.2025.111341
Soumyadeep Ghosh , Mayank Vatsa , Richa Singh , Nalini Ratha
Optimizing overall classification accuracy in neural networks does not always yield the best top-k accuracy, a critical metric in many real-world applications. This discrepancy is particularly evident in scenarios where multiple classes exhibit high similarity and overlap in the embedding space, leading to class ambiguity during retrieval. Addressing this challenge, the paper proposes a novel method to enhance top-k matching performance by leveraging class relationships in the embedding space. The proposed approach first employs a clustering algorithm to group similar classes into superclusters, capturing their inherent similarity. Next, the compactness of these superclusters is optimized while preserving the discriminative properties of individual classes. This dual optimization improves the separability of classes within superclusters and enhances retrieval accuracy in ambiguous scenarios. Experimental results on diverse datasets, including STL-10, CIFAR-10, CIFAR-100, Stanford Online Products, CARS196, and SCface, demonstrate significant improvements in top-k accuracy, validating the effectiveness and generalizability of the proposed method.
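Two ingredients mentioned above, grouping similar classes into superclusters and evaluating top-k retrieval, can be sketched as follows; clustering per-class prototypes with k-means is an assumption about one plausible choice, not necessarily the paper's algorithm:

import numpy as np
from sklearn.cluster import KMeans

def form_superclusters(class_prototypes, n_super):
    # class_prototypes: (C, D) per-class mean embeddings; returns a supercluster
    # id for every class. Illustrative first step only.
    km = KMeans(n_clusters=n_super, n_init=10, random_state=0).fit(class_prototypes)
    return km.labels_

def top_k_accuracy(query_emb, query_labels, gallery_emb, gallery_labels, k=5):
    # Fraction of queries whose true class appears among the k nearest gallery items.
    d = np.linalg.norm(query_emb[:, None, :] - gallery_emb[None, :, :], axis=-1)
    nearest = np.argsort(d, axis=1)[:, :k]
    hits = (gallery_labels[nearest] == query_labels[:, None]).any(axis=1)
    return hits.mean()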
{"title":"On learning discriminative embeddings for optimized top-k matching","authors":"Soumyadeep Ghosh , Mayank Vatsa , Richa Singh , Nalini Ratha","doi":"10.1016/j.patcog.2025.111341","DOIUrl":"10.1016/j.patcog.2025.111341","url":null,"abstract":"<div><div>Optimizing overall classification accuracy in neural networks does not always yield the best top-<span><math><mi>k</mi></math></span> accuracy, a critical metric in many real-world applications. This discrepancy is particularly evident in scenarios where multiple classes exhibit high similarity and overlap in the embedding space, leading to class ambiguity during retrieval. Addressing this challenge, the paper proposes a novel method to enhance top-<span><math><mi>k</mi></math></span> matching performance by leveraging class relationships in the embedding space. The proposed approach first employs a clustering algorithm to group similar classes into superclusters, capturing their inherent similarity. Next, the compactness of these superclusters is optimized while preserving the discriminative properties of individual classes. This dual optimization improves the separability of classes within superclusters and enhances retrieval accuracy in ambiguous scenarios. Experimental results on diverse datasets, including STL-10, CIFAR-10, CIFAR-100, Stanford Online Products, CARS196, and SCface, demonstrate significant improvements in top-<span><math><mi>k</mi></math></span> accuracy, validating the effectiveness and generalizability of the proposed method.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111341"},"PeriodicalIF":7.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143297831","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SRAGAN: Saliency regularized and attended generative adversarial network for Chinese ink-wash painting style transfer
Pub Date: 2025-01-07 | DOI: 10.1016/j.patcog.2025.111344
Xiang Gao , Yuqi Zhang
Recent style transfer problems are still largely dominated by Generative Adversarial Networks (GANs) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image can easily be erased or corrupted by the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized in two ways: (i) we propose a saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (ii) we propose saliency adaptive normalization (SANorm), which implicitly enhances the object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide the stylization process. Besides, we also propose a saliency-attended discriminator that harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate the superiority of our approach over related advanced image stylization methods in both the GAN and diffusion model paradigms.
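A plausible reading of the SIOU loss, enforcing saliency consistency before and after stylization via a soft IoU, is sketched below; this is an assumed form for illustration, not necessarily the paper's exact definition:

import torch

def saliency_iou_loss(sal_src, sal_stylized, eps=1e-6):
    # sal_src, sal_stylized: (B, 1, H, W) saliency maps in [0, 1] for the source
    # photo and its stylized output; the soft IoU penalizes content being washed away.
    inter = (sal_src * sal_stylized).sum(dim=(1, 2, 3))
    union = (sal_src + sal_stylized - sal_src * sal_stylized).sum(dim=(1, 2, 3))
    return (1.0 - (inter + eps) / (union + eps)).mean()   # 0 when the maps coincide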
{"title":"SRAGAN: Saliency regularized and attended generative adversarial network for Chinese ink-wash painting style transfer","authors":"Xiang Gao , Yuqi Zhang","doi":"10.1016/j.patcog.2025.111344","DOIUrl":"10.1016/j.patcog.2025.111344","url":null,"abstract":"<div><div>Recent style transfer problems are still largely dominated by Generative Adversarial Network (GAN) from the perspective of cross-domain image-to-image (I2I) translation, where the pivotal issue is to learn and transfer target-domain style patterns onto source-domain content images. This paper handles the problem of translating real pictures into traditional Chinese ink-wash paintings, i.e., Chinese ink-wash painting style transfer. Though a wide range of I2I models tackle this problem, a notable challenge is that the content details of the source image could be easily erased or corrupted due to the transfer of ink-wash style elements. To remedy this issue, we propose to incorporate saliency detection into the unpaired I2I framework to regularize image content, where the detected saliency map is utilized from two aspects: (i) we propose saliency IOU (SIOU) loss to explicitly regularize object content structure by enforcing saliency consistency before and after image stylization; (ii) we propose saliency adaptive normalization (SANorm) which implicitly enhances object structure integrity of the generated paintings by dynamically injecting image saliency information into the generator to guide stylization process. Besides, we also propose saliency attended discriminator which harnesses image saliency information to focus generative adversarial attention onto the drawn objects, contributing to generating more vivid and delicate brush strokes and ink-wash textures. Extensive qualitative and quantitative experiments demonstrate superiority of our approach over related advanced image stylization methods in both GAN and diffusion model paradigms.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"162 ","pages":"Article 111344"},"PeriodicalIF":7.5,"publicationDate":"2025-01-07","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"143150565","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}