Use estimated signal and noise to adjust step size for image restoration
Pub Date: 2024-09-11 | DOI: 10.1016/j.patrec.2024.09.006
Min Zhang, Shupeng Liu, Taihao Li, Huai Chen, Xiaoyin Xu
Image deblurring is a challenging inverse problem, especially when additive noise corrupts the observation. To solve such an inverse problem iteratively, it is important to control the step size to achieve stable and robust performance. We designed a method that controls the progress of the iterative process without requiring a user-specified step size. The method searches for an optimal step size under the assumption that the signal and the noise are two independent stochastic processes. Experiments show that the method achieves good performance in the presence of noise and imperfect knowledge of the blurring kernel. Tests also show that, across different blurring kernels and noise levels, the difference between two consecutive estimates given by the new method remains more stable and stays within a smaller range than those given by some existing techniques. This stability makes the new method more robust, in the sense that a stopping threshold is easier to select and reuse across different scenarios.
{"title":"Use estimated signal and noise to adjust step size for image restoration","authors":"Min Zhang , Shupeng Liu , Taihao Li , Huai Chen , Xiaoyin Xu","doi":"10.1016/j.patrec.2024.09.006","DOIUrl":"10.1016/j.patrec.2024.09.006","url":null,"abstract":"<div><p>Image deblurring is a challenging inverse problem, especially when there is additive noise to the observation. To solve such an inverse problem in an iterative manner, it is important to control the step size for achieving a stable and robust performance. We designed a method that controls the progress of iterative process in solving the inverse problem without the need for a user-specified step size. The method searches for an optimal step size under the assumption that the signal and noise are two independent stochastic processes. Experiments show that the method can achieve good performance in the presence of noise and imperfect knowledge about the blurring kernel. Tests also show that, for different blurring kernels and noise levels, the difference between two consecutive estimates given by the new method tends to remain more stable and stay in a smaller range, as compared to those given by some existing techniques. This stable feature makes the new method more robust in the sense that it is easier to select a stopping threshold for the new method to use in different scenarios.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 57-63"},"PeriodicalIF":3.9,"publicationDate":"2024-09-11","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142171797","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
One-index vector quantization based adversarial attack on image classification
Pub Date: 2024-09-06 | DOI: 10.1016/j.patrec.2024.09.001
Haiju Fan, Xiaona Qin, Shuang Chen, Hubert P. H. Shum, Ming Li
To improve storage and transmission, images are generally compressed. Vector quantization (VQ) is a popular compression method, as its compression ratio surpasses that of other compression techniques. Despite this, existing adversarial attack methods on image classification are mostly performed in the pixel domain, with few exceptions in the compressed domain, making them less applicable in real-world scenarios. In this paper, we propose a novel one-index attack method in the VQ domain that generates adversarial images with a differential evolution algorithm, successfully causing misclassification in victim models. The one-index attack modifies a single index in the compressed data stream so that the decompressed image is misclassified; only one VQ index needs to be modified to realize an attack, which limits the number of perturbed indices. The proposed method is a semi-black-box attack, which is more consistent with real-world attack scenarios. We apply our method to attack three popular image classification models, i.e., ResNet, NIN, and VGG16. On average, 55.9% and 77.4% of the images in CIFAR-10 and Fashion MNIST, respectively, are successfully attacked, with high misclassification confidence and low image perturbation.
{"title":"One-index vector quantization based adversarial attack on image classification","authors":"Haiju Fan , Xiaona Qin , Shuang Chen , Hubert P. H. Shum , Ming Li","doi":"10.1016/j.patrec.2024.09.001","DOIUrl":"10.1016/j.patrec.2024.09.001","url":null,"abstract":"<div><p>To improve storage and transmission, images are generally compressed. Vector quantization (VQ) is a popular compression method as it has a high compression ratio that suppresses other compression techniques. Despite this, existing adversarial attack methods on image classification are mostly performed in the pixel domain with few exceptions in the compressed domain, making them less applicable in real-world scenarios. In this paper, we propose a novel one-index attack method in the VQ domain to generate adversarial images by a differential evolution algorithm, successfully resulting in image misclassification in victim models. The one-index attack method modifies a single index in the compressed data stream so that the decompressed image is misclassified. It only needs to modify a single VQ index to realize an attack, which limits the number of perturbed indexes. The proposed method belongs to a semi-black-box attack, which is more in line with the actual attack scenario. We apply our method to attack three popular image classification models, i.e., Resnet, NIN, and VGG16. On average, 55.9 % and 77.4 % of the images in CIFAR-10 and Fashion MNIST, respectively, are successfully attacked, with a high level of misclassification confidence and a low level of image perturbation.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 47-56"},"PeriodicalIF":3.9,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167865524002575/pdfft?md5=96833f101476805d73c37d5dd7083f2c&pid=1-s2.0-S0167865524002575-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142171796","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers
Pub Date: 2024-09-06 | DOI: 10.1016/j.patrec.2024.08.024
Adjorn van Engelenhoven, Nicola Strisciuglio, Estefanía Talavera
The Transformer architecture has proven to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, an inherently expensive operation with quadratic computational complexity: memory usage and compute time grow quadratically with the length of the input sequence, limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST) to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, which is used to cluster the input sequence and generate novel cluster summaries. The self-attention within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from O(N^2) to O(αN), where N is the sequence length and α is a constant determined by the number of clusters and the number of samples per cluster. We show that CAST performs better than or comparably to baseline Transformers on long-range sequence modeling tasks, while also achieving better time and memory efficiency than other efficient transformers.
{"title":"CAST: Clustering self-Attention using Surrogate Tokens for efficient transformers","authors":"Adjorn van Engelenhoven, Nicola Strisciuglio, Estefanía Talavera","doi":"10.1016/j.patrec.2024.08.024","DOIUrl":"10.1016/j.patrec.2024.08.024","url":null,"abstract":"<div><p>The Transformer architecture has shown to be a powerful tool for a wide range of tasks. It is based on the self-attention mechanism, which is an inherently computationally expensive operation with quadratic computational complexity: memory usage and compute time increase quadratically with the length of the input sequences, thus limiting the application of Transformers. In this work, we propose a novel Clustering self-Attention mechanism using Surrogate Tokens (CAST), to optimize the attention computation and achieve efficient transformers. CAST utilizes learnable surrogate tokens to construct a cluster affinity matrix, used to cluster the input sequence and generate novel cluster summaries. The self-attention from within each cluster is then combined with the cluster summaries of other clusters, enabling information flow across the entire input sequence. CAST improves efficiency by reducing the complexity from <span><math><mrow><mi>O</mi><mrow><mo>(</mo><msup><mrow><mi>N</mi></mrow><mrow><mn>2</mn></mrow></msup><mo>)</mo></mrow></mrow></math></span> to <span><math><mrow><mi>O</mi><mrow><mo>(</mo><mi>α</mi><mi>N</mi><mo>)</mo></mrow></mrow></math></span> where <span><math><mi>N</mi></math></span> is the sequence length, and <span><math><mi>α</mi></math></span> is constant according to the number of clusters and samples per cluster. We show that CAST performs better than or comparable to the baseline Transformers on long-range sequence modeling tasks, while also achieving higher results on time and memory efficiency than other efficient transformers.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 30-36"},"PeriodicalIF":3.9,"publicationDate":"2024-09-06","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167865524002563/pdfft?md5=41d75a76c8436c27473bdc1f0c0144be&pid=1-s2.0-S0167865524002563-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142164130","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Editorial for Pattern Recognition Letters special issue on Advances in Disinformation Detection and Media Forensics
Pub Date: 2024-09-05 | DOI: 10.1016/j.patrec.2024.09.004
Irene Amerini, Victor Sanchez, Luca Maiano
{"title":"Editorial for pattern recognition letters special issue on Advances in Disinformation Detection and Media Forensics","authors":"Irene Amerini, Victor Sanchez, Luca Maiano","doi":"10.1016/j.patrec.2024.09.004","DOIUrl":"10.1016/j.patrec.2024.09.004","url":null,"abstract":"","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 21-22"},"PeriodicalIF":3.9,"publicationDate":"2024-09-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142164129","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SiamMAF: A multipath and feature-enhanced thermal infrared tracker
Pub Date: 2024-09-03 | DOI: 10.1016/j.patrec.2024.09.003
Weisheng Li, Yuhao Fang, Lanbing Lv, Shunping Chen
Thermal infrared (TIR) images are visually blurred and low in information content. Some TIR trackers focus on enhancing the semantic information of TIR features while neglecting the equally important detailed information for TIR tracking. After target localization, detailed information can help the tracker generate accurate prediction boxes. In addition, simple element-wise addition does not fully utilize and fuse multiple response maps. To address these issues, this study proposes a multipath and feature-enhanced Siamese tracker (SiamMAF) for TIR tracking. We design a feature-enhanced module (FEM) based on complementarity, which can highlight the key semantic information of the target and preserve the detailed information of objects. Furthermore, we introduce a response fusion module (RFM) that can adaptively fuse multiple response maps. Extensive experimental results on two challenging benchmarks show that SiamMAF outperforms many existing state-of-the-art TIR trackers and runs at a steady 31 FPS.
{"title":"SiamMAF: A multipath and feature-enhanced thermal infrared tracker","authors":"Weisheng Li, Yuhao Fang, Lanbing Lv, Shunping Chen","doi":"10.1016/j.patrec.2024.09.003","DOIUrl":"10.1016/j.patrec.2024.09.003","url":null,"abstract":"<div><p>Thermal infrared (TIR) images are visually blurred and low in information content. Some TIR trackers focus on enhancing the semantic information of TIR features, neglecting the equally important detailed information for TIR tracking. After target localization, detailed information can assist the tracker in generating accurate prediction boxes. In addition, simple element-wise addition is not a way to fully utilize and fuse multiple response maps. To address these issues, this study proposes a multipath and feature-enhanced Siamese tracker (SiamMAF) for TIR tracking. We design a feature-enhanced module (FEM) based on complementarity, which can highlight the key semantic information of the target and preserve the detailed information of objects. Furthermore, we introduce a response fusion module (RFM) that can adaptively fuse multiple response maps. Extensive experimental results on two challenging benchmarks show that SiamMAF outperforms many existing state-of-the-art TIR trackers and runs at a steady 31FPS.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 37-46"},"PeriodicalIF":3.9,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142169501","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Visual speech recognition using compact hypercomplex neural networks
Pub Date: 2024-09-03 | DOI: 10.1016/j.patrec.2024.09.002
Iason Ioannis Panagos, Giorgos Sfikas, Christophoros Nikou
Recent progress in visual speech recognition systems, driven by advances in deep learning and large-scale public datasets, has led to impressive performance compared to human professionals. The potential applications of these systems in real-life scenarios are numerous and can greatly benefit the lives of many individuals. However, most of these systems are not designed with practicality in mind: they require large models and powerful hardware, factors that limit their applicability in resource-constrained environments and other real-world tasks. In addition, few works focus on developing lightweight systems that can be deployed in such conditions. Considering these issues, we propose compact networks that take advantage of hypercomplex layers, which utilize a sum of Kronecker products to reduce overall parameter demands and model sizes. We train and evaluate our proposed models on the largest public dataset for single-word speech recognition in English. Our experiments show that high compression rates are achievable with a minimal accuracy drop, indicating the method's potential for practical applications in lower-resource environments. Code and models are available at https://github.com/jpanagos/vsr_phm.
{"title":"Visual speech recognition using compact hypercomplex neural networks","authors":"Iason Ioannis Panagos , Giorgos Sfikas , Christophoros Nikou","doi":"10.1016/j.patrec.2024.09.002","DOIUrl":"10.1016/j.patrec.2024.09.002","url":null,"abstract":"<div><p>Recent progress in visual speech recognition systems due to advances in deep learning and large-scale public datasets has led to impressive performance compared to human professionals. The potential applications of these systems in real-life scenarios are numerous and can greatly benefit the lives of many individuals. However, most of these systems are not designed with practicality in mind, requiring large-size models and powerful hardware, factors which limit their applicability in resource-constrained environments and other real-world tasks. In addition, few works focus on developing lightweight systems that can be deployed in such conditions. Considering these issues, we propose compact networks that take advantage of hypercomplex layers that utilize a sum of Kronecker products to reduce overall parameter demands and model sizes. We train and evaluate our proposed models on the largest public dataset for single word speech recognition for English. Our experiments show that high compression rates are achievable with a minimal accuracy drop, indicating the method’s potential for practical applications in lower-resource environments. Code and models are available at <span><span>https://github.com/jpanagos/vsr_phm</span><svg><path></path></svg></span>.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 1-7"},"PeriodicalIF":3.9,"publicationDate":"2024-09-03","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142164128","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A method for evaluating deep generative models of images for hallucinations in high-order spatial context
Pub Date: 2024-09-02 | DOI: 10.1016/j.patrec.2024.08.023
Rucha Deshpande, Mark A. Anastasio, Frank J. Brooks
Deep generative models (DGMs) have the potential to revolutionize diagnostic imaging. Generative adversarial networks (GANs) are one widely employed kind of DGM. The overarching problem with deploying any sort of DGM in mission-critical applications is the lack of adequate and/or automatic means of assessing the domain-specific quality of generated images. In this work, we demonstrate several objective, human-interpretable tests of images output by two popular DGMs. These tests serve two goals: (i) ruling out DGMs for downstream, domain-specific applications, and (ii) quantifying hallucinations in the expected spatial context of DGM-generated images. The designed datasets are made public, and the proposed tests could also serve as benchmarks and aid the prototyping of emerging DGMs. Although these tests are demonstrated on GANs, they can be employed as a benchmark for evaluating any DGM. Specifically, we designed several stochastic context models (SCMs) of distinct image features that can be recovered after generation by a trained DGM. Together, these SCMs encode features as per-image constraints on prevalence, position, intensity, and/or texture. Several of these features are high-order, algorithmic pixel-arrangement rules that are not readily expressed in covariance matrices. We designed and validated statistical classifiers to detect specific effects of the known arrangement rules. We then tested the rates at which two different DGMs correctly reproduced the feature context under a variety of training scenarios and degrees of feature-class similarity. We found that ensembles of generated images can appear largely accurate visually and score highly on ensemble measures while not exhibiting the known spatial arrangements. The main conclusion is that SCMs can be engineered, and serve as benchmarks, to quantify numerous per-image errors, i.e., hallucinations, that may not be captured in ensemble statistics but can plausibly affect subsequent use of the DGM-generated images.
{"title":"A method for evaluating deep generative models of images for hallucinations in high-order spatial context","authors":"Rucha Deshpande , Mark A. Anastasio , Frank J. Brooks","doi":"10.1016/j.patrec.2024.08.023","DOIUrl":"10.1016/j.patrec.2024.08.023","url":null,"abstract":"<div><p>Deep generative models (DGMs) have the potential to revolutionize diagnostic imaging. Generative adversarial networks (GANs) are one kind of DGM which are widely employed. The overarching problem with deploying any sort of DGM in mission-critical applications is a lack of adequate and/or automatic means of assessing the domain-specific quality of generated images. In this work, we demonstrate several objective and human-interpretable tests of images output by two popular DGMs. These tests serve two goals: (i) ruling out DGMs for downstream, domain-specific applications, and (ii) quantifying hallucinations in the expected spatial context in DGM-generated images. The designed datasets are made public and the proposed tests could also serve as benchmarks and aid the prototyping of emerging DGMs. Although these tests are demonstrated on GANs, they can be employed as a benchmark for evaluating any DGM. Specifically, we designed several stochastic context models (SCMs) of distinct image features that can be recovered after generation by a trained DGM. Together, these SCMs encode features as per-image constraints in prevalence, position, intensity, and/or texture. Several of these features are high-order, algorithmic pixel-arrangement rules which are not readily expressed in covariance matrices. We designed and validated statistical classifiers to detect specific effects of the known arrangement rules. We then tested the rates at which two different DGMs correctly reproduced the feature context under a variety of training scenarios, and degrees of feature-class similarity. We found that ensembles of generated images can appear largely accurate visually, and show high accuracy in ensemble measures, while not exhibiting the known spatial arrangements. The main conclusion is that SCMs can be engineered, and serve as benchmarks, to quantify numerous <em>per image</em> errors, <em>i.e.</em>, hallucinations, that may not be captured in ensemble statistics but plausibly can affect subsequent use of the DGM-generated images.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"186 ","pages":"Pages 23-29"},"PeriodicalIF":3.9,"publicationDate":"2024-09-02","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"https://www.sciencedirect.com/science/article/pii/S0167865524002551/pdfft?md5=5df7937160b427d56d6a3c847ac5fdfc&pid=1-s2.0-S0167865524002551-main.pdf","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142164131","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Introduction to the special section "Advances trends of pattern recognition for intelligent systems applications" (SS:ISPR23)
Pub Date: 2024-09-01 | DOI: 10.1016/j.patrec.2024.08.005
Akram Bennour, Tolga Ensari, Mohammed Al-Shabi
{"title":"Introduction to the special section “Advances trends of pattern recognition for intelligent systems applications” (SS:ISPR23)","authors":"Akram Bennour , Tolga Ensari , Mohammed Al-Shabi","doi":"10.1016/j.patrec.2024.08.005","DOIUrl":"10.1016/j.patrec.2024.08.005","url":null,"abstract":"","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"185 ","pages":"Page 271"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142185733","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A lightweight attention-driven distillation model for human pose estimation
Pub Date: 2024-09-01 | DOI: 10.1016/j.patrec.2024.08.009
Falai Wei, Xiaofang Hu
Currently, research on human pose estimation primarily focuses on heatmap-based and regression-based methods. However, the increasing complexity of heatmap models and the low accuracy of regression methods are becoming significant barriers to the advancement of the field. In recent years, researchers have begun exploring new methods to transfer knowledge from heatmap models to regression models. Recognizing the limitations of existing approaches, our study introduces a novel distillation model that is both lightweight and precise. In the feature extraction phase, we design the Channel-Attention-Unit (CAU), which integrates group convolution with an attention mechanism to effectively reduce redundancy while maintaining model accuracy with a decreased parameter count. During distillation, we develop the attention loss function, L_A, which enhances the model's capacity to locate key points quickly and accurately, emulating the effect of additional transformer layers and boosting precision without increasing parameters or network depth. Specifically, on the CrowdPose test dataset, our model achieves 71.7% mAP with 4.3M parameters, 2.2 GFLOPs, and 51.3 FPS. Experimental results demonstrate the model's strong accuracy and efficiency, making it a viable option for real-time pose estimation in real-world environments.
{"title":"A lightweight attention-driven distillation model for human pose estimation","authors":"Falai Wei, Xiaofang Hu","doi":"10.1016/j.patrec.2024.08.009","DOIUrl":"10.1016/j.patrec.2024.08.009","url":null,"abstract":"<div><p>Currently, research on human pose estimation tasks primarily focuses on heatmap-based and regression-based methods. However, the increasing complexity of heatmap models and the low accuracy of regression methods are becoming significant barriers to the advancement of the field. In recent years, researchers have begun exploring new methods to transfer knowledge from heatmap models to regression models. Recognizing the limitations of existing approaches, our study introduces a novel distillation model that is both lightweight and precise. In the feature extraction phase, we design the Channel-Attention-Unit (CAU), which integrates group convolution with an attention mechanism to effectively reduce redundancy while maintaining model accuracy with a decreased parameter count. During distillation, we develop the attention loss function, <span><math><msub><mrow><mi>L</mi></mrow><mrow><mi>A</mi></mrow></msub></math></span>, which enhances the model’s capacity to locate key points quickly and accurately, emulating the effect of additional transformer layers and boosting precision without the need for increased parameters or network depth. Specifically, on the CrowdPose test dataset, our model achieves 71.7% mAP with 4.3M parameters, 2.2 GFLOPs, and 51.3 FPS. Experimental results demonstrates the model’s strong capabilities in both accuracy and efficiency, making it a viable option for real-time posture estimation tasks in real-world environments.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"185 ","pages":"Pages 247-253"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097514","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
A semantic guidance-based fusion network for multi-label image classification
Pub Date: 2024-09-01 | DOI: 10.1016/j.patrec.2024.08.020
Jiuhang Wang, Hongying Tang, Shanshan Luo, Liqi Yang, Shusheng Liu, Aoping Hong, Baoqing Li
Multi-label image classification (MLIC), a fundamental task that assigns multiple labels to each image, has seen notable progress in recent years. Because objects co-occur in the physical world, modeling object correlations is crucial for enhancing classification accuracy. This involves accounting for spatial image feature correlation and label semantic correlation. However, existing methods struggle to establish these correlations due to complex spatial locations and label semantic relationships. Moreover, when fusing image feature relevance and label semantic relevance, existing methods typically learn a semantic representation only in the final CNN layer, even though different CNN layers capture features at diverse scales and possess distinct discriminative abilities. To address these issues, in this paper we introduce the Semantic Guidance-Based Fusion Network (SGFN) for MLIC. To model spatial image feature correlation, we leverage the advanced TResNet architecture as the backbone network and employ a Feature Aggregation Module to capture global spatial correlation. For label semantic correlation, we establish both local and global semantic correlation. We further enrich model features by learning semantic representations across multiple convolutional layers. Our method outperforms current state-of-the-art techniques on the PASCAL VOC (2007, 2012) and MS-COCO datasets.
{"title":"A semantic guidance-based fusion network for multi-label image classification","authors":"Jiuhang Wang , Hongying Tang , Shanshan Luo , Liqi Yang , Shusheng Liu , Aoping Hong , Baoqing Li","doi":"10.1016/j.patrec.2024.08.020","DOIUrl":"10.1016/j.patrec.2024.08.020","url":null,"abstract":"<div><p>Multi-label image classification (MLIC), a fundamental task assigning multiple labels to each image, has been seen notable progress in recent years. Considering simultaneous appearances of objects in the physical world, modeling object correlations is crucial for enhancing classification accuracy. This involves accounting for spatial image feature correlation and label semantic correlation. However, existing methods struggle to establish these correlations due to complex spatial location and label semantic relationships. On the other hand, regarding the fusion of image feature relevance and label semantic relevance, existing methods typically learn a semantic representation in the final CNN layer to combine spatial and label semantic correlations. However, different CNN layers capture features at diverse scales and possess distinct discriminative abilities. To address these issues, in this paper we introduce the Semantic Guidance-Based Fusion Network (SGFN) for MLIC. To model spatial image feature correlation, we leverage the advanced TResNet architecture as the backbone network and employ the Feature Aggregation Module for capturing global spatial correlation. For label semantic correlation, we establish both local and global semantic correlation. We further enrich model features by learning semantic representations across multiple convolutional layers. Our method outperforms current state-of-the-art techniques on PASCAL VOC (2007, 2012) and MS-COCO datasets.</p></div>","PeriodicalId":54638,"journal":{"name":"Pattern Recognition Letters","volume":"185 ","pages":"Pages 254-261"},"PeriodicalIF":3.9,"publicationDate":"2024-09-01","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142097515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":3,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}