AdvCloak: Customized adversarial cloak for privacy protection
Pub Date: 2024-10-10, DOI: 10.1016/j.patcog.2024.111050
Xuannan Liu, Yaoyao Zhong, Xing Cui, Yuhang Zhang, Peipei Li, Weihong Deng
With face images being shared extensively on social media, privacy concerns have escalated notably. In this paper, we propose AdvCloak, an innovative framework for privacy protection using generative models. AdvCloak automatically customizes class-wise adversarial masks that maintain superior image-level naturalness while providing enhanced feature-level generalization ability. Specifically, AdvCloak sequentially optimizes the generative adversarial networks through a two-stage training strategy: it first adapts the masks to the unique faces of each individual and then enhances their feature-level generalization to that individual's diverse facial variations. To fully utilize the limited training data, we combine AdvCloak with several general geometric modeling methods to better describe the feature subspace of source identities. Extensive quantitative and qualitative evaluations on both common and celebrity datasets demonstrate that AdvCloak outperforms existing state-of-the-art methods in terms of efficiency and effectiveness. The code is available at https://github.com/liuxuannan/AdvCloak.
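The core operation can be pictured as adding one identity-specific perturbation, bounded in L-infinity norm, to every photo of that identity so that a face-recognition embedder no longer matches the originals. The sketch below illustrates only this idea, not the authors' two-stage GAN training; the generator `G`, embedder `embed`, and budget `eps` are assumed placeholders.

```python
# Minimal sketch (not the authors' implementation): one class-wise adversarial mask
# is applied to every image of an identity so that a face-recognition embedder no
# longer matches the clean features. G, embed and eps are illustrative assumptions.
import torch
import torch.nn.functional as F

def cloak_identity(G, embed, images, eps=8 / 255):
    """images: (N, 3, H, W) face crops of ONE identity, values in [0, 1]."""
    # One mask per identity, e.g. generated from that identity's mean face.
    mask = torch.tanh(G(images.mean(dim=0, keepdim=True))) * eps   # (1, 3, H, W)
    cloaked = (images + mask).clamp(0.0, 1.0)                      # broadcast to all N images

    # Feature-level objective: push cloaked embeddings away from the clean ones.
    clean_feat = F.normalize(embed(images), dim=1)
    adv_feat = F.normalize(embed(cloaked), dim=1)
    loss = F.cosine_similarity(adv_feat, clean_feat, dim=1).mean() # minimize similarity
    return cloaked, loss
```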
{"title":"AdvCloak: Customized adversarial cloak for privacy protection","authors":"Xuannan Liu, Yaoyao Zhong, Xing Cui, Yuhang Zhang, Peipei Li, Weihong Deng","doi":"10.1016/j.patcog.2024.111050","DOIUrl":"10.1016/j.patcog.2024.111050","url":null,"abstract":"<div><div>With extensive face images being shared on social media, there has been a notable escalation in privacy concerns. In this paper, we propose AdvCloak, an innovative framework for privacy protection using generative models. AdvCloak is designed to automatically customize class-wise adversarial masks that can maintain superior image-level naturalness while providing enhanced feature-level generalization ability. Specifically, AdvCloak sequentially optimizes the generative adversarial networks by employing a two-stage training strategy. This strategy initially focuses on adapting the masks to the unique individual faces and then enhances their feature-level generalization ability to diverse facial variations of individuals. To fully utilize the limited training data, we combine AdvCloak with several general geometric modeling methods, to better describe the feature subspace of source identities. Extensive quantitative and qualitative evaluations on both common and celebrity datasets demonstrate that AdvCloak outperforms existing state-of-the-art methods in terms of efficiency and effectiveness. The code is available at <span><span>https://github.com/liuxuannan/AdvCloak</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111050"},"PeriodicalIF":7.5,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142537294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
CHA: Conditional Hyper-Adapter method for detecting human–object interaction
Pub Date: 2024-10-10, DOI: 10.1016/j.patcog.2024.111075
Mengyang Sun, Wei Suo, Ji Wang, Peng Wang, Yanning Zhang
Human–object interaction (HOI) detection aims to capture human–object pairs in images and predict their actions. It is an essential step for many visual reasoning tasks, such as VQA, image retrieval, and surveillance event detection. The challenge of this task is the compositional learning problem, especially in a few-shot setting. A straightforward approach is to design a dedicated model for each specific pair, but maintaining these independent models is unrealistic due to combinatorial explosion. To address these problems, we propose a new Conditional Hyper-Adapter (CHA) method based on meta-learning. Different from previous works, our approach regards each ⟨verb, object⟩ pair as an independent sub-task. Meanwhile, we design two kinds of Hyper-Adapter structures to guide the model to learn "how to address HOI detection". By combining the different conditions with a hypernetwork, CHA can adaptively generate partial parameters and improve the representation and generalization ability of the model. Finally, our proposed method can be viewed as a plug-and-play module that boosts existing HOI detection models on the widely used HOI benchmarks.
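To make the hypernetwork idea concrete, here is a minimal sketch, under my own assumptions about shapes and module names, of a conditional hyper-adapter: a small network maps a condition embedding (e.g., a verb or object embedding) to the weights of a low-rank residual adapter applied to backbone features. It is illustrative only, not the paper's architecture.

```python
# Minimal sketch (assumed names, not the paper's code): a conditional hyper-adapter.
import torch
import torch.nn as nn

class ConditionalHyperAdapter(nn.Module):
    def __init__(self, feat_dim=256, cond_dim=128, rank=16):
        super().__init__()
        self.feat_dim, self.rank = feat_dim, rank
        # Hypernetwork: condition embedding -> flattened down/up projection weights.
        self.hyper = nn.Sequential(
            nn.Linear(cond_dim, 512), nn.ReLU(),
            nn.Linear(512, 2 * feat_dim * rank),
        )

    def forward(self, feats, cond):
        # feats: (B, N, feat_dim) backbone tokens; cond: (B, cond_dim) condition embedding.
        w = self.hyper(cond)                                            # (B, 2 * feat_dim * rank)
        w_down = w[:, : self.feat_dim * self.rank].view(-1, self.feat_dim, self.rank)
        w_up = w[:, self.feat_dim * self.rank:].view(-1, self.rank, self.feat_dim)
        # Residual adapter whose parameters are generated per condition.
        return feats + torch.relu(feats @ w_down) @ w_up

adapter = ConditionalHyperAdapter()
out = adapter(torch.randn(2, 50, 256), torch.randn(2, 128))             # (2, 50, 256)
```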
{"title":"CHA: Conditional Hyper-Adapter method for detecting human–object interaction","authors":"Mengyang Sun , Wei Suo , Ji Wang , Peng Wang , Yanning Zhang","doi":"10.1016/j.patcog.2024.111075","DOIUrl":"10.1016/j.patcog.2024.111075","url":null,"abstract":"<div><div>Human–object interactions (HOI) detection aims at capturing human–object pairs in images and predicting their actions. It is an essential step for many visual reasoning tasks, such as VQA, image retrieval and surveillance event detection. The challenge of this task is to tackle the compositional learning problem, especially in a few-shot setting. A straightforward approach is designing a group of dedicated models for each specific pair. However, the maintenance of these independent models is unrealistic due to combinatorial explosion. To address the above problems, we propose a new Conditional Hyper-Adapter (CHA) method based on meta-learning. Different from previous works, our approach regards each <span><math><mo><</mo></math></span>verb, object<span><math><mo>></mo></math></span> as an independent sub-task. Meanwhile, we design two kinds of Hyper-Adapter structures to guide the model to learn “how to address the HOI detection”. By combining the different conditions and hypernetwork, the CHA can adaptively generate partial parameters and improve the representation and generalization ability of the model. Finally, our proposed method can be viewed as a plug-and-play module to boost existing HOI detection models on the widely used HOI benchmarks.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"159 ","pages":"Article 111075"},"PeriodicalIF":7.5,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142530431","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Semantic-aware frame-event fusion based pattern recognition via large vision–language models
Pub Date: 2024-10-10, DOI: 10.1016/j.patcog.2024.111080
Dong Li, Jiandong Jin, Yuhao Zhang, Yanlin Zhong, Yaoyang Wu, Lan Chen, Xiao Wang, Bin Luo
Pattern recognition through the fusion of RGB frames and event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from two key issues: (1) they attempt to directly learn a mapping from the input vision modality to the semantic labels, which often leads to sub-optimal results due to the disparity between the inputs and the semantic labels; (2) they utilize small-scale backbone networks to extract RGB and event features, and thus fail to harness the recent performance advancements of large-scale vision–language models. In this study, we introduce a novel pattern recognition framework that consolidates semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision–language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (the CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we first convert them into language descriptions through prompt engineering, polish them using ChatGPT, and then obtain the semantic features using the pre-trained large-scale language model (the CLIP text encoder). Subsequently, we integrate the RGB/event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code has been released at https://github.com/Event-AHU/SAFE_LargeVLM.
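A rough shape of the cross-attention fusion stage is sketched below with generic PyTorch modules; the encoders are abstracted away, and the dimensions, the 300-class output, and the module names are my assumptions rather than the released SAFE code.

```python
# Minimal sketch of the fusion idea (not the released SAFE code): text tokens attend to
# RGB/event tokens via cross-attention, then all token sets are merged with
# self-attention, pooled, and classified.
import torch
import torch.nn as nn

class FrameEventTextFusion(nn.Module):
    def __init__(self, dim=512, heads=8, num_classes=300):
        super().__init__()
        self.cross_attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.self_attn = nn.TransformerEncoderLayer(dim, heads, batch_first=True)
        self.head = nn.Linear(dim, num_classes)

    def forward(self, rgb_tokens, event_tokens, text_tokens):
        # Each token set: (B, N_*, dim), e.g. produced by CLIP vision/text encoders.
        vis = torch.cat([rgb_tokens, event_tokens], dim=1)
        # Text queries attend to visual keys/values (cross-attention).
        text_enh, _ = self.cross_attn(text_tokens, vis, vis)
        # Consolidate all modalities with self-attention, then pool and classify.
        fused = self.self_attn(torch.cat([vis, text_enh], dim=1))
        return self.head(fused.mean(dim=1))

logits = FrameEventTextFusion()(torch.randn(2, 196, 512),
                                torch.randn(2, 196, 512),
                                torch.randn(2, 300, 512))
```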
{"title":"Semantic-aware frame-event fusion based pattern recognition via large vision–language models","authors":"Dong Li , Jiandong Jin , Yuhao Zhang , Yanlin Zhong , Yaoyang Wu , Lan Chen , Xiao Wang , Bin Luo","doi":"10.1016/j.patcog.2024.111080","DOIUrl":"10.1016/j.patcog.2024.111080","url":null,"abstract":"<div><div>Pattern recognition through the fusion of RGB frames and Event streams has emerged as a novel research area in recent years. Current methods typically employ backbone networks to individually extract the features of RGB frames and event streams, and subsequently fuse these features for pattern recognition. However, we posit that these methods may suffer from two key issues: (1). They attempt to directly learn a mapping from the input vision modality to the semantic labels. This approach often leads to sub-optimal results due to the disparity between the input and semantic labels; (2). They utilize small-scale backbone networks for the extraction of RGB and Event input features, thus these models fail to harness the recent performance advancements of large-scale visual-language models. In this study, we introduce a novel pattern recognition framework that consolidates the semantic labels, RGB frames, and event streams, leveraging pre-trained large-scale vision–language models. Specifically, given the input RGB frames, event streams, and all the predefined semantic labels, we employ a pre-trained large-scale vision model (CLIP vision encoder) to extract the RGB and event features. To handle the semantic labels, we initially convert them into language descriptions through prompt engineering and polish using ChatGPT, and then obtain the semantic features using the pre-trained large-scale language model (CLIP text encoder). Subsequently, we integrate the RGB/Event features and semantic features using multimodal Transformer networks. The resulting frame and event tokens are further amplified using self-attention layers. Concurrently, we propose to enhance the interactions between text tokens and RGB/Event tokens via cross-attention. Finally, we consolidate all three modalities using self-attention and feed-forward layers for recognition. Comprehensive experiments on the HARDVS and PokerEvent datasets fully substantiate the efficacy of our proposed SAFE model. The source code has been released at <span><span>https://github.com/Event-AHU/SAFE_LargeVLM</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111080"},"PeriodicalIF":7.5,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142537251","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Perspective-assisted prototype-based learning for semi-supervised crowd counting
Pub Date: 2024-10-10, DOI: 10.1016/j.patcog.2024.111073
Yifei Qian, Liangfei Zhang, Zhongliang Guo, Xiaopeng Hong, Ognjen Arandjelović, Carl R. Donovan
To alleviate the burden of labeling data to train crowd counting models, we propose a prototype-based learning approach for semi-supervised crowd counting with an embedded understanding of perspective. Our key idea is that image patches with the same density of people are likely to exhibit coherent appearance changes under similar perspective distortion, but differ significantly under varying distortions. Motivated by this observation, we construct multiple prototypes for each density level to capture variations in perspective. For labeled data, the prototype-based learning assists the regression task by regularizing the feature space and modeling the relationships within and across different density levels. For unlabeled data, the learnt perspective-embedded prototypes enhance differentiation between samples of the same density level, allowing for a more nuanced assessment of the predictions. By incorporating the regression results, we categorize unlabeled samples as reliable or unreliable and apply tailored consistency learning strategies to enhance model accuracy and generalization. Since perspective information is often unavailable, we propose a novel pseudo-label assigner based on perspective self-organization; it requires no additional annotations and assigns image regions to distinct spatial density groups that mainly reflect the differences in average density among regions. Extensive experiments on four crowd counting benchmarks demonstrate the effectiveness of our approach.
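The following sketch shows, under assumed shapes and names (it is not the authors' code), how patch features might be assigned to the nearest of several prototypes per density level, and how labeled patches can be pulled toward the prototypes of their ground-truth level.

```python
# Minimal sketch: multiple prototypes per density level; patches are assigned to the
# nearest prototype, and labeled patches are trained toward their level's prototypes.
import torch
import torch.nn.functional as F

def prototype_assign(feats, prototypes):
    """feats: (N, D) patch features; prototypes: (L, K, D), K prototypes per density level.
    Returns the (level, prototype-within-level) index of the nearest prototype per patch."""
    L, K, D = prototypes.shape
    sim = F.normalize(feats, dim=1) @ F.normalize(prototypes.view(L * K, D), dim=1).t()
    idx = sim.argmax(dim=1)                       # (N,) flat index into the L*K prototypes
    return idx // K, idx % K

def labeled_prototype_loss(feats, levels, prototypes, tau=0.1):
    """Cross-entropy over all prototypes, supervised by the ground-truth density level."""
    L, K, D = prototypes.shape
    sim = F.normalize(feats, dim=1) @ F.normalize(prototypes.view(L * K, D), dim=1).t()
    # The best-matching prototype of the correct level acts as the positive class.
    best_k = sim.view(-1, L, K)[torch.arange(len(feats)), levels].argmax(dim=1)
    return F.cross_entropy(sim / tau, levels * K + best_k)
```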
{"title":"Perspective-assisted prototype-based learning for semi-supervised crowd counting","authors":"Yifei Qian , Liangfei Zhang , Zhongliang Guo , Xiaopeng Hong , Ognjen Arandjelović , Carl R. Donovan","doi":"10.1016/j.patcog.2024.111073","DOIUrl":"10.1016/j.patcog.2024.111073","url":null,"abstract":"<div><div>To alleviate the burden of labeling data to train crowd counting models, we propose a prototype-based learning approach for semi-supervised crowd counting with an embeded understanding of perspective. Our key idea is that image patches with the same density of people are likely to exhibit coherent appearance changes under similar perspective distortion, but differ significantly under varying distortions. Motivated by this observation, we construct multiple prototypes for each density level to capture variations in perspective. For labeled data, the prototype-based learning assists the regression task by regularizing the feature space and modeling the relationships within and across different density levels. For unlabeled data, the learnt perspective-embedded prototypes enhance differentiation between samples of the same density levels, allowing for a more nuanced assessment of the predictions. By incorporating regression results, we categorize unlabeled samples as reliable or unreliable, applying tailored consistency learning strategies to enhance model accuracy and generalization. Since the perspective information is often unavailable, we propose a novel pseudo-label assigner based on perspective self-organization which requires no additional annotations and assigns image regions to distinct spatial density groups, which mainly reflect the differences in average density among regions. Extensive experiments on four crowd counting benchmarks demonstrate the effectiveness of our approach.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111073"},"PeriodicalIF":7.5,"publicationDate":"2024-10-10","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142537273","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pixel shuffling is all you need: Spatially aware ConvMixer for dense prediction tasks
Pub Date: 2024-10-09, DOI: 10.1016/j.patcog.2024.111068
Hatem Ibrahem, Ahmed Salem, Hyun-Soo Kang
ConvMixer is an extremely simple model that can outperform state-of-the-art convolution-based and vision-transformer-based methods thanks to mixing the input image patches using standard convolutions. The global mixing process of the patches is only valid for classification tasks; it cannot be used for dense prediction tasks because the spatial information of the image is lost in the mixing process. We propose a more efficient technique for image patching, known as pixel shuffling, as it can preserve spatial information. We downsample the input image using pixel shuffle downsampling in the same form as image patches so that ConvMixer can be extended to dense prediction tasks. This paper shows that pixel shuffle downsampling is more efficient than standard image patching, as it outperforms the original ConvMixer architecture on the CIFAR10 and ImageNet-1k classification tasks. We also propose spatially-aware ConvMixer (SA-ConvMixer) architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets: Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, and NYU-Depth V2 and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to reach relatively high accuracy on many tasks within a few training epochs (150∼400). The proposed SA-ConvMixer achieves an ImageNet-1K Top-1 classification accuracy of 87.02%, a mean intersection over union (mIoU) of 87.1% on the PASCAL VOC2012 semantic segmentation task, and an absolute relative error of 0.096 on the NYU-Depth V2 depth estimation task. The implementation code of the proposed method is available at: https://github.com/HatemHosam/SA-ConvMixer/.
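To illustrate the pixel-shuffle patching idea, here is a minimal, assumption-laden PyTorch sketch (not the released code): the strided patch-embedding stem is replaced by PixelUnshuffle, ConvMixer blocks mix the result, and PixelShuffle restores full resolution for dense prediction. All widths, depths, and class counts are illustrative.

```python
# Minimal sketch: pixel-shuffle downsampling + ConvMixer blocks + pixel-shuffle upsampling.
import torch
import torch.nn as nn

class Residual(nn.Module):
    def __init__(self, fn):
        super().__init__()
        self.fn = fn
    def forward(self, x):
        return x + self.fn(x)

def convmixer_block(dim, kernel_size=9):
    return nn.Sequential(
        # Depthwise (spatial) mixing with a residual connection.
        Residual(nn.Sequential(
            nn.Conv2d(dim, dim, kernel_size, groups=dim, padding="same"),
            nn.GELU(), nn.BatchNorm2d(dim))),
        # Pointwise (channel) mixing.
        nn.Conv2d(dim, dim, kernel_size=1), nn.GELU(), nn.BatchNorm2d(dim))

class SAConvMixerSeg(nn.Module):
    """Toy segmentation variant: downsample by r with PixelUnshuffle, mix, upsample back."""
    def __init__(self, dim=256, depth=8, r=4, num_classes=21, in_ch=3):
        super().__init__()
        self.down = nn.Sequential(nn.PixelUnshuffle(r), nn.Conv2d(in_ch * r * r, dim, 1))
        self.mixer = nn.Sequential(*[convmixer_block(dim) for _ in range(depth)])
        self.up = nn.Sequential(nn.Conv2d(dim, num_classes * r * r, 1), nn.PixelShuffle(r))
    def forward(self, x):
        return self.up(self.mixer(self.down(x)))

logits = SAConvMixerSeg()(torch.randn(1, 3, 128, 128))   # (1, 21, 128, 128)
```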
{"title":"Pixel shuffling is all you need: spatially aware convmixer for dense prediction tasks","authors":"Hatem Ibrahem, Ahmed Salem, Hyun-Soo Kang","doi":"10.1016/j.patcog.2024.111068","DOIUrl":"10.1016/j.patcog.2024.111068","url":null,"abstract":"<div><div>ConvMixer is an extremely simple model that could perform better than the state-of-the-art convolutional-based and vision transformer-based methods thanks to mixing the input image patches using a standard convolution. The global mixing process of the patches is only valid for the classification tasks, but it cannot be used for dense prediction tasks as the spatial information of the image is lost in the mixing process. We propose a more efficient technique for image patching, known as pixel shuffling, as it can preserve spatial information. We downsample the input image using the pixel shuffle downsampling in the same form of image patches so that the ConvMixer can be extended for the dense prediction tasks. This paper proves that pixel shuffle downsampling is more efficient than the standard image patching as it outperforms the original ConvMixer architecture in the CIFAR10 and ImageNet-1k classification tasks. We also suggest spatially-aware ConvMixer architectures based on efficient pixel shuffle downsampling and upsampling operations for semantic segmentation and monocular depth estimation. We performed extensive experiments to test the proposed architectures on several datasets; Pascal VOC2012, Cityscapes, and ADE20k for semantic segmentation, NYU-depthV2, and Cityscapes for depth estimation. We show that SA-ConvMixer is efficient enough to get relatively high accuracy at many tasks in a few training epochs (150<span><math><mo>∼</mo></math></span>400). The proposed SA-ConvMixer could achieve an ImageNet-1K Top-1 classification accuracy of 87.02%, mean intersection over union (mIOU) of 87.1% in the PASCAL VOC2012 semantic segmentation task, and absolute relative error of 0.096 in the NYU depthv2 depth estimation task. The implementation code of the proposed method is available at: <span><span>https://github.com/HatemHosam/SA-ConvMixer/</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111068"},"PeriodicalIF":7.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142537271","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Diffusion process with structural changes for subspace clustering
Pub Date: 2024-10-09, DOI: 10.1016/j.patcog.2024.111066
Yanjiao Zhu, Qilin Li, Wanquan Liu, Chuancun Yin
Spectral clustering-based methods have gained significant popularity in subspace clustering due to their ability to capture the underlying data structure effectively. Standard spectral clustering focuses only on pairwise relationships between data points, neglecting interactions among high-order neighboring points. Integrating a diffusion process can address this limitation by leveraging a Markov random walk. However, ensuring that diffusion methods capture sufficient information while maintaining stability against noise remains challenging. In this paper, we propose the Diffusion Process with Structural Changes (DPSC) method, a novel affinity learning framework that enhances the robustness of the diffusion process. Our approach broadens the scope of nearest neighbors and leverages the dropout idea to generate random transition matrices. Furthermore, inspired by the structural changes model, we use two transition matrices to optimize the iteration rule. The resulting affinity matrix undergoes self-supervised learning and is subsequently integrated back into the diffusion process for refinement. Notably, the convergence of the proposed DPSC is theoretically proven. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms existing subspace clustering methods. The code of our proposed DPSC is available at https://github.com/zhudafa/DPSC.
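As a rough, heavily hedged illustration of diffusing an affinity matrix with two randomly perturbed transition matrices, the NumPy sketch below uses a kNN-sparsified, dropout-perturbed random walk and the update A ← P₁ A P₂ᵀ + I; the function names, dropout rate, and update rule are my assumptions, not the paper's exact formulation.

```python
# Minimal sketch of affinity diffusion with two random ("structurally changed") transition matrices.
import numpy as np

def knn_transition(W, k=10, drop=0.0, rng=None):
    """Row-stochastic transition matrix from a kNN-sparsified, optionally dropout-ed affinity."""
    rng = np.random.default_rng() if rng is None else rng
    P = np.zeros_like(W)
    idx = np.argsort(-W, axis=1)[:, :k]                 # k strongest neighbors per row
    rows = np.arange(W.shape[0])[:, None]
    P[rows, idx] = W[rows, idx]
    if drop > 0:                                        # randomly drop edges ("dropout")
        P *= rng.random(P.shape) > drop
    return P / np.maximum(P.sum(axis=1, keepdims=True), 1e-12)

def diffuse(W, k=10, drop=0.2, iters=20, seed=0):
    rng = np.random.default_rng(seed)
    A = W.copy()
    for _ in range(iters):
        P1, P2 = knn_transition(W, k, drop, rng), knn_transition(W, k, drop, rng)
        A = P1 @ A @ P2.T + np.eye(len(W))              # diffuse, then re-anchor with identity
    return (A + A.T) / 2                                # symmetrize for spectral clustering
```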
{"title":"Diffusion process with structural changes for subspace clustering","authors":"Yanjiao Zhu , Qilin Li , Wanquan Liu , Chuancun Yin","doi":"10.1016/j.patcog.2024.111066","DOIUrl":"10.1016/j.patcog.2024.111066","url":null,"abstract":"<div><div>Spectral clustering-based methods have gained significant popularity in subspace clustering due to their ability to capture the underlying data structure effectively. Standard spectral clustering focuses on only pairwise relationships between data points, neglecting interactions among high-order neighboring points. Integrating the diffusion process can address this limitation by leveraging a Markov random walk. However, ensuring that diffusion methods capture sufficient information while maintaining stability against noise remains challenging. In this paper, we propose the Diffusion Process with Structural Changes (DPSC) method, a novel affinity learning framework that enhances the robustness of the diffusion process. Our approach broadens the scope of nearest neighbors and leverages the dropout idea to generate random transition matrices. Furthermore, inspired by the structural changes model, we use two transition matrices to optimize the iteration rule. The resulting affinity matrix undergoes self-supervised learning and is subsequently integrated back into the diffusion process for refinement. Notably, the convergence of the proposed DPSC is theoretically proven. Extensive experiments on benchmark datasets demonstrate that the proposed method outperforms existing subspace clustering methods. The code of our proposed DPSC is available at <span><span>https://github.com/zhudafa/DPSC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111066"},"PeriodicalIF":7.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142425150","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Class agnostic and specific consistency learning for weakly-supervised point cloud semantic segmentation
Pub Date: 2024-10-09, DOI: 10.1016/j.patcog.2024.111067
Junwei Wu, Mingjie Sun, Haotian Xu, Chenru Jiang, Wuwei Ma, Quan Zhang
This paper focuses on Weakly Supervised 3D Point Cloud Semantic Segmentation (WS3DSS), which involves annotating only a few points while leaving a large number of points unlabeled in the training sample. Existing methods roughly force point-to-point predictions across different augmented versions of an input to be close to each other. In contrast, this paper introduces a carefully designed approach for learning class-agnostic and class-specific consistency based on the teacher–student framework. The proposed class-agnostic consistency learning brings the features of the student and teacher models closer together and enhances model robustness by replacing the traditional point-to-point prediction consistency with group-to-group consistency computed over the features of perturbed local neighboring points. Furthermore, to facilitate learning under class-wise supervision, we propose a class-specific consistency learning method that pulls the feature of an unlabeled point towards its corresponding class-specific memory bank feature, where the class of the unlabeled point is taken to be the one with the highest probability predicted by the classifier. Extensive experimental results demonstrate that our proposed method surpasses the SOTA method SQN (Hu et al., 2022) by 2.5% and 8.3% on the S3DIS dataset, and by 4.4% and 13.9% on the ScanNetV2 dataset, under the 0.1% and 0.01% annotation settings, respectively. Code is available at https://github.com/jasonwjw/CASC.
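The two losses can be pictured as follows; this sketch uses assumed tensor layouts and is not the released CASC code. Group-to-group consistency averages features over each point's local neighborhood before comparing student and teacher, and class-specific consistency pulls each unlabeled point toward the memory-bank feature of its highest-probability class.

```python
# Minimal sketch of group-to-group and class-specific consistency losses.
import torch
import torch.nn.functional as F

def group_to_group_consistency(stu_feat, tea_feat, knn_idx):
    """stu_feat/tea_feat: (N, D) per-point features; knn_idx: (N, K) neighbor indices."""
    stu_group = stu_feat[knn_idx].mean(dim=1)          # (N, D) neighborhood averages
    tea_group = tea_feat[knn_idx].mean(dim=1)
    return 1 - F.cosine_similarity(stu_group, tea_group.detach(), dim=1).mean()

def class_specific_pull(stu_feat, logits, memory_bank, unlabeled_mask):
    """memory_bank: (C, D), one running feature per class; unlabeled_mask: (N,) bool."""
    pseudo = logits.argmax(dim=1)                      # class with the highest probability
    target = memory_bank[pseudo]                       # (N, D) class-specific targets
    loss = 1 - F.cosine_similarity(stu_feat, target.detach(), dim=1)
    return loss[unlabeled_mask].mean()
```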
{"title":"Class agnostic and specific consistency learning for weakly-supervised point cloud semantic segmentation","authors":"Junwei Wu , Mingjie Sun , Haotian Xu , Chenru Jiang , Wuwei Ma , Quan Zhang","doi":"10.1016/j.patcog.2024.111067","DOIUrl":"10.1016/j.patcog.2024.111067","url":null,"abstract":"<div><div>This paper focuses on Weakly Supervised 3D Point Cloud Semantic Segmentation (WS3DSS), which involves annotating only a few points while leaving a large number of points unlabeled in the training sample. Existing methods roughly force point-to-point predictions across different augmented versions of inputs close to each other. While this paper introduces a carefully-designed approach for learning class agnostic and specific consistency, based on the teacher–student framework. The proposed class-agnostic consistency learning, to bring the features of student and teacher models closer together, enhances the model robustness by replacing the traditional point-to-point prediction consistency with the group-to-group consistency based on the perturbed local neighboring points’ features. Furthermore, to facilitate learning under class-wise supervisions, we propose a class-specific consistency learning method, pulling the feature of the unlabeled point towards its corresponding class-specific memory bank feature. Such a class of the unlabeled point is determined as the one with the highest probability predicted by the classifier. Extensive experimental results demonstrate that our proposed method surpasses the SOTA method SQN (Huet al., 2022) by 2.5% and 8.3% on S3DIS dataset, and 4.4% and 13.9% on ScanNetV2 dataset, on the 0.1% and 0.01% settings, respectively. Code is available at <span><span>https://github.com/jasonwjw/CASC</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111067"},"PeriodicalIF":7.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142425153","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Dual-perspective multi-instance embedding learning with adaptive density distribution mining
Pub Date: 2024-10-09, DOI: 10.1016/j.patcog.2024.111063
Mei Yang, Tian-Lin Chen, Wei-Zhi Wu, Wen-Xi Zeng, Jing-Yu Zhang, Fan Min
Multi-instance learning (MIL) is a potent framework for solving weakly supervised problems in which bags contain multiple instances. Various embedding methods convert each bag into a vector in a new feature space based on a representative bag or instance, aiming to extract useful information from the bag. However, since the distribution of instances is related to the labels, these methods rely solely on an overall-perspective embedding without considering different distribution characteristics, which conflates the varied distributions of instances and thus leads to poor classification performance. In this paper, we propose the dual-perspective multi-instance embedding learning with adaptive density distribution mining (DPMIL) algorithm, which introduces three new techniques. First, the mutual instance selection technique consists of adaptive density distribution mining and discriminative evaluation; the distribution characteristics of negative instances and heterogeneous instance dissimilarity are effectively exploited to obtain instances with strong representativeness. Second, the embedding technique mines two crucial kinds of bag information simultaneously: bags are converted into sequence-invariant vectors according to the dual perspective so that distinguishability is maintained. Finally, the ensemble technique trains a batch of classifiers, and the final model is obtained by weighted voting with the contribution of the dual-perspective embedding information. The experimental results demonstrate that the DPMIL algorithm achieves higher average accuracy than other compared algorithms, especially on web datasets.
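As a generic illustration of instance-based bag embedding (not DPMIL's specific dual-perspective construction), the sketch below turns each bag into an order-invariant vector of its minimum distances to a set of representative instances; all names, shapes, and the random representative selection are assumptions.

```python
# Minimal NumPy sketch of embedding bags via distances to representative instances.
import numpy as np

def embed_bag(bag, representatives):
    """bag: (n_i, d) instances of one bag; representatives: (m, d) selected instances.
    Returns an (m,) vector that is invariant to the order of instances inside the bag."""
    d = np.linalg.norm(bag[:, None, :] - representatives[None, :, :], axis=2)  # (n_i, m)
    return d.min(axis=0)

# Toy usage: 50 bags of 3-8 instances each, 20 randomly chosen representatives.
bags = [np.random.randn(np.random.randint(3, 9), 16) for _ in range(50)]
all_instances = np.vstack(bags)
reps = all_instances[np.random.choice(len(all_instances), 20, replace=False)]
X = np.stack([embed_bag(b, reps) for b in bags])   # (50, 20), ready for any classifier
```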
{"title":"Dual-perspective multi-instance embedding learning with adaptive density distribution mining","authors":"Mei Yang , Tian-Lin Chen , Wei-Zhi Wu , Wen-Xi Zeng , Jing-Yu Zhang , Fan Min","doi":"10.1016/j.patcog.2024.111063","DOIUrl":"10.1016/j.patcog.2024.111063","url":null,"abstract":"<div><div>Multi-instance learning (MIL) is a potent framework for solving weakly supervised problems, with bags containing multiple instances. Various embedding methods convert each bag into a vector in the new feature space based on a representative bag or instance, aiming to extract useful information from the bag. However, since the distribution of instances is related to labels, these methods rely solely on the overall perspective embedding without considering the different distribution characteristics, which will conflate the varied distributions of instances and thus lead to poor classification performance. In this paper, we propose the dual-perspective multi-instance embedding learning with adaptive density distribution mining (DPMIL) algorithm with three new techniques. First, the mutual instance selection technique consists of adaptive density distribution mining and discriminative evaluation. The distribution characteristics of negative instances and heterogeneous instance dissimilarity are effectively exploited to obtain instances with strong representativeness. Second, the embedding technique mines two crucial information of the bag simultaneously. Bags are converted into sequence invariant vectors according to the dual-perspective such that the distinguishability is maintained. Finally, the ensemble technique trains a batch of classifiers. The final model is obtained by weighted voting with the contribution of the dual-perspective embedding information. The experimental results demonstrate that the DPMIL algorithm has higher average accuracy than other compared algorithms, especially on web datasets.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111063"},"PeriodicalIF":7.5,"publicationDate":"2024-10-09","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142432515","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
SNN using color-opponent and attention mechanisms for object recognition
Pub Date: 2024-10-05, DOI: 10.1016/j.patcog.2024.111070
Zhiwei Yao, Shaobing Gao, Wenjuan Li
Current spiking neural networks (SNNs) rely on spike-timing-dependent plasticity (STDP) primarily for shape learning in object recognition tasks, overlooking the equally critical aspect of color information. To address this gap, our study introduces an unsupervised variant of STDP that incorporates principles from the color-opponency mechanisms (COM) and classical receptive fields (CRF) found in the biological visual system, facilitating the integration of color information during parameter updates within the SNN architecture. Our approach first preprocesses images into two distinct feature maps: one for shape and another for color. Then, signals derived from COM and intensity concurrently drive the STDP process, thereby updating parameters associated with both the color and shape feature maps. Furthermore, we propose a channel-wise attention mechanism to enhance differentiation among objects sharing similar shapes or colors. Specifically, this mechanism uses convolution to generate an output spike wave and identifies a winner based on the earliest spike timing and maximal potential. The winning kernel computes attention, which is then applied via convolution to each input feature map, generating post-feature maps. An STDP-like normalization rule compares firing times between pre- and post-feature maps, dynamically adjusting channel weights to optimize object recognition during the training phase.
We assessed the proposed algorithm using SNNs with both single-layer and multi-layer architectures across three datasets. Experimental findings highlight its efficacy and superiority in complex object recognition tasks compared to state-of-the-art (SOTA) algorithms. Notably, our approach achieved a significant 20% performance improvement over the SOTA on the Caltech-101 dataset. Moreover, the algorithm is well-suited for hardware implementation and energy efficiency, leveraging a winner-selection mechanism based on the earliest spike time.
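For readers unfamiliar with STDP, the sketch below shows one common simplified, pair-based update used in convolutional SNNs for object recognition: potentiate when the presynaptic spike precedes the postsynaptic one, depress otherwise, with a soft weight bound. It is a generic textbook illustration, not the paper's color-opponent variant; all parameter names are assumptions.

```python
# Minimal sketch of a simplified pair-based STDP weight update with soft bounds.
import numpy as np

def stdp_update(w, t_pre, t_post, a_plus=0.01, a_minus=0.012):
    """w: (n_post, n_pre) weights in [0, 1]; t_pre: (n_pre,) and t_post: (n_post,) spike times.
    Pre-before-post pairs are potentiated, post-before-pre pairs are depressed."""
    dt = t_post[:, None] - t_pre[None, :]                 # (n_post, n_pre) timing differences
    sign = np.where(dt >= 0, a_plus, -a_minus)            # potentiation vs. depression
    dw = sign * w * (1.0 - w)                             # soft bound keeps weights in (0, 1)
    return np.clip(w + dw, 0.0, 1.0)
```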
MBQuant: A novel multi-branch topology method for arbitrary bit-width network quantization
Pub Date: 2024-10-05, DOI: 10.1016/j.patcog.2024.111061
Yunshan Zhong, Yuyao Zhou, Fei Chao, Rongrong Ji
Arbitrary bit-width network quantization has received significant attention due to its high adaptability to various bit-width requirements at runtime. However, in this paper we investigate existing methods and observe a significant accumulation of quantization errors caused by switching weight and activation bit-widths, leading to limited performance. To address this issue, we propose MBQuant, a novel method that utilizes a multi-branch topology for arbitrary bit-width quantization. MBQuant duplicates the network body into multiple independent branches, where the weights of each branch are quantized to a fixed 2 bits and the activations remain in the input bit-width. To complete the computation for a desired bit-width, MBQuant selects multiple branches, ensuring that the computational cost matches that of the desired bit-width, and carries out forward propagation with them. By fixing the weight bit-width, MBQuant substantially reduces the quantization errors caused by switching weight bit-widths. Additionally, we observe that the first branch suffers from quantization errors caused by all bit-widths, leading to performance degradation. We therefore introduce an amortization branch selection strategy: the first branch is selected only for certain bit-widths, rather than universally, so the errors are distributed among the branches more evenly. Finally, we adopt an in-place distillation strategy in which the largest bit-width guides the other bit-widths to further enhance MBQuant's performance. Extensive experiments demonstrate that MBQuant achieves significant performance gains compared to existing arbitrary bit-width quantization methods. Code is made publicly available at https://github.com/zysxmu/MBQuant.
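The multi-branch idea can be pictured roughly as follows. This is my own hedged reading with assumed names (a toy linear layer, a symmetric 2-bit quantizer with a straight-through estimator, and averaging as the branch-combination rule), not the released MBQuant code: every branch keeps fixed 2-bit weights, and a k-bit forward pass runs k/2 of the branches so the cost scales with the requested bit-width.

```python
# Minimal sketch of a multi-branch layer with fixed 2-bit weights per branch.
import torch
import torch.nn as nn
import torch.nn.functional as F

def quantize_2bit(w):
    """Symmetric uniform 2-bit quantizer with a straight-through estimator."""
    scale = w.abs().max() / 2 + 1e-12
    q = torch.clamp(torch.round(w / scale), -2, 1)   # signed 2-bit integer grid
    return (q * scale - w).detach() + w              # STE: quantized forward, identity backward

class MultiBranchLinear(nn.Module):
    def __init__(self, in_f, out_f, max_bits=8):
        super().__init__()
        self.branches = nn.ModuleList(nn.Linear(in_f, out_f) for _ in range(max_bits // 2))

    def forward(self, x, bits=4):
        n = bits // 2                                 # number of 2-bit branches to run
        outs = [F.linear(x, quantize_2bit(b.weight), b.bias) for b in self.branches[:n]]
        return torch.stack(outs).mean(dim=0)          # combine the selected branches

y = MultiBranchLinear(128, 64)(torch.randn(4, 128), bits=6)   # uses 3 of the 4 branches
```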
{"title":"MBQuant: A novel multi-branch topology method for arbitrary bit-width network quantization","authors":"Yunshan Zhong , Yuyao Zhou , Fei Chao , Rongrong Ji","doi":"10.1016/j.patcog.2024.111061","DOIUrl":"10.1016/j.patcog.2024.111061","url":null,"abstract":"<div><div>Arbitrary bit-width network quantization has received significant attention due to its high adaptability to various bit-width requirements during runtime. However, in this paper, we investigate existing methods and observe a significant accumulation of quantization errors caused by switching weight and activations bit-widths, leading to limited performance. To address this issue, we propose MBQuant, a novel method that utilizes a multi-branch topology for arbitrary bit-width quantization. MBQuant duplicates the network body into multiple independent branches, where the weights of each branch are quantized to a fixed 2-bit and the activations remain in the input bit-width. For completing the computation of a desired bit-width, MBQuant selects multiple branches, ensuring that the computational costs match those of the desired bit-width, to carry out forward propagation. By fixing the weight bit-width, MBQuant substantially reduces quantization errors caused by switching weight bit-widths. Additionally, we observe that the first branch suffers from quantization errors caused by all bit-widths, leading to performance degradation. Thus, we introduce an amortization branch selection strategy that amortizes the errors. Specifically, the first branch is selected only for certain bit-widths, rather than universally, thereby the errors are distributed among the branches more evenly. Finally, we adopt an in-place distillation strategy that uses the largest bit-width to guide the other bit-widths to further enhance MBQuant’s performance. Extensive experiments demonstrate that MBQuant achieves significant performance gains compared to existing arbitrary bit-width quantization methods. Code is made publicly available at <span><span>https://github.com/zysxmu/MBQuant</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":49713,"journal":{"name":"Pattern Recognition","volume":"158 ","pages":"Article 111061"},"PeriodicalIF":7.5,"publicationDate":"2024-10-05","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142537295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":1,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}