Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128763
Cunhan Guo, Heyan Huang
Camouflaged Object Detection (COD) is a critical task in computer vision aimed at identifying concealed objects, with applications spanning military, industrial, medical, and monitoring domains. To address the problem of poor segmentation of fine details, we introduce a novel method for camouflaged object detection, named CoFiNet. Our approach focuses on multi-scale feature fusion and extraction, with special attention to the model’s ability to segment detailed structures, enhancing its ability to detect camouflaged objects effectively. CoFiNet adopts a coarse-to-fine strategy. A multi-scale feature integration module is leveraged to enhance the model’s capability to fuse contextual features. A multi-activation selective kernel module grants the model the ability to autonomously alter its receptive field, enabling it to choose an appropriate receptive field for camouflaged objects of different sizes. During mask generation, we employ a dual-mask strategy for image segmentation, separating the reconstruction of coarse and fine masks, which significantly enhances the model’s capacity to learn details. Comprehensive experiments were conducted on four different datasets, demonstrating that CoFiNet achieves state-of-the-art performance across all of them. These results underscore its effectiveness in camouflaged object detection and highlight its potential in various practical application scenarios.
{"title":"CoFiNet: Unveiling camouflaged objects with multi-scale finesse","authors":"Cunhan Guo , Heyan Huang","doi":"10.1016/j.neucom.2024.128763","DOIUrl":"10.1016/j.neucom.2024.128763","url":null,"abstract":"<div><div>Camouflaged Object Detection (COD) is a critical aspect of computer vision aimed at identifying concealed objects, with applications spanning military, industrial, medical and monitoring domains. To address the problem of poor detail segmentation effect, we introduce a novel method for camouflaged object detection, named CoFiNet. Our approach primarily focuses on multi-scale feature fusion and extraction, with special attention to the model’s segmentation effectiveness for detailed features, enhancing its ability to effectively detect camouflaged objects. CoFiNet adopts a coarse-to-fine strategy. A multi-scale feature integration module is laveraged to enhance the model’s capability of fusing context feature. A multi-activation selective kernel module is leveraged to grant the model the ability to autonomously alter its receptive field, enabling it to selectively choose an appropriate receptive field for camouflaged objects of different sizes. During mask generation, we employ the dual-mask strategy for image segmentation, separating the reconstruction of coarse and fine masks, which significantly enhances the model’s learning capacity for details. Comprehensive experiments were conducted on four different datasets, demonstrating that CoFiNet achieves state-of-the-art performance across all datasets. The experiment results of CoFiNet underscore its effectiveness in camouflaged object detection and highlight its potential in various practical application scenarios.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128763"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573295","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128792
Haitao Liu, Xianwei Xin, Jihua Song, Weiming Peng
The multimodal named entity recognition task on social media involves recognizing named entities using both textual and visual information, which is of great significance for information processing. Nevertheless, many existing models still face the following challenges. First, in the process of cross-modal interaction, the attention mechanism sometimes focuses on trivial parts of the images that are not relevant to entities, which not only neglects valuable information but also inevitably introduces visual noise. Second, the gate mechanism is widely used to filter out visual information and reduce the influence of noise on text understanding. However, the gate mechanism neglects fine-grained semantic relevance between modalities, which easily degrades the filtering process. To address these issues, we propose a cross-modal integration framework based on the surprisingly popular algorithm, aiming to enhance the integration of effective visual guidance and reduce the interference of irrelevant visual noise. Specifically, we design a dual-branch interaction module that includes the attention mechanism and the surprisingly popular algorithm, allowing the model to focus on valuable but overlooked parts of the images. Furthermore, we compute the matching degree between modalities at a multi-granularity level, using the Choquet integral to establish a more reasonable basis for filtering out visual noise. We have conducted extensive experiments on public datasets, and the results demonstrate the advantages of our model.
{"title":"CRISP: A cross-modal integration framework based on the surprisingly popular algorithm for multimodal named entity recognition","authors":"Haitao Liu , Xianwei Xin , Jihua Song , Weiming Peng","doi":"10.1016/j.neucom.2024.128792","DOIUrl":"10.1016/j.neucom.2024.128792","url":null,"abstract":"<div><div>The multimodal named entity recognition task on social media involves recognizing named entities with textual and visual information, which is of great significance for information processing. Nevertheless, many existing models still face the following challenges. First, in the process of cross-modal interaction, the attention mechanism sometimes focuses on trivial parts in the images that are not relevant to entities, which not only neglects valuable information but also inevitably introduces visual noise. Second, the gate mechanism is widely used for filtering out visual information to reduce the influence of noise on text understanding. However, the gate mechanism neglects capturing fine-grained semantic relevance between modalities, which easily affects the filtration process. To address these issues, we propose a cross-modal integration framework based on the surprisingly popular algorithm, aiming at enhancing the integration of effective visual guidance and reducing the interference of irrelevant visual noise. Specifically, we design a dual-branch interaction module that includes the attention mechanism and the surprisingly popular algorithm, allowing the model to focus on valuable but overlooked parts in the images. Furthermore, we compute the matching degree between modalities at the multi-granularity level, using the Choquet integral to establish a more reasonable basis for filtering out visual noise. We have conducted extensive experiments on public datasets, and the experimental results demonstrate the advantages of our model.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128792"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142578800","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128784
Tianyou Chen, Hui Ruan, Shaojie Wang, Jin Xiao, Xiaoguang Hu
Camouflaged objects are typically assimilated into their backgrounds and exhibit fuzzy boundaries. The complex environmental conditions and the high intrinsic similarity between camouflaged targets and their surroundings pose significant challenges in accurately locating and segmenting these objects in their entirety. While existing methods have demonstrated remarkable performance in various real-world scenarios, they still face limitations when confronted with difficult cases, such as small targets, thin structures, and indistinct boundaries. Drawing inspiration from human visual perception when observing images containing camouflaged objects, we propose a three-stage model that enables coarse-to-fine segmentation in a single iteration. Specifically, our model employs three decoders to sequentially process subsampled features, cropped features, and high-resolution original features. This proposed approach not only reduces computational overhead but also mitigates interference caused by background noise. Furthermore, considering the significance of multi-scale information, we have designed a multi-scale feature enhancement module that enlarges the receptive field while preserving detailed structural cues. Additionally, a boundary enhancement module has been developed to enhance performance by leveraging boundary information. Subsequently, a mask-guided fusion module is proposed to generate fine-grained results by integrating coarse prediction maps with high-resolution feature maps. Our network shows superior performance without introducing unnecessary complexities. Upon acceptance of the paper, the source code will be made publicly available at https://github.com/clelouch/TSNet.
{"title":"A three-stage model for camouflaged object detection","authors":"Tianyou Chen , Hui Ruan , Shaojie Wang , Jin Xiao , Xiaoguang Hu","doi":"10.1016/j.neucom.2024.128784","DOIUrl":"10.1016/j.neucom.2024.128784","url":null,"abstract":"<div><div>Camouflaged objects are typically assimilated into their backgrounds and exhibit fuzzy boundaries. The complex environmental conditions and the high intrinsic similarity between camouflaged targets and their surroundings pose significant challenges in accurately locating and segmenting these objects in their entirety. While existing methods have demonstrated remarkable performance in various real-world scenarios, they still face limitations when confronted with difficult cases, such as small targets, thin structures, and indistinct boundaries. Drawing inspiration from human visual perception when observing images containing camouflaged objects, we propose a three-stage model that enables coarse-to-fine segmentation in a single iteration. Specifically, our model employs three decoders to sequentially process subsampled features, cropped features, and high-resolution original features. This proposed approach not only reduces computational overhead but also mitigates interference caused by background noise. Furthermore, considering the significance of multi-scale information, we have designed a multi-scale feature enhancement module that enlarges the receptive field while preserving detailed structural cues. Additionally, a boundary enhancement module has been developed to enhance performance by leveraging boundary information. Subsequently, a mask-guided fusion module is proposed to generate fine-grained results by integrating coarse prediction maps with high-resolution feature maps. Our network shows superior performance without introducing unnecessary complexities. Upon acceptance of the paper, the source code will be made publicly available at <span><span>https://github.com/clelouch/TSNet</span><svg><path></path></svg></span>.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128784"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573294","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128777
Zhirui Wang, Liu Yang, Yahong Han
Unsupervised source-free domain adaptation methods aim to transfer knowledge acquired from a labeled source domain to an unlabeled target domain, where the source data are not accessible during target domain adaptation and it is therefore impossible to minimize the domain gap through pairwise computation over samples from the source and target domains. Previous approaches assign pseudo labels to target data using the pre-trained source model and progressively train the target model in a self-learning manner. However, incorrect pseudo labels may adversely affect prediction in the target domain. Furthermore, these approaches overlook the generalization ability of the source model, which primarily affects the initial predictions of the target model. Therefore, we propose an effective framework based on adversarial training to train the target model for source-free domain adaptation. Adversarial training is an effective technique for enhancing the robustness of deep neural networks: by generating anti-adversarial examples and adversarial examples, the pseudo labels of target data can be further corrected, yielding better performance in both accuracy and robustness. Moreover, owing to the inherent distribution difference between the source and target domains, mislabeled target samples inevitably exist. We therefore propose a target sample filtering scheme that refines pseudo labels to further improve prediction on the target domain. Experiments conducted on benchmark tasks demonstrate that the proposed method outperforms existing approaches.
{"title":"Robust source-free domain adaptation with anti-adversarial samples training","authors":"Zhirui Wang, Liu Yang, Yahong Han","doi":"10.1016/j.neucom.2024.128777","DOIUrl":"10.1016/j.neucom.2024.128777","url":null,"abstract":"<div><div>Unsupervised source-free domain adaptation methods aim to transfer knowledge acquired from labeled source domain to an unlabeled target domain, where the source data are not accessible during target domain adaptation and it is prohibited to minimize domain gap by pairwise calculation of the samples from the source and target domains. Previous approaches assign pseudo label to target data using pre-trained source model to progressively train the target model in a self-learning manner. However, incorrect pseudo label may adversely affect prediction in the target domain. Furthermore, they overlook the generalization ability of the source model, which primarily affects the initial prediction of the target model. Therefore, we propose an effective framework based on adversarial training to train the target model for source-free domain adaptation. Specifically, adversarial training is an effective technique to enhance the robustness of deep neural networks. By generating anti-adversarial examples and adversarial examples, the pseudo label of target data can be corrected further by adversarial training and a more optimal performance in both accuracy and robustness is achieved. Moreover, owing to the inherent domain distribution difference between source and target domains, mislabeled target samples exist inevitably. So a target sample filtering scheme is proposed to refine pseudo label to further improve the prediction capability on the target domain. Experiments conducted on benchmark tasks demonstrate that the proposed method outperforms existing approaches.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128777"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142586924","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128755
Mohammad Reza Zarei, Majid Komeili
Few-shot learning (FSL) presents a challenging learning problem in which only a few samples are available for each class. Decision interpretation is more important in few-shot classification due to a greater chance of error compared to traditional classification. However, the majority of the previous FSL methods are black-box models. In this paper, we propose an inherently interpretable model for FSL based on human-friendly attributes. Previously, human-friendly attributes have been utilized to train models with the potential for human interaction and interpretability. However, such approaches are not directly extendible to the few-shot classification scenario. Moreover, we propose an online attribute selection mechanism to effectively filter out irrelevant attributes in each episode. The attribute selection mechanism improves accuracy and helps with interpretability by reducing the number of attributes that participate in each episode. We further propose a mechanism that automatically detects the episodes where the pool of available human-friendly attributes is insufficient, and subsequently augments it by engaging some learned unknown attributes. We demonstrate that the proposed method achieves results on par with black-box few-shot learning models on four widely used datasets. We also empirically evaluate the level of decision alignment between different models and human understanding and show that our model outperforms the comparison methods based on this criterion.
{"title":"Interpretable few-shot learning with online attribute selection","authors":"Mohammad Reza Zarei, Majid Komeili","doi":"10.1016/j.neucom.2024.128755","DOIUrl":"10.1016/j.neucom.2024.128755","url":null,"abstract":"<div><div>Few-shot learning (FSL) presents a challenging learning problem in which only a few samples are available for each class. Decision interpretation is more important in few-shot classification due to a greater chance of error compared to traditional classification. However, the majority of the previous FSL methods are black-box models. In this paper, we propose an inherently interpretable model for FSL based on human-friendly attributes. Previously, human-friendly attributes have been utilized to train models with the potential for human interaction and interpretability. However, such approaches are not directly extendible to the few-shot classification scenario. Moreover, we propose an online attribute selection mechanism to effectively filter out irrelevant attributes in each episode. The attribute selection mechanism improves accuracy and helps with interpretability by reducing the number of attributes that participate in each episode. We further propose a mechanism that automatically detects the episodes where the pool of available human-friendly attributes is insufficient, and subsequently augments it by engaging some learned unknown attributes. We demonstrate that the proposed method achieves results on par with black-box few-shot learning models on four widely used datasets. We also empirically evaluate the level of decision alignment between different models and human understanding and show that our model outperforms the comparison methods based on this criterion.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128755"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142592862","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128773
Zehua Hao, Fang Liu, Licheng Jiao, Yaoyang Du, Shuo Li, Hao Wang, Pengfang Li, Xu Liu, Puhua Chen
In the current landscape of Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP, the predominant approach is based on prompt learning paradigms. These methods incur significant computational cost when dealing with a large number of categories. Additionally, when confronted with new classification tasks, the prompts must be learned again, which can be both time-consuming and resource-intensive. To address these challenges, we present a new methodology, named the Mixture of Pretrained Experts (MoPE), which enhances compositional zero-shot learning through logit-level fusion with a multi-expert fusion module. MoPE blends the benefits of large pre-trained models such as CLIP, BERT, GPT-3, and Word2Vec to tackle compositional zero-shot learning effectively. First, we extract the text label space of each language model individually, then map the visual feature vectors into the respective text spaces. This maintains the integrity and structure of each original text space. During this process, the pre-trained expert parameters are kept frozen. The mappings of visual features to the corresponding text spaces are learned and can be regarded as multiple learnable visual experts. In the model fusion phase, we propose a new fusion strategy featuring a gating mechanism that dynamically adjusts the contributions of the individual models. This enables our approach to adapt more effectively to a range of tasks and datasets. The method’s robustness stems from the fact that the language models are not tailored to specific downstream datasets or losses, which preserves the larger models’ topology and broadens the potential for application. Preliminary experiments conducted on the UT-Zappos, AO-Clever, and C-GQA datasets indicate that MoPE performs competitively compared to existing techniques.
{"title":"Preserving text space integrity for robust compositional zero-shot learning via mixture of pretrained experts","authors":"Zehua Hao, Fang Liu, Licheng Jiao, Yaoyang Du, Shuo Li, Hao Wang, Pengfang Li, Xu Liu, Puhua Chen","doi":"10.1016/j.neucom.2024.128773","DOIUrl":"10.1016/j.neucom.2024.128773","url":null,"abstract":"<div><div>In the current landscape of Compositional Zero-Shot Learning (CZSL) methods that leverage CLIP, the predominant approach is based on prompt learning paradigms. These methods encounter significant computational complexity when dealing with a large number of categories. Additionally, when confronted with new classification tasks, there is a necessity to learn the prompts again, which can be both time-consuming and resource-intensive. To address these challenges, We present a new methodology, named the <strong>M</strong>ixture of <strong>P</strong>retrained <strong>E</strong>xpert (MoPE), for enhancing Compositional Zero-shot Learning through Logit-Level Fusion with Multi Expert Fusion Module. The MoPE skillfully blends the benefits of extensive pre-trained models like CLIP, Bert, GPT-3 and Word2Vec for effectively tackling Compositional Zero-shot Learning. Firstly, we extract the text label space for each language model individually, then map the visual feature vectors to their respective text spaces. This maintains the integrity and structure of the original text space. During this process, the pre-trained expert parameters are kept frozen. The mapping of visual features to the corresponding text spaces is subject to learning and could be considered as multiple learnable visual experts. In the model fusion phase, we propose a new fusion strategy that features a gating mechanism that adjusts the contributions of various models dynamically. This enables our approach to adapt more effectively to a range of tasks and data sets. The method’s robustness is demonstrated by the fact that the language model is not tailored to specific downstream task datasets or losses. This preserves the larger model’s topology and expands the potential for application. Preliminary experiments conducted on the UT-Zappos, AO-Clever, and C-GQA datasets indicate that MoPE performs competitively when compared to existing techniques.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128773"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142660661","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128796
Yumna Zahid, Christine Zarges, Bernie Tiddeman, Jungong Han
Few-shot anomaly detection for video surveillance is challenging due to the diverse nature of target domains. Existing methodologies treat it as a one-class classification problem, training on a reduced sample of nominal scenes. The focus is on either reconstructive or predictive frame methodologies to learn a manifold against which outliers can be detected during inference. We posit that the quality of image reconstruction or future-frame prediction is inherently important for identifying anomalous pixels in video frames. In this paper, we enhance image synthesis and mode coverage for video anomaly detection (VAD) by integrating a Denoising Diffusion model with a future frame prediction model. Our novel VAD pipeline includes a Generative Adversarial Network combined with denoising diffusion to learn the underlying non-anomalous data distribution and generate high-fidelity future-frame samples in a single step. We further regularize the image reconstruction with perceptual quality metrics such as the Multi-scale Structural Similarity Index Measure and the Peak Signal-to-Noise Ratio, ensuring high-quality output under few episodic training iterations. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques across multiple benchmarks, validating that high-quality image synthesis in frame prediction leads to robust anomaly detection in videos.
{"title":"Adversarial diffusion for few-shot scene adaptive video anomaly detection","authors":"Yumna Zahid , Christine Zarges , Bernie Tiddeman , Jungong Han","doi":"10.1016/j.neucom.2024.128796","DOIUrl":"10.1016/j.neucom.2024.128796","url":null,"abstract":"<div><div>Few-shot anomaly detection for video surveillance is challenging due to the diverse nature of target domains. Existing methodologies treat it as a one-class classification problem, training on a reduced sample of nominal scenes. The focus is on either reconstructive or predictive frame methodologies to learn a manifold against which outliers can be detected during inference. We posit that the quality of image reconstruction or future frame prediction is inherently important in identifying anomalous pixels in video frames. In this paper, we enhance the image synthesis and mode coverage for video anomaly detection (VAD) by integrating a <em>Denoising Diffusion</em> model with a future frame prediction model. Our novel VAD pipeline includes a <em>Generative Adversarial Network</em> combined with denoising diffusion to learn the underlying non-anomalous data distribution and generate in one-step high fidelity future-frame samples. We further regularize the image reconstruction with perceptual quality metrics such as <em>Multi-scale Structural Similarity Index Measure</em> and <em>Peak Signal-to-Noise Ratio</em>, ensuring high-quality output under few episodic training iterations. Extensive experiments demonstrate that our method outperforms state-of-the-art techniques across multiple benchmarks, validating that high-quality image synthesis in frame prediction leads to robust anomaly detection in videos.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128796"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142586925","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128788
Weide Liu, Jieming Lou, Xingxing Wang, Wei Zhou, Jun Cheng, Xulei Yang
Open vocabulary segmentation is a challenging task that aims to segment objects from thousands of unseen categories. Directly applying CLIP to open-vocabulary semantic segmentation is challenging due to the granularity gap between its image-level contrastive learning and the pixel-level recognition required for segmentation. To address these challenges, we propose a unified pipeline that leverages physical structure regularization to enhance the generalizability and robustness of open vocabulary segmentation. By incorporating physical structure information, which is independent of the training data, we aim to reduce bias and improve the model’s performance on unseen classes. We utilize low-level structures such as edges and keypoints as regularization terms, as they are easier to obtain and strongly correlated with segmentation boundary information. These structures are used as pseudo-ground truth to supervise the model. Furthermore, inspired by the effectiveness of comparative learning in human cognition, we introduce a weighted patched alignment loss. This loss function contrasts similar and dissimilar samples to acquire low-dimensional representations that capture the distinctions between different object classes. By incorporating physical knowledge and leveraging the weighted patched alignment loss, we aim to improve the model’s generalizability, robustness, and ability to recognize diverse object classes. Experiments on the COCO Stuff, Pascal VOC, Pascal Context-59, Pascal Context-459, ADE20K-150, and ADE20K-847 datasets demonstrate that our proposed method consistently improves over baselines and achieves a new state of the art in the open vocabulary segmentation task.
{"title":"Physically-guided open vocabulary segmentation with weighted patched alignment loss","authors":"Weide Liu , Jieming Lou , Xingxing Wang , Wei Zhou , Jun Cheng , Xulei Yang","doi":"10.1016/j.neucom.2024.128788","DOIUrl":"10.1016/j.neucom.2024.128788","url":null,"abstract":"<div><div>Open vocabulary segmentation is a challenging task that aims to segment out the thousands of unseen categories. Directly applying CLIP to open-vocabulary semantic segmentation is challenging due to the granularity gap between its image-level contrastive learning and the pixel-level recognition required for segmentation. To address these challenges, we propose a unified pipeline that leverages physical structure regularization to enhance the generalizability and robustness of open vocabulary segmentation. By incorporating physical structure information, which is independent of the training data, we aim to reduce bias and improve the model’s performance on unseen classes. We utilize low-level structures such as edges and keypoints as regularization terms, as they are easier to obtain and strongly correlated with segmentation boundary information. These structures are used as pseudo-ground truth to supervise the model. Furthermore, inspired by the effectiveness of comparative learning in human cognition, we introduce the weighted patched alignment loss. This loss function contrasts similar and dissimilar samples to acquire low-dimensional representations that capture the distinctions between different object classes. By incorporating physical knowledge and leveraging weighted patched alignment loss, we aim to improve the model’s generalizability, robustness, and capability to recognize diverse object classes. The experiments on the COCO Stuff, Pascal VOC, Pascal Context-59, Pascal Context-459, ADE20K-150, and ADE20K-847 datasets demonstrate that our proposed method consistently improves baselines and achieves new state-of-the-art in the open vocabulary segmentation task.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128788"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573291","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128772
Ziting Wen, Oscar Pizarro, Stefan Williams
Training deep models with limited annotations poses a significant challenge when applied to diverse practical domains. Employing semi-supervised learning alongside a self-supervised model offers the potential to enhance label efficiency. However, this approach faces a bottleneck in reducing the need for labels: we observed that the semi-supervised model disrupts valuable information from self-supervised learning when only limited labels are available. To address this issue, this paper proposes a simple yet effective framework, active self-semi-supervised learning (AS3L). AS3L bootstraps semi-supervised models with prior pseudo-labels (PPL), which are obtained by label propagation over self-supervised features. We observe that the accuracy of PPL is affected not only by the quality of the features but also by the selection of the labeled samples. We therefore develop active learning and label propagation strategies to obtain accurate PPL. Consequently, our framework can significantly improve model performance under limited annotations while demonstrating fast convergence. On image classification tasks across four datasets, our method outperforms the baseline by an average of 5.4%. Additionally, it reaches the same accuracy as the baseline method in about one third of the training time.
{"title":"Active self-semi-supervised learning for few labeled samples","authors":"Ziting Wen , Oscar Pizarro , Stefan Williams","doi":"10.1016/j.neucom.2024.128772","DOIUrl":"10.1016/j.neucom.2024.128772","url":null,"abstract":"<div><div>Training deep models with limited annotations poses a significant challenge when applied to diverse practical domains. Employing semi-supervised learning alongside the self-supervised model offers the potential to enhance label efficiency. However, this approach faces a bottleneck in reducing the need for labels. We observed that the semi-supervised model disrupts valuable information from self-supervised learning when only limited labels are available. To address this issue, this paper proposes a simple yet effective framework, active self-semi-supervised learning (AS3L). AS3L bootstraps semi-supervised models with prior pseudo-labels (PPL). These PPLs are obtained by label propagation over self-supervised features. Based on the observations the accuracy of PPL is not only affected by the quality of features but also by the selection of the labeled samples. We develop active learning and label propagation strategies to obtain accurate PPL. Consequently, our framework can significantly improve the performance of models in the case of limited annotations while demonstrating fast convergence. On the image classification tasks across four datasets, our method outperforms the baseline by an average of 5.4%. Additionally, it achieves the same accuracy as the baseline method in about 1/3 of the training time.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128772"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573218","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"OA","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}
Pub Date: 2024-10-28 | DOI: 10.1016/j.neucom.2024.128757
Tung Nguyen, Tung Pham, Linh Ngo Van, Ha-Bang Ban, Khoat Than
Topic models have become ubiquitous tools for analyzing streaming data. However, existing streaming topic models suffer from several limitations when applied to real-world data streams, including the inability to accommodate evolving vocabularies and to control topic quality throughout the streaming process. In this paper, we propose a novel streaming topic modeling approach that dynamically adapts to the changing nature of data streams. Our method leverages Byte-Pair Encoding embeddings (BPEmb) to resolve the out-of-vocabulary problem that arises with new words in the stream. Additionally, we introduce a topic change variable that provides fine-grained control over topics’ parameter updates, and we present a preservation approach that retains high-coherence topics at each time step, helping preserve semantic quality. To further enhance model adaptability, our method allows dynamic adjustment of the topic space size as needed. To the best of our knowledge, we are the first to address vocabulary expansion and topic quality maintenance during the streaming process. Extensive experiments show the superior effectiveness of our method.
{"title":"Out-of-vocabulary handling and topic quality control strategies in streaming topic models","authors":"Tung Nguyen , Tung Pham , Linh Ngo Van, Ha-Bang Ban, Khoat Than","doi":"10.1016/j.neucom.2024.128757","DOIUrl":"10.1016/j.neucom.2024.128757","url":null,"abstract":"<div><div>Topic models have become ubiquitous tools for analyzing streaming data. However, existing streaming topic models suffer from several limitations when applied to real-world data streams. This includes the inability to accommodate evolving vocabularies and control topic quality throughout the streaming process. In this paper, we propose a novel streaming topic modeling approach that dynamically adapts to the changing nature of data streams. Our method leverages Byte-Pair Encoding embedding (BPEmb) to resolve the out-of-vocabulary problem that arises with new words in the stream. Additionally, we introduce a topic change variable that provides fine-grained control over topics’ parameter updates and present a preservation approach to retain high-coherence topics at each time step, helping preserve semantic quality. To further enhance model adaptability, our method allows dynamical adjustment of topic space size as needed. To the best of our knowledge, we are the first to address the expansion of vocabulary and maintain topic quality during the streaming process. Extensive experiments show the superior effectiveness of our method.</div></div>","PeriodicalId":19268,"journal":{"name":"Neurocomputing","volume":"614 ","pages":"Article 128757"},"PeriodicalIF":5.5,"publicationDate":"2024-10-28","publicationTypes":"Journal Article","fieldsOfStudy":null,"isOpenAccess":false,"openAccessPdf":"","citationCount":null,"resultStr":null,"platform":"Semanticscholar","paperid":"142573249","PeriodicalName":null,"FirstCategoryId":null,"ListUrlMain":null,"RegionNum":2,"RegionCategory":"计算机科学","ArticlePicture":[],"TitleCN":null,"AbstractTextCN":null,"PMCID":"","EPubDate":null,"PubModel":null,"JCR":null,"JCRName":null,"Score":null,"Total":0}