
Latest articles in Pattern Recognition

RFAConv: Receptive-field attention convolution for improving convolutional neural networks
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-05 DOI: 10.1016/j.patcog.2026.113208
Xin Zhang , Chen Liu , Tingting Song , Degang Yang , Yichen Ye , Ke Li , Yingze Song
In the realm of deep learning, spatial attention mechanisms have emerged as a vital method for enhancing the performance of convolutional neural networks. However, these mechanisms possess inherent limitations that cannot be overlooked. This work delves into the mechanism of spatial attention and reveals a new insight: the mechanism essentially addresses the issue of convolutional parameter sharing. By addressing this issue, the convolutional kernel can efficiently extract features by employing varying weights at distinct locations. However, current spatial attention mechanisms focus on reweighting spatial features through attention, which is insufficient to address the fundamental challenge of parameter sharing in convolutions involving larger kernels. In response to this challenge, we introduce a novel attention mechanism known as Receptive-Field Attention (RFA). Compared to existing spatial attention methods, RFA not only concentrates on receptive-field spatial features but also provides effective attention weights for large convolutional kernels. Building upon the RFA concept, a Receptive-Field Attention Convolution (RFAConv) is proposed to supplant conventional standard convolution. Notably, it adds an almost negligible increment in computational overhead and parameters while significantly improving network performance. Furthermore, this work reveals that current spatial attention mechanisms need to prioritize receptive-field spatial features more strongly to optimize network performance. To validate the advantages of the proposed methods, we conduct extensive experiments across several authoritative datasets, including ImageNet, Places365, COCO, VOC, and Roboflow. The results demonstrate that the proposed methods bring significant improvements in tasks such as image classification, object detection, and semantic segmentation, surpassing convolutional operations constructed with current spatial attention mechanisms. The code and pre-trained models for the associated tasks are publicly available at https://github.com/Liuchen1997/RFAConv.
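As an illustration of the receptive-field attention idea, here is a minimal sketch (not the authors' released implementation, which is linked above): each k x k receptive field is unfolded, attention weights are computed per position within the field, and a stride-k convolution aggregates the reweighted field, so every spatial location is effectively convolved with its own weighting. Module names and tensor shapes are illustrative assumptions.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class RFAConvSketch(nn.Module):
    """Minimal sketch of a receptive-field attention convolution (k x k field)."""
    def __init__(self, in_ch, out_ch, k=3, stride=1):
        super().__init__()
        self.k, self.stride = k, stride
        # one attention logit per kernel position and per channel
        self.att = nn.Sequential(
            nn.AvgPool2d(k, stride=stride, padding=k // 2),
            nn.Conv2d(in_ch, in_ch * k * k, kernel_size=1, groups=in_ch),
        )
        # aggregates the reweighted k x k field with a stride-k convolution
        self.agg = nn.Conv2d(in_ch, out_ch, kernel_size=k, stride=k)

    def forward(self, x):
        b, c, h, w = x.shape
        k, s = self.k, self.stride
        # unfold every k x k receptive field: (b, c*k*k, L)
        patches = F.unfold(x, kernel_size=k, stride=s, padding=k // 2)
        out_h = (h + 2 * (k // 2) - k) // s + 1
        out_w = (w + 2 * (k // 2) - k) // s + 1
        patches = patches.view(b, c, k * k, out_h, out_w)
        # attention over the k*k positions of each receptive field
        logits = self.att(x).view(b, c, k * k, out_h, out_w)
        weighted = patches * logits.softmax(dim=2)
        # lay each field out spatially, then aggregate with a stride-k conv
        weighted = weighted.view(b, c, k, k, out_h, out_w)
        weighted = weighted.permute(0, 1, 4, 2, 5, 3).reshape(b, c, out_h * k, out_w * k)
        return self.agg(weighted)

x = torch.randn(2, 16, 32, 32)
print(RFAConvSketch(16, 32)(x).shape)  # torch.Size([2, 32, 32, 32])
```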
Citations: 0
MTSCL-Net: Multi-level temporal spatial contrastive learning for robust breast tumor segmentation in DCE-MRI
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-05 DOI: 10.1016/j.patcog.2026.113225
Jiezhou He , Qi Wen , Zhiming Luo , Xue Zhao , Songzhi Su , Guojun Zhang , Shaozi Li
Accurate and automated tumor segmentation in dynamic contrast-enhanced magnetic resonance imaging (DCE-MRI) is vital for breast cancer diagnosis and treatment. However, the substantial heterogeneity of breast cancer, the varied tumor sizes, shapes, and appearances, and artifacts in DCE-MRI data pose significant challenges for accurate segmentation. Furthermore, current approaches encounter difficulties in cross-dataset scenarios with different image acquisition protocols and devices, owing to ineffective representation of essential tumor features. To address these challenges, we propose a Multi-level Temporal-Spatial Contrastive Learning Network (MTSCL-Net) for breast tumor segmentation in DCE-MRI. Our method introduces a novel multi-level temporal-spatial contrastive loss to enhance feature representation across multiple layers and temporal sequences. Additionally, we design a feature-sharing encoding structure with tumor-invariant feature perception, reducing parameters while maintaining consistent spatial feature extraction. A temporal fusion module integrates sequence features, further reducing parameter count and complexity. Extensive experiments on two public datasets demonstrate the superiority of our approach over recent state-of-the-art methods. To explore generalization across different centers, we trained our method on a public dataset (DUKE) and tested it on another public dataset (YUN, collected from seven centers) and two private datasets. The results verify the robustness and effectiveness of our approach in addressing both within-domain and cross-domain challenges.
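A minimal sketch of the kind of multi-level temporal-spatial contrastive objective the abstract describes, assuming an InfoNCE-style loss applied to feature maps taken from several encoder levels and two temporal phases of the same study; the paper's exact formulation may differ, and all names and shapes here are illustrative.

```python
import torch
import torch.nn.functional as F

def info_nce(z1, z2, temperature=0.1):
    """Standard InfoNCE between two batches of embeddings (positives are row-aligned)."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature                    # (B, B) similarity matrix
    targets = torch.arange(z1.size(0), device=z1.device)
    return F.cross_entropy(logits, targets)

def multi_level_temporal_contrastive_loss(feats_t1, feats_t2, level_weights=None):
    """feats_t1 / feats_t2: lists of (B, C_l, H_l, W_l) feature maps from two
    temporal phases of the same DCE-MRI study, one entry per encoder level."""
    if level_weights is None:
        level_weights = [1.0] * len(feats_t1)
    loss = 0.0
    for w, f1, f2 in zip(level_weights, feats_t1, feats_t2):
        # global-average-pool each level to a vector before the contrastive term
        loss = loss + w * info_nce(f1.mean(dim=(2, 3)), f2.mean(dim=(2, 3)))
    return loss / sum(level_weights)

# toy usage: three encoder levels, batch of 4
feats_a = [torch.randn(4, c, s, s) for c, s in [(32, 64), (64, 32), (128, 16)]]
feats_b = [torch.randn(4, c, s, s) for c, s in [(32, 64), (64, 32), (128, 16)]]
print(multi_level_temporal_contrastive_loss(feats_a, feats_b))
```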
Citations: 0
Understanding multimodal sentiment with deep modality interaction learning
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-05 DOI: 10.1016/j.patcog.2026.113236
Jie Mu , Jing Zhang , Jian Xu , Wenqi Liu , Zhizheng Sun , Wei Wang
Multimodal sentiment analysis (MSA), which detects sentiment polarities in multimodal data, is a crucial task in data mining and pattern recognition. Most MSA methods apply attention mechanisms to achieve better performance. However, attention-based methods have two limitations: (1) their attention-based encoding modules fail to fully consider the semantically related information shared by image and text, which prevents the models from discovering sentiment features; (2) they consider either intra-modal or inter-modal interaction but not both simultaneously, which prevents the models from fusing features of different modalities well. To overcome these limitations, this paper proposes a deep modality interaction network (DMINet) for understanding multimodal sentiment. First, we propose a cross-modal information interaction strategy that preserves semantically related information by maximizing the mutual information between image and text. Second, we design an image-text interactive graph module that considers intra- and inter-modal interaction simultaneously by constructing a cross-modal graph. In addition, to address the difficulty of computing mutual information, we derive a cross-modal sub-boundary for its computation. Experimental results on 4 publicly available multimodal datasets demonstrate that DMINet outperforms 18 existing methods in multimodal sentiment analysis, achieving up to a 19-percentage-point improvement over several baseline models.
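A minimal sketch of cross-modal mutual-information maximization, using the common symmetric InfoNCE lower bound as a stand-in objective; the paper derives its own cross-modal sub-boundary, which is not reproduced here, and the embedding dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def cross_modal_infonce(img_emb, txt_emb, temperature=0.07):
    """Symmetric InfoNCE loss; minimizing it maximizes an InfoNCE-style lower
    bound on the mutual information between paired image and text embeddings."""
    img = F.normalize(img_emb, dim=1)
    txt = F.normalize(txt_emb, dim=1)
    logits = img @ txt.t() / temperature                  # (B, B); diagonal = true pairs
    targets = torch.arange(img.size(0), device=img.device)
    return 0.5 * (F.cross_entropy(logits, targets) +
                  F.cross_entropy(logits.t(), targets))

# toy usage: a batch of 8 aligned image/text embedding pairs
img_emb = torch.randn(8, 256)
txt_emb = torch.randn(8, 256)
print(cross_modal_infonce(img_emb, txt_emb))
```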
Citations: 0
Noise-aware cross attention for image manipulation localization
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-05 DOI: 10.1016/j.patcog.2026.113164
Hongshi Zhang , Tonghua Su , Zhou Liu , Fuxiang Yang , Donglin Di , Yang Song , Lei Fan
Modern image manipulation techniques have achieved visual realism that often deceives the human eye and semantic-based detectors. However, manipulation operations typically disturb the intrinsic statistical properties of images. Unlike high-level semantic content, which remains visually consistent, such disturbances manifest as anomalies in noise characteristics, including inconsistencies in sensor pattern noise, distinct high-frequency residuals, and unnatural frequency-domain artifacts introduced by resampling or synthesis. These subtle forensic cues provide more reliable evidence for manipulation localization but are often suppressed by standard RGB-domain feature extractors. Existing image manipulation localization (IML) methods often rely on a single noise feature extraction strategy or treat all tampering techniques uniformly, leading to two major limitations: incomplete noise characterization and insufficient tampering-type awareness. We propose a Noise-aware Contrastive localization Network (NC-Net), which introduces two key modules. The first is a Gated Noise Extractor, which captures mixed noise-domain patterns using a gated network that combines features derived from BayarConv and Discrete Wavelet Transform (DWT) operations. This extractor is further enhanced by a dual-granularity contrastive learning strategy, which models distributional discrepancies both within images (between manipulated and authentic regions) and across images (among different manipulation types). The second is a Multi-Scale Fusion Module, which adaptively integrates noise-domain and RGB-domain semantic features via a cross-domain attention mechanism and a top-down feature pyramid. A lightweight decoder then produces the final localization map with high precision. NC-Net enables end-to-end joint optimization of the noise-extraction and RGB branches, achieving state-of-the-art performance with competitive computational overhead. Extensive experiments demonstrate its superiority over existing methods. Source code is available at https://github.com/HIT-liar/NC-Net.
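A minimal sketch of the two noise-feature extractors named in the abstract, assuming a Bayar-style constrained convolution (center weight fixed to -1, remaining weights renormalized to sum to 1, so each kernel acts as a prediction-error filter) and a one-level Haar wavelet split of the high-frequency sub-bands; NC-Net's gating, contrastive learning, and fusion are not reproduced.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class BayarConv2d(nn.Module):
    """Constrained convolution for noise-residual extraction (Bayar-style)."""
    def __init__(self, in_ch, out_ch, k=5):
        super().__init__()
        self.k = k
        self.weight = nn.Parameter(torch.rand(out_ch, in_ch, k, k) * 1e-2)

    def _constrained_weight(self):
        w = self.weight.clone()
        c = self.k // 2
        w[:, :, c, c] = 0.0
        w = w / w.sum(dim=(2, 3), keepdim=True)   # off-center weights sum to 1
        w[:, :, c, c] = -1.0                      # center fixed to -1
        return w

    def forward(self, x):
        return F.conv2d(x, self._constrained_weight(), padding=self.k // 2)

def haar_highfreq(x):
    """One-level Haar DWT; returns the three detail (high-frequency) sub-bands."""
    a = x[:, :, 0::2, 0::2]
    b = x[:, :, 0::2, 1::2]
    c = x[:, :, 1::2, 0::2]
    d = x[:, :, 1::2, 1::2]
    d1 = (a - b + c - d) / 2      # detail sub-bands (naming conventions vary)
    d2 = (a + b - c - d) / 2
    d3 = (a - b - c + d) / 2
    return torch.cat([d1, d2, d3], dim=1)

x = torch.randn(1, 3, 64, 64)
noise = BayarConv2d(3, 8)(x)       # (1, 8, 64, 64) prediction-error residuals
freq = haar_highfreq(x)            # (1, 9, 32, 32) high-frequency sub-bands
print(noise.shape, freq.shape)
```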
Citations: 0
Improving episodic few-shot visual question answering via spatial and frequency domain dual-calibration
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-04 DOI: 10.1016/j.patcog.2026.113165
Jing Zhang, Yifan Wei, Yunzuo Hu, Zhe Wang
Considering that frequency-domain information in an image can compensate for the deficiency of spatial-domain information in representing global structure, we propose a novel Dual-domain Feature and Distribution dual-calibration Network (DFDN) for episodic few-shot visual question answering, to achieve a deep and comprehensive understanding of image content and cross-modal reasoning. In DFDN, spatial- and frequency-domain information are mutually calibrated to exploit their complementary advantages, and more effective cross-modal reasoning is achieved through dual calibration of both features and distributions. A dual-domain feature calibration module is proposed, which employs mutual mapping and dynamic masking techniques to extract task-relevant features and calibrate dual-domain information at the feature level. Meanwhile, a new dual-domain mutual distillation distribution calibration module is proposed to achieve mutual calibration of data distributions across the spatial and frequency domains, further improving the cross-modal reasoning ability of DFDN. Experimental results across four public benchmark datasets demonstrate that DFDN achieves excellent performance and outperforms current state-of-the-art methods on episodic few-shot visual question answering. Code is available at an anonymous account: https://github.com/Harold1810/DFDN.
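A minimal sketch of producing paired spatial- and frequency-domain views of a feature map with torch.fft, which is the kind of dual-domain input the calibration modules above operate on; DFDN's mutual mapping, masking, and distillation are not reproduced, and the module below is an assumption for illustration only.

```python
import torch
import torch.nn as nn

class DualDomainSplit(nn.Module):
    """Produce spatial- and frequency-domain views of a feature map."""
    def __init__(self, channels):
        super().__init__()
        self.spatial_proj = nn.Conv2d(channels, channels, kernel_size=3, padding=1)
        # amplitude and phase are stacked, hence 2x channels in
        self.freq_proj = nn.Conv2d(2 * channels, channels, kernel_size=1)

    def forward(self, x):
        spatial = self.spatial_proj(x)
        spec = torch.fft.fft2(x, norm="ortho")            # complex spectrum
        amp, phase = spec.abs(), spec.angle()
        freq = self.freq_proj(torch.cat([amp, phase], dim=1))
        return spatial, freq

x = torch.randn(2, 32, 28, 28)
spatial, freq = DualDomainSplit(32)(x)
print(spatial.shape, freq.shape)   # both (2, 32, 28, 28)
```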
Citations: 0
ETV-Attack: Efficient text-driven visual-variable adversarial attacks on visual question answering with pre-trained language models
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-04 DOI: 10.1016/j.patcog.2026.113202
Quanxing Xu , Ling Zhou , Xian Zhong , Feifei Zhang , Jinyu Tian , Xiaohan Yu , Rubing Huang
An adversarial attack aims to induce a model to produce incorrect outputs, thereby assessing its robustness. With the rapid development of pre-trained language models (PLMs), these models have become increasingly prevalent in various vision-language (VL) tasks. In particular, visual question answering (VQA), which answers questions based on a given image, is a fundamental VL task; thus, evaluating the robustness of PLM-based VQA models under adversarial attacks is of great importance to the VL community. Existing adversarial attacks on VQA mainly generate adversarial examples by directly perturbing images or questions at the image or patch level, but they suffer from limited diversity of alterations and fail to realize object-level perturbations. To address these issues and further advance the robustness evaluation of PLM-based VQA models, we propose a multi-modal Efficient Text-driven Visual-variable Attack (ETV-Attack), which generates more effective visual perturbations in a text-guided manner. Specifically, leveraging the fact that pre-trained text encoders inherently capture visual priors, we introduce an embedding-level operation into the image augmentation pipeline instead of directly manipulating the image. This reduces the complexity of perturbation generation while enabling more flexible augmentations. Experimental results reveal that PLM-based VQA models are vulnerable to such multi-modal perturbations. Motivated by this observation, we further propose ETV Augmentation to improve VQA performance in both conventional architectures and LLM-based approaches.
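A generic sketch of an embedding-level perturbation loop in the spirit described above, assuming a PGD-style update applied directly to frozen text-encoder embeddings so as to increase a model's loss rather than editing pixels; the loss function and tensor shapes are placeholders, not the ETV-Attack procedure itself.

```python
import torch

def perturb_embeddings(emb, loss_fn, steps=5, eps=0.05, alpha=0.01):
    """PGD-style perturbation in embedding space.

    emb     : (B, L, D) embeddings from a frozen text encoder (placeholder shape)
    loss_fn : callable mapping perturbed embeddings to a scalar loss to INCREASE
    """
    delta = torch.zeros_like(emb, requires_grad=True)
    for _ in range(steps):
        loss = loss_fn(emb + delta)
        loss.backward()
        with torch.no_grad():
            delta += alpha * delta.grad.sign()   # ascend the loss
            delta.clamp_(-eps, eps)              # keep the perturbation small
        delta.grad.zero_()
    return (emb + delta).detach()

# toy usage with a stand-in objective: push embeddings away from a target vector
emb = torch.randn(2, 8, 64)
target = torch.randn(64)
adv = perturb_embeddings(emb, lambda e: ((e - target) ** 2).mean())
print(adv.shape)
```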
Citations: 0
Align-then-generate: An effective cross-modal generation paradigm for multi-label zero-shot learning
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-04 DOI: 10.1016/j.patcog.2026.113229
Peirong Ma , Wu Ran , Yanhui Gu , Huaqiu Chen , Zhiquan He , Hong Lu
Multi-label zero-shot learning (MLZSL) aims to recognize multiple unseen class labels that may appear in an image, posing a significant challenge in the field of computer vision. While generative methods have achieved remarkable progress by synthesizing visual features of unseen categories, they often suffer from poor visual-semantic consistency and limited generative quality. To address these issues, this paper proposes a novel “Align-then-Generate” paradigm and introduces a unified framework named VLA-CMG, which integrates vision-language alignment with cross-modal feature generation. Specifically, a Language-aware multi-label image encoder (LMIE) is designed to extract both global and local visual features from images, which are aligned with multi-label semantic embeddings generated by the text encoder of the Vision-language pre-training (VLP) Model, thereby enhancing the consistency between semantic and visual representations. This alignment provides high-quality input for the training of the Dual-stream feature generation network (DSFGN), which synthesizes discriminative visual features for unseen classes. Finally, a robust multi-label zero-shot classifier is built upon the generated features. Extensive experiments on two large-scale benchmark datasets (i.e., NUS-WIDE and Open Images) demonstrate that VLA-CMG consistently outperforms existing state-of-the-art methods on both ZSL and GZSL tasks, validating its effectiveness and superiority.
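A minimal sketch of the "generate" half of such a pipeline, assuming a conditional generator that maps a class semantic embedding plus noise to a synthetic visual feature, which can then be used to train a classifier for unseen classes; the alignment stage, the VLP text encoder, and the actual VLA-CMG architecture are not reproduced, and all dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ConditionalFeatureGenerator(nn.Module):
    """Map (class semantic embedding, noise) -> synthetic visual feature."""
    def __init__(self, sem_dim, noise_dim, feat_dim, hidden=1024):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(sem_dim + noise_dim, hidden),
            nn.LeakyReLU(0.2),
            nn.Linear(hidden, feat_dim),
            nn.ReLU(),   # features from most CNN backbones are non-negative
        )

    def forward(self, sem, n_per_class=1):
        sem = sem.repeat_interleave(n_per_class, dim=0)
        z = torch.randn(sem.size(0), self.noise_dim, device=sem.device)
        return self.net(torch.cat([sem, z], dim=1)), sem

# toy usage: synthesize 5 visual features for each of 3 unseen classes
gen = ConditionalFeatureGenerator(sem_dim=512, noise_dim=64, feat_dim=2048)
unseen_sem = torch.randn(3, 512)    # e.g. text-encoder embeddings of class prompts
feats, conditioned_on = gen(unseen_sem, n_per_class=5)
print(feats.shape)                  # torch.Size([15, 2048])
```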
Citations: 0
MCMTSYN: Predicting anticancer drug synergy via cross-modal feature fusion and multi-task learning
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-04 DOI: 10.1016/j.patcog.2026.113222
Wei Wang , Gaolin Yuan , Bin Sun , Dong Liu , Hongjun Zhang , Guangsheng Wu , Yun Zhou , Xianfang Wang
Current computational models for predicting cancer drug synergy are mostly based on biological prior knowledge. However, the lack of real-world features and insufficient representation of new drug characteristics present significant challenges to prediction tasks, requiring models with a stronger ability to extract rich and comprehensive features. To address this issue, we propose a new method, MCMTSYN, based on cross-modal feature fusion and multi-task learning for predicting cancer drug synergy. MCMTSYN begins by utilizing a self-modal feature extraction module to reduce high-dimensional data and generate dense embeddings for both the multi-omics features of cell lines and the properties of drugs. Next, a cross-modal feature fusion module is employed to dynamically capture potential mutual information between drugs and cell lines across various modalities. Finally, multi-task learning techniques are applied to optimize four distinct tasks: regression and classification tasks for predicting anticancer drug synergy, and regression and classification tasks for predicting drug sensitivity. By sharing parameters and balancing the loss across different tasks, MCMTSYN maximizes the utilization of the original information, thereby enhancing feature representation and bolstering the model’s generalization capabilities. Regression and classification experiments on benchmark datasets show that MCMTSYN outperforms seven comparative methods. Specifically, MCMTSYN achieved an MSE of 218.20±38.53 and a PCC of 0.76±0.02 in 5-fold cross-validation on the regression task, and an AUC of 0.91±0.02 and an ACC of 0.94±0.01 in 5-fold cross-validation on the classification task.
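A minimal sketch of the four-head multi-task arrangement described above, assuming a shared trunk over an already-fused drug/cell-line embedding and a fixed weighted sum of losses; MCMTSYN's fusion modules and loss balancing are more elaborate, and all names and dimensions are illustrative.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MultiTaskSynergyHead(nn.Module):
    """Shared trunk plus four heads: synergy (reg/cls) and sensitivity (reg/cls)."""
    def __init__(self, in_dim, hidden=256):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU())
        self.synergy_reg = nn.Linear(hidden, 1)
        self.synergy_cls = nn.Linear(hidden, 1)
        self.sens_reg = nn.Linear(hidden, 1)
        self.sens_cls = nn.Linear(hidden, 1)

    def forward(self, fused):
        h = self.trunk(fused)
        return (self.synergy_reg(h).squeeze(1), self.synergy_cls(h).squeeze(1),
                self.sens_reg(h).squeeze(1), self.sens_cls(h).squeeze(1))

def multitask_loss(preds, targets, weights=(1.0, 1.0, 0.5, 0.5)):
    syn_r, syn_c, sen_r, sen_c = preds
    t_syn_r, t_syn_c, t_sen_r, t_sen_c = targets
    return (weights[0] * F.mse_loss(syn_r, t_syn_r)
            + weights[1] * F.binary_cross_entropy_with_logits(syn_c, t_syn_c)
            + weights[2] * F.mse_loss(sen_r, t_sen_r)
            + weights[3] * F.binary_cross_entropy_with_logits(sen_c, t_sen_c))

fused = torch.randn(16, 512)   # stand-in for a fused drug-pair + cell-line embedding
targets = (torch.randn(16), torch.randint(0, 2, (16,)).float(),
           torch.randn(16), torch.randint(0, 2, (16,)).float())
model = MultiTaskSynergyHead(512)
print(multitask_loss(model(fused), targets))
```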
Citations: 0
CUDiff: Consistency and uncertainty guided conditional diffusion for infrared and visible image fusion
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-04 DOI: 10.1016/j.patcog.2026.113174
Yueying Luo, Kangjian He, Dan Xu
Infrared and visible image fusion aims to integrate complementary information from both modalities to produce more informative and visually coherent images. Although many existing methods focus on incorporating enhancement modules to improve model efficiency, few effectively address the challenges of learning in complex or ambiguous regions. In this paper, we propose CUDiff, a novel framework that leverages the powerful generative capabilities of diffusion models to reformulate the fusion process as a conditional generation task. Specifically, we design a conditional diffusion model that extracts and integrates relevant features from infrared and visible modalities. A content-consistency constraint is introduced to preserve the structural integrity of the source images, ensuring that essential information is retained in the fused output. Moreover, an uncertainty-driven mechanism adaptively refines and enhances uncertain regions, improving the overall quality and expressiveness of the fused images. Extensive experiments demonstrate that CUDiff surpasses 12 state-of-the-art methods in both visual quality and quantitative evaluation. Furthermore, CUDiff achieves superior performance in object detection tasks. The source code is available at: https://github.com/VCMHE/CUDiff
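A minimal sketch of one conditional-diffusion training step in the spirit described above, assuming the denoiser predicts the noise added to a stand-in fusion target while conditioned on concatenated infrared and visible inputs; the toy network and the omission of CUDiff's consistency and uncertainty terms are deliberate simplifications.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

T = 1000
betas = torch.linspace(1e-4, 0.02, T)
alpha_bar = torch.cumprod(1.0 - betas, dim=0)          # cumulative alpha_bar_t

class TinyConditionalDenoiser(nn.Module):
    """Toy eps-predictor: input = noisy target (1ch) + IR (1ch) + visible (3ch)."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(1 + 1 + 3, 32, 3, padding=1), nn.SiLU(),
            nn.Conv2d(32, 1, 3, padding=1),
        )

    def forward(self, x_t, ir, vis, t):
        # t is ignored in this toy model; real denoisers embed the timestep
        return self.net(torch.cat([x_t, ir, vis], dim=1))

def diffusion_loss(model, x0, ir, vis):
    b = x0.size(0)
    t = torch.randint(0, T, (b,))
    ab = alpha_bar[t].view(b, 1, 1, 1)
    noise = torch.randn_like(x0)
    x_t = ab.sqrt() * x0 + (1 - ab).sqrt() * noise      # forward noising
    return F.mse_loss(model(x_t, ir, vis, t), noise)    # predict the added noise

ir, vis = torch.randn(2, 1, 64, 64), torch.randn(2, 3, 64, 64)
x0 = torch.randn(2, 1, 64, 64)                          # stand-in fusion target
print(diffusion_loss(TinyConditionalDenoiser(), x0, ir, vis))
```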
Citations: 0
PGRM: Positive-unlabeled enhanced recommendation model based on generative adversarial network
IF 7.6 CAS Tier 1, Computer Science Q1 COMPUTER SCIENCE, ARTIFICIAL INTELLIGENCE Pub Date: 2026-02-03 DOI: 10.1016/j.patcog.2026.113209
Jiangzhou Deng , Huilin Jin , Zhiqiang Zhang , Jianmei Ye , Yong Wang , Leo Yu Zhang , Kobiljon Kh. Khushvakhtzoda
As information overload escalates, recommendation systems play a crucial role in personalized information filtering. Traditional recommendation systems face substantial challenges from implicit feedback, particularly due to the imbalance between positive and negative samples as well as noisy data. To tackle these challenges, we propose a novel Positive-Unlabeled Enhanced Recommendation Model based on Generative Adversarial Network (GAN), named PGRM. This model combines the Positive-Unlabeled (PU) learning and adversarial learning. By analyzing the distributions of positive and unlabeled samples, it can more accurately identify potential reliable negative samples, thus improving the GAN’s training process. Furthermore, we introduce a spy mechanism in the PU learning technique and a hybrid negative sampling strategy to further improve the recommendation accuracy. The spy mechanism enhances the model’s ability to learn from negative samples through a PU-assisted loss function, while the hybrid negative sampling strategy effectively captures user preferences by combining hard and weak negative samples, thereby reducing interference from noisy data. Extensive experiments on four public datasets show that the proposed model PGRM outperforms state-of-the-art comparison models, with average gains of 4.8% in HR@5 and 4.0% in NDCG@5, especially under sparse and noisy conditions, demonstrating its effectiveness and generalization.
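A minimal sketch of the classic spy technique that the PU component builds on: a fraction of known positives is hidden inside the unlabeled pool, a preliminary classifier is trained, and unlabeled items scoring below nearly all spies are kept as reliable negatives. PGRM's GAN, PU-assisted loss, and hybrid negative sampling are not reproduced, and the data here is synthetic.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def reliable_negatives_via_spies(X_pos, X_unlabeled, spy_frac=0.15,
                                 quantile=0.05, seed=0):
    """Return a boolean mask over X_unlabeled marking likely reliable negatives."""
    rng = np.random.default_rng(seed)
    n_spies = max(1, int(spy_frac * len(X_pos)))
    spy_idx = rng.choice(len(X_pos), size=n_spies, replace=False)
    spies = X_pos[spy_idx]
    keep = np.setdiff1d(np.arange(len(X_pos)), spy_idx)

    # step 1: treat (unlabeled + spies) as the negative class and fit a classifier
    X = np.vstack([X_pos[keep], X_unlabeled, spies])
    y = np.concatenate([np.ones(len(keep)), np.zeros(len(X_unlabeled) + n_spies)])
    clf = LogisticRegression(max_iter=1000).fit(X, y)

    # step 2: threshold = low quantile of the spies' positive-class scores;
    # unlabeled samples scoring below it are very unlikely to be hidden positives
    threshold = np.quantile(clf.predict_proba(spies)[:, 1], quantile)
    return clf.predict_proba(X_unlabeled)[:, 1] < threshold

# toy usage with two Gaussian blobs standing in for user-item interactions
rng = np.random.default_rng(1)
X_pos = rng.normal(loc=1.0, size=(200, 8))
X_unl = np.vstack([rng.normal(loc=1.0, size=(100, 8)),    # hidden positives
                   rng.normal(loc=-1.0, size=(300, 8))])  # true negatives
mask = reliable_negatives_via_spies(X_pos, X_unl)
print(mask.sum(), "reliable negatives identified out of", len(X_unl))
```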
Citations: 0