Data-efficient generalization for zero-shot composed image retrieval
Zining Chen, Zhicheng Zhao, Fei Su, Shijian Lu
Pattern Recognition, Volume 176, Article 113187. Pub Date: 2026-08-01; Epub Date: 2026-01-28; DOI: 10.1016/j.patcog.2026.113187
Zero-shot Composed Image Retrieval (ZS-CIR) aims to retrieve a target image from a reference image and a text description without requiring in-distribution triplets for training. One prevalent approach follows the vision-language pretraining paradigm and employs a mapping network to map the image embedding to a pseudo-word token in the text embedding space. However, this approach tends to impede network generalization because of the modality discrepancy and the distribution shift between training and inference. To this end, we propose a Data-efficient Generalization (DeG) framework with two novel designs, namely a Textual Supplement (TS) module and a Semantic Sample Pool (SSP) module. The TS module exploits compositional textual semantics during training, enriching the pseudo-word token with linguistic semantics and thus effectively mitigating the modality discrepancy. The SSP module exploits the zero-shot capability of pretrained Vision-Language Models (VLMs), alleviating the distribution shift and mitigating overfitting caused by redundancy in large-scale image-text data. Extensive experiments on four ZS-CIR benchmarks show that DeG outperforms state-of-the-art (SOTA) methods with much less training data while saving substantial training and inference time in practical usage.
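As a rough illustration of the pseudo-word-token idea described above (not the authors' DeG code), the following PyTorch sketch maps a frozen VLM image embedding to a pseudo-word token and prepends it to an embedded relative caption; the module names, dimensions, and prompt-assembly step are assumptions.

```python
# Illustrative sketch only: a minimal textual-inversion-style mapping network of the kind
# the abstract describes (image embedding -> pseudo-word token). Names and dimensions are
# assumptions, not the authors' released code.
import torch
import torch.nn as nn

class MappingNetwork(nn.Module):
    """Maps a frozen VLM image embedding to a pseudo-word token embedding."""
    def __init__(self, img_dim=768, token_dim=512, hidden=1024):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(img_dim, hidden), nn.GELU(),
            nn.Linear(hidden, token_dim),
        )

    def forward(self, image_emb):          # (B, img_dim)
        return self.mlp(image_emb)         # (B, token_dim) pseudo-word token

def compose_query(pseudo_token, text_token_embs):
    """Prepend the pseudo-word token to the embedded relative caption,
    yielding the sequence a text encoder would consume."""
    # pseudo_token: (B, token_dim); text_token_embs: (B, L, token_dim)
    return torch.cat([pseudo_token.unsqueeze(1), text_token_embs], dim=1)

# toy usage
mapper = MappingNetwork()
img_emb = torch.randn(4, 768)              # reference-image embeddings from a frozen VLM
cap_embs = torch.randn(4, 12, 512)         # embedded tokens of the modification text
query_seq = compose_query(mapper(img_emb), cap_embs)
print(query_seq.shape)                     # torch.Size([4, 13, 512])
```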
Prioritized scanning: Combining spatial information multiple instance learning for computational pathology
Yuqi Zhang, Jiakai Wang, Baoyu Liang, Yuancheng Yang, Siyang Wu, Chao Tong
Pattern Recognition, Volume 176, Article 113151. Pub Date: 2026-08-01; Epub Date: 2026-01-24; DOI: 10.1016/j.patcog.2026.113151
Multiple instance learning (MIL) has emerged as a reliable paradigm that has propelled the integration of computational pathology (CPath) into clinical histopathology. However, despite significant advancements, current MIL approaches still struggle with inadequate spatial information representation caused by the unordered nature of patches in the original whole slide images (WSIs). To address this limitation, we first demonstrate the importance of prioritized scanning within structured state space models (SSMs). We then introduce an MIL framework that incorporates spatial information, termed Prioritized Scanning MIL (PSMIL). PSMIL primarily comprises two branches and a fusion block. The first branch, the spatial branch, injects potential spatial information into the patch sequence using the original 2D positions and employs an SSM to model the spatial features of the WSI. The second branch, the cross-spatial branch, utilizes a significance scoring block together with an SSM to exploit feature relationships among similar instances across spatial locations. Finally, a lightweight feature fusion block integrates the outputs of both branches, enabling more comprehensive feature utilization. Extensive experiments on 5 popular datasets and 3 downstream tasks demonstrate that PSMIL significantly surpasses state-of-the-art MIL methods, with accuracy (ACC) improvements of up to 5.26% for cancer subtyping. Our code is available at https://github.com/YuqiZhang-Buaa/PSMIL.
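To make the two-branch SSM idea concrete, here is a toy PyTorch sketch with a basic diagonal linear state-space scan, a raster-ordered spatial branch, and a score-ordered cross-spatial branch; the scoring rule, dimensions, and naive fusion are illustrative assumptions, not the PSMIL implementation.

```python
# Illustrative sketch only: a toy two-branch MIL aggregator built around a minimal
# diagonal linear state-space scan. Not the PSMIL code.
import torch
import torch.nn as nn

class DiagonalSSM(nn.Module):
    """Minimal linear SSM: h_t = a * h_{t-1} + b * x_t, y_t = c * h_t (per channel)."""
    def __init__(self, dim):
        super().__init__()
        self.a = nn.Parameter(torch.full((dim,), 0.9))
        self.b = nn.Parameter(torch.ones(dim))
        self.c = nn.Parameter(torch.ones(dim))

    def forward(self, x):                      # x: (N, dim) ordered instance features
        h = torch.zeros(x.shape[1])
        ys = []
        for t in range(x.shape[0]):
            h = self.a * h + self.b * x[t]
            ys.append(self.c * h)
        return torch.stack(ys)                 # (N, dim)

def raster_order(feats, coords):
    """Spatial branch: order patches by their original 2D grid position (row-major)."""
    idx = torch.argsort(coords[:, 0] * 10000 + coords[:, 1])
    return feats[idx]

def prioritized_order(feats, scorer):
    """Cross-spatial branch: order patches by a learned significance score."""
    scores = scorer(feats).squeeze(-1)
    return feats[torch.argsort(scores, descending=True)]

dim = 64
feats = torch.randn(100, dim)                  # patch features of one WSI bag
coords = torch.randint(0, 50, (100, 2))        # original 2D patch positions
scorer = nn.Linear(dim, 1)
spatial = DiagonalSSM(dim)(raster_order(feats, coords))
cross = DiagonalSSM(dim)(prioritized_order(feats, scorer))
bag_emb = torch.cat([spatial.mean(0), cross.mean(0)])   # naive fusion of the two branches
print(bag_emb.shape)                                    # torch.Size([128])
```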
TranSAC: An unsupervised transferability metric based on task speciality and domain commonality
Qianshan Zhan, Xiao-Jun Zeng, Qian Wang
Pattern Recognition, Volume 176, Article 113137. Pub Date: 2026-08-01; Epub Date: 2026-01-29; DOI: 10.1016/j.patcog.2026.113137
In transfer learning, one fundamental problem is transferability estimation, where a metric measures transfer performance without training. Existing metrics face two issues: 1) they require target domain labels, and 2) they focus only on task speciality while ignoring the equally important domain commonality. To overcome these limitations, we propose TranSAC, a Transferability metric based on task Speciality And domain Commonality, which captures the separation between classes and the similarity between domains. Its main advantages are that it is: 1) unsupervised, 2) fine-tuning free, and 3) applicable to both source-dependent and source-free transfer scenarios. To achieve this, we investigate the upper and lower bounds of transfer performance based on fixed representations extracted from the pre-trained model. Theoretical results reveal that unsupervised transfer performance is characterized by entropy-based quantities that naturally reflect task speciality and domain commonality. These insights motivate the design of TranSAC, which integrates both factors to enhance transferability estimation. Extensive experiments are performed across 12 target datasets with 36 pre-trained models, including supervised CNNs, self-supervised CNNs, and ViTs. Results demonstrate the importance of domain commonality and task speciality, and establish TranSAC as superior to state-of-the-art metrics for pre-trained model ranking, target domain ranking, and source domain ranking.
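As an illustration of an entropy-based, unsupervised transferability score of the kind described (not the actual TranSAC formula), the sketch below combines a task-speciality term from soft-assignment entropy on target features with a domain-commonality term from the similarity of source and target feature means; the clustering choice and the weighting are assumptions.

```python
# Illustrative sketch only: an unsupervised score mixing class separation (low
# soft-assignment entropy on target features) with domain similarity (cosine of
# source/target feature means). The exact quantities in TranSAC differ.
import numpy as np
from sklearn.cluster import KMeans

def speciality(target_feats, k=10, temp=1.0):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(target_feats)
    d = np.linalg.norm(target_feats[:, None, :] - km.cluster_centers_[None], axis=-1)
    p = np.exp(-d / temp); p /= p.sum(axis=1, keepdims=True)     # soft assignments
    ent = -(p * np.log(p + 1e-12)).sum(axis=1).mean()
    return 1.0 - ent / np.log(k)          # in [0, 1]; higher = better class separation

def commonality(source_feats, target_feats):
    mu_s, mu_t = source_feats.mean(0), target_feats.mean(0)
    return float(mu_s @ mu_t / (np.linalg.norm(mu_s) * np.linalg.norm(mu_t) + 1e-12))

def transferability(source_feats, target_feats, alpha=0.5, k=10):
    return alpha * speciality(target_feats, k) + (1 - alpha) * commonality(source_feats, target_feats)

rng = np.random.default_rng(0)
src = rng.normal(size=(200, 64))          # fixed source-domain features from a frozen model
tgt = rng.normal(size=(300, 64)) + 0.1    # fixed target-domain features
print(transferability(src, tgt, k=5))
```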
Audio-visual perceptual quality measurement via multi-perspective spatio-temporal EEG analysis
Shuzhan Hu, Mingyu Li, Yang Liu, Weiwei Jiang, Bingrui Geng, Wei Zhong, Long Ye
Pattern Recognition, Volume 176, Article 113156. Pub Date: 2026-08-01; Epub Date: 2026-01-24; DOI: 10.1016/j.patcog.2026.113156
In human-centered communication systems, establishing human perception-aligned audio-visual quality assessment methods is crucial for enhancing multimedia system performance and service quality. However, conventional subjective evaluation methods based on user ratings are susceptible to biases induced by high-level cognitive processes. To address this limitation, we propose an electroencephalography (EEG) feature fusion approach to establish correlations between audio-visual distortions and perceptual experiences. Specifically, we construct an audio-visual degradation-EEG dataset by recording neural responses from subjects exposed to progressively degraded stimuli. Leveraging this dataset, we extract event-related potential (ERP) features to quantify variations in subjects’ perception of audio-visual quality, demonstrating the feasibility of EEG-based perceptual experience assessment. Capitalizing on EEG’s sensitivity to dynamic multimodal perceptual changes, we develop a multi-perspective feature fusion framework, incorporating a spatio-temporal feature fusion architecture and a diffusion-driven EEG augmentation strategy. This framework enables the extraction of experience-related features from single-trial EEG signals, establishing an EEG-based classifier to detect whether distortions induce perceptual experience alterations. Experimental results validate that EEG signals effectively reflect perception changes induced by quality degradation, while the proposed model achieves efficient and dynamic detection of perception alterations from single-trial EEG data.
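A minimal sketch of the kind of processing involved, assuming time-locked EEG epochs: ERP estimation by trial averaging and a tiny spatio-temporal convolutional classifier for single-trial decisions. Channel counts, window lengths, and the architecture are illustrative and are not the paper's model.

```python
# Illustrative sketch only: ERP averaging plus a toy spatio-temporal CNN for
# single-trial EEG classification. Not the paper's framework.
import torch
import torch.nn as nn

def erp_average(epochs):
    """epochs: (n_trials, n_channels, n_samples) time-locked to a distortion onset."""
    return epochs.mean(dim=0)                      # (n_channels, n_samples) ERP estimate

class SpatioTemporalNet(nn.Module):
    def __init__(self, n_channels=32, n_classes=2):
        super().__init__()
        self.temporal = nn.Conv2d(1, 8, kernel_size=(1, 25), padding=(0, 12))
        self.spatial = nn.Conv2d(8, 16, kernel_size=(n_channels, 1))   # mixes electrodes
        self.head = nn.Sequential(nn.AdaptiveAvgPool2d((1, 8)), nn.Flatten(),
                                  nn.Linear(16 * 8, n_classes))

    def forward(self, x):                          # x: (B, n_channels, n_samples)
        x = x.unsqueeze(1)                         # (B, 1, C, T)
        x = torch.relu(self.temporal(x))
        x = torch.relu(self.spatial(x))            # (B, 16, 1, T)
        return self.head(x)

trials = torch.randn(40, 32, 256)                  # 40 trials, 32 channels, 1 s at 256 Hz
print(erp_average(trials).shape)                   # torch.Size([32, 256])
print(SpatioTemporalNet()(trials[:4]).shape)       # torch.Size([4, 2])
```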
One-step multi-view graph clustering via bottom-up structural learning
Wenzhe Liu, Li Jiang, Huibing Wang, Yong Zhang
Pattern Recognition, Volume 176, Article 113175. Pub Date: 2026-08-01; Epub Date: 2026-01-29; DOI: 10.1016/j.patcog.2026.113175
In recent years, tensor-based methods have achieved considerable success in multi-view clustering. However, current approaches have several limitations: 1) insufficient exploration of underlying similarity information (i.e., latent representations); 2) insufficient exploration of higher-order structural information both across and within views; 3) treating clustering learning independently from tensor learning and the overall learning framework. To address these issues, we propose a unified framework called Bottom-up Structural Exploration for One-step Multi-view Graph Clustering (BSE_OMGC). Specifically, we first employ an anchor strategy to build similarity graphs, reducing the complexity of graph learning. To deeply represent the underlying similarity information of the data and mitigate the influence of noise on similarity structures in the original space, BSE_OMGC adaptively separates the noise matrix from the similarity graphs to learn high-quality enhanced graphs. Subsequently, from the bottom up, the enhanced graphs serve as the foundation for constructing high-order tensors. We rotate the constructed tensors and apply the t-TNN to preserve their low-rank properties and better capture inter-view and intra-view higher-order structural information. Finally, we introduce a symmetric non-negative matrix factorization-based graph partitioning technique that learns non-negative embeddings during dynamic optimization to reveal the clustering results, unifying clustering learning within the entire learning framework. Extensive experiments on multiple real-world multi-view datasets, along with comparisons to state-of-the-art methods, demonstrate the effectiveness and robustness of the proposed approach.
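To ground two of the building blocks mentioned above, the NumPy sketch below constructs anchor-based similarity graphs per view and evaluates a t-SVD-style tensor nuclear norm of the stacked graph tensor via an FFT along the view mode; the full BSE_OMGC objective, noise separation, and solver are not reproduced.

```python
# Illustrative sketch only: anchor graphs per view and a tensor nuclear norm computed
# from Fourier-domain frontal slices. Not the BSE_OMGC optimization.
import numpy as np

def anchor_graph(X, anchors, sigma=1.0):
    """Row-stochastic similarity between n samples and m anchors (n x m)."""
    d2 = ((X[:, None, :] - anchors[None]) ** 2).sum(-1)
    S = np.exp(-d2 / (2 * sigma ** 2))
    return S / S.sum(axis=1, keepdims=True)

def tensor_nuclear_norm(T):
    """t-SVD-style nuclear norm of an n1 x n2 x n3 tensor: mean of slice nuclear
    norms after an FFT along the third mode."""
    Tf = np.fft.fft(T, axis=2)
    return sum(np.linalg.svd(Tf[:, :, k], compute_uv=False).sum()
               for k in range(T.shape[2])) / T.shape[2]

rng = np.random.default_rng(0)
views = [rng.normal(size=(100, d)) for d in (20, 30)]       # two views of 100 samples
anchors = [v[rng.choice(100, 10, replace=False)] for v in views]
graphs = [anchor_graph(v, a) for v, a in zip(views, anchors)]
G = np.stack(graphs, axis=2)                                # 100 x 10 x n_views tensor
print(tensor_nuclear_norm(G))
```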
Learning generalizable visual representations with causal diffusion model for controllable editing
Shanshan Huang, Lei Wang, Haoxuan Chen, Yuxuan Liang, Li Liu
Pattern Recognition, Volume 176, Article 113162. Pub Date: 2026-08-01; Epub Date: 2026-01-29; DOI: 10.1016/j.patcog.2026.113162
Representation learning has been widely employed to learn low-dimensional representations composed of multiple independent and interpretable generative factors, such as visual attributes in images, enabling controllable image editing by manipulating specific attributes in the learned representation space. However, in real-world scenarios, generative factors with semantic meanings are often causally related rather than independent. Previous methods built on the independence assumption fail to capture such causal relationships, even in supervised settings. To this end, we propose a diffusion model-based causal representation learning framework, named CausalDiffuser, which models causal prior distributions with structural causal models (SCMs) to explicitly characterize the causal relations among the underlying generative factors. Such a modelling scheme encourages the framework to learn latent representations that encode the causality among generative factors. Furthermore, a composite loss function is introduced to ensure causal disentanglement of the latent representations by incorporating supervision from the ground-truth factors (i.e., image labels). Empirical evaluations on one synthetic dataset and two real-world benchmark datasets show that our approach significantly outperforms state-of-the-art methods. CausalDiffuser effectively edits image attributes by restoring the causal relationships among generative factors and generates counterfactual images through intervention operations.
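As a toy illustration of an SCM-structured latent prior and a do-intervention (not the CausalDiffuser architecture), the sketch below samples latent factors from a linear SCM and clamps one factor while cutting its incoming edges; the adjacency matrix and factor semantics are assumptions.

```python
# Illustrative sketch only: a linear SCM over latent generative factors with a simple
# do-intervention. The connection to the diffusion model is not shown.
import numpy as np

def scm_sample(A, eps):
    """Sample from a linear SCM z = A^T z + eps, where A[i, j] = effect of factor i on j."""
    n = A.shape[0]
    return np.linalg.solve(np.eye(n) - A.T, eps)

def do_intervention(A, eps, idx, value):
    """do(z_idx = value): cut incoming edges to idx and clamp it before propagating effects."""
    A_int = A.copy(); A_int[:, idx] = 0.0
    eps_int = eps.copy(); eps_int[idx] = value
    return scm_sample(A_int, eps_int)

# toy DAG over 3 factors, e.g. factor 0 causes factor 1, factor 2 independent (assumed semantics)
A = np.array([[0.0, 0.8, 0.0],
              [0.0, 0.0, 0.0],
              [0.0, 0.0, 0.0]])
eps = np.array([0.5, 0.1, -0.3])
print(scm_sample(A, eps))                 # observational latent
print(do_intervention(A, eps, 0, 2.0))    # latent after intervening on factor 0
```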
Frequency-aligned supervision for few-shot neural rendering
Su-Ji Jang, Ue-Hwan Kim
Pattern Recognition, Volume 176, Article 113183. Pub Date: 2026-08-01; Epub Date: 2026-01-28; DOI: 10.1016/j.patcog.2026.113183
Neural rendering has shown significant potential in generating high-quality 3D scenes from sparse inputs. However, existing methods struggle to simultaneously capture both low-frequency global structures and high-frequency fine details, leading to suboptimal scene representations. To overcome this limitation, we propose a frequency-aligned supervision framework that explicitly separates the learning process into low-frequency and full-spectrum components. By introducing two sub-networks and aligning supervision signals at appropriate layers, our method enhances the formation of global structures while preserving fine details. Specifically, the low-frequency network (LFN) is supervised with low-pass targets (Gaussian-filtered images) to form global structures, while the full-spectrum network (FSN) is supervised with the original images to refine high-frequency details. The proposed approach is broadly applicable to MLP-based NeRF architectures without requiring major architectural modifications. Extensive experiments demonstrate that our method consistently improves PSNR, SSIM, and LPIPS across multiple NeRF variants and datasets, confirming its robustness in sparse input scenarios.
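A minimal sketch of the frequency-aligned loss pairing described above, assuming a separable Gaussian low-pass filter for the LFN targets and simple MSE terms; kernel size, sigma, and loss weights are illustrative choices rather than the paper's settings.

```python
# Illustrative sketch only: one rendered output supervised with a Gaussian-filtered
# (low-pass) target and one with the original image, as the abstract describes.
import torch
import torch.nn.functional as F

def gaussian_blur(img, sigma=2.0, ksize=9):
    """img: (B, C, H, W). Separable Gaussian low-pass filter."""
    x = torch.arange(ksize, dtype=torch.float32) - ksize // 2
    k = torch.exp(-x ** 2 / (2 * sigma ** 2)); k /= k.sum()
    c = img.shape[1]
    kh = k.view(1, 1, 1, ksize).repeat(c, 1, 1, 1)
    kv = k.view(1, 1, ksize, 1).repeat(c, 1, 1, 1)
    img = F.conv2d(img, kh, padding=(0, ksize // 2), groups=c)
    return F.conv2d(img, kv, padding=(ksize // 2, 0), groups=c)

def frequency_aligned_loss(lfn_render, fsn_render, target, w_low=1.0, w_full=1.0):
    low_target = gaussian_blur(target)                 # low-pass supervision for the LFN
    return w_low * F.mse_loss(lfn_render, low_target) + w_full * F.mse_loss(fsn_render, target)

target = torch.rand(2, 3, 64, 64)                      # ground-truth training views
lfn_out, fsn_out = torch.rand_like(target), torch.rand_like(target)
print(frequency_aligned_loss(lfn_out, fsn_out, target).item())
```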
Joint asymmetric discrete hashing for cross-modal retrieval
Jiaxing Li, Lin Jiang, Zuopeng Yang, Xiaozhao Fang, Shengli Xie, Yong Xu
Pattern Recognition, Volume 176, Article 113180. Pub Date: 2026-08-01; Epub Date: 2026-01-28; DOI: 10.1016/j.patcog.2026.113180
Cross-modal hashing is a promising practical technique for information retrieval over multimedia data. However, several technical hurdles remain, e.g., how to further reduce the heterogeneous semantic gaps between cross-modal data, how to extract cross-modal knowledge by jointly training on data from different modalities, and how to better leverage label information to generate more discriminative hash codes. To overcome these challenges, this paper proposes a joint asymmetric discrete hashing (JADH) method for cross-modal retrieval. By leveraging a kernel mapping operation, JADH extracts non-linear features of the cross-modal data to better preserve semantic information when learning the latent common space. Then, a joint asymmetric hash-code learning term is designed to learn hash codes for data from different modalities jointly, so that more cross-modal information is preserved and the heterogeneous semantic gaps are effectively reduced. Finally, a log-likelihood similarity-preserving term is proposed to boost hash-code learning from the similarity matrix, while a classifier learning term further improves the quality of the learned hash codes. In addition, an alternating algorithm is derived to solve the optimization problem of JADH efficiently. Experimental results on four widely used datasets show that JADH outperforms state-of-the-art baseline methods for hashing-based cross-modal retrieval in both accuracy and efficiency.
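To illustrate the kernel-mapping and binary-coding steps in a generic way (the joint asymmetric objective and its solver are not reproduced), the NumPy sketch below maps each modality through an RBF kernel over anchor points, projects to sign-thresholded codes, and ranks a database by Hamming distance; the projection matrices here are random placeholders rather than learned ones.

```python
# Illustrative sketch only: kernelized features -> linear projection -> sign codes ->
# Hamming ranking. Not the JADH learning objective.
import numpy as np

def kernel_map(X, anchors, sigma=1.0):
    d2 = ((X[:, None, :] - anchors[None]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))                 # (n, m) non-linear features

def hash_codes(K, W):
    return np.sign(K @ W)                                  # (n, n_bits) in {-1, +1}

def hamming_rank(query_code, db_codes):
    dist = (db_codes.shape[1] - db_codes @ query_code) / 2 # Hamming distance via inner product
    return np.argsort(dist)

rng = np.random.default_rng(0)
img_feats, txt_feats = rng.normal(size=(500, 128)), rng.normal(size=(500, 64))
anchors_i, anchors_t = img_feats[:50], txt_feats[:50]
W_i, W_t = rng.normal(size=(50, 32)), rng.normal(size=(50, 32))   # 32-bit codes per modality
B_img = hash_codes(kernel_map(img_feats, anchors_i), W_i)
B_txt = hash_codes(kernel_map(txt_feats, anchors_t), W_t)
print(hamming_rank(B_txt[0], B_img)[:5])                  # text query -> top-5 image indices
```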
S2I-DiT: Unlocking the semantic-to-image transferability by fine-tuning large diffusion transformer models
Gang Li, Enze Xie, Chongjian Ge, Xiang Li, Lingyu Si, Changwen Zheng, Zhenguo Li
Pattern Recognition, Volume 176, Article 113158. Pub Date: 2026-08-01; Epub Date: 2026-01-25; DOI: 10.1016/j.patcog.2026.113158
Denoising Diffusion Probabilistic Models (DDPMs) have made significant progress in image generation. Recent works in semantic-to-image (S2I) synthesis have also shifted from the previously de facto GAN-based methods to DDPMs, yielding better results. However, these works mostly employ a U-Net structure and a vanilla training-from-scratch scheme for S2I, overlooking the potential benefits offered by task-related pre-training. In this work, we introduce a Transformer-based architecture, namely S2I-DiT, and reconsider the merits of a pre-trained large diffusion model for cross-task adaptation (i.e., from class-conditional generation to S2I). In S2I-DiT, we propose integrating semantic embedders within Diffusion Transformers (DiTs) to maximize the utilization of semantic information. The semantic embedder densely encodes semantic layouts to guide the adaptive normalization process. We configure semantic embedders in a layer-wise manner to learn pixel-level correspondence, enabling finer-grained semantic-to-image control. Besides, to fully unleash the cross-task transferability of DDPMs, we introduce a two-stage fine-tuning strategy, which first adapts the semantic embedders in the pixel-level space and then fine-tunes the partial/entire model for cross-task adaptation. Notably, S2I-DiT pioneers the application of Large Diffusion Transformers to cross-task fine-tuning. Extensive experiments on four benchmark datasets demonstrate S2I-DiT’s effectiveness, as it achieves state-of-the-art performance in terms of quality (FID) and diversity (LPIPS) while consuming fewer training iterations. This work establishes a new state-of-the-art for semantic-to-image generation and provides valuable insights into the cross-task transferability of large generative models.
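As a rough sketch of layer-wise semantic conditioning through adaptive normalization (an assumption-laden stand-in, not the S2I-DiT released code), the following PyTorch snippet encodes a semantic layout into per-token scale and shift that modulate a transformer block's normalized activations.

```python
# Illustrative sketch only: a semantic embedder producing per-patch (scale, shift)
# that modulate a normalized transformer block. Dimensions and layout are assumptions.
import torch
import torch.nn as nn

class SemanticEmbedder(nn.Module):
    """Encodes a semantic layout map to per-patch (scale, shift) modulation."""
    def __init__(self, n_classes=20, dim=256, patch=8):
        super().__init__()
        self.proj = nn.Conv2d(n_classes, 2 * dim, kernel_size=patch, stride=patch)

    def forward(self, layout):                      # layout: (B, n_classes, H, W)
        mod = self.proj(layout).flatten(2).transpose(1, 2)   # (B, n_tokens, 2*dim)
        return mod.chunk(2, dim=-1)                 # scale, shift: (B, n_tokens, dim)

class ModulatedBlock(nn.Module):
    def __init__(self, dim=256, heads=4):
        super().__init__()
        self.norm = nn.LayerNorm(dim, elementwise_affine=False)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x, scale, shift):             # x: (B, n_tokens, dim)
        h = self.norm(x) * (1 + scale) + shift      # semantics-conditioned normalization
        h = x + self.attn(h, h, h, need_weights=False)[0]
        return h + self.mlp(self.norm(h) * (1 + scale) + shift)

tokens = torch.randn(2, 64, 256)                    # noisy-latent patch tokens (8x8 grid)
layout = torch.randn(2, 20, 64, 64)                 # semantic layout, 20 classes
scale, shift = SemanticEmbedder()(layout)
print(ModulatedBlock()(tokens, scale, shift).shape) # torch.Size([2, 64, 256])
```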
Generalizable face forgery detection via mining single-step reconstruction difference
Kai Zhou, Guanglu Sun, Linsen Yu, Jun Wang
Pattern Recognition, Volume 176, Article 113265. Pub Date: 2026-08-01; Epub Date: 2026-02-10; DOI: 10.1016/j.patcog.2026.113265
Existing face forgery detection methods mainly focus on capturing specific artifacts. While achieving high accuracy on in-distribution data, they often generalize poorly to unseen manipulation techniques because the learned feature representation is strongly correlated with the training set. To mitigate this strong correlation and move towards a more generalizable feature representation, we propose a novel face forgery detection framework based on the Single-step Reconstruction Difference (SRD). Our approach explores more generalizable features by mining the differences between the original and single-step reconstructed features of both real and fake faces. More specifically, we design a feature enhancement module that processes and refines the single-step reconstruction difference, progressively integrating forgery-related clues into the network features through an attention mechanism. In addition, we design a Frequency-Constrained Contrastive Loss (FCC Loss) to learn discriminative and robust features by contrasting real and fake faces using frequency-domain information. Experimental results demonstrate that the proposed method not only exhibits excellent generalization performance across different datasets but also shows strong robustness against various image attacks. Our code is released at: https://github.com/zhouk369/SRD.
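A toy sketch of the single-step reconstruction-difference idea, assuming a frozen autoencoder as the reconstructor and a simple token-wise attention re-weighting; the actual SRD modules and the Frequency-Constrained Contrastive Loss follow the paper's definitions, which are not reproduced here.

```python
# Illustrative sketch only: one reconstruction step, its difference from the input
# features, and an attention-style re-weighting of backbone tokens. Not the SRD code.
import torch
import torch.nn as nn

class TinyAutoencoder(nn.Module):
    def __init__(self, dim=128):
        super().__init__()
        self.enc = nn.Linear(dim, 32)
        self.dec = nn.Linear(32, dim)

    def forward(self, feats):                       # one reconstruction step, no iteration
        return self.dec(torch.relu(self.enc(feats)))

def enhance_with_srd(feats, autoencoder):
    """feats: (B, N, dim) spatial tokens from a frozen backbone."""
    with torch.no_grad():
        recon = autoencoder(feats)
    diff = (feats - recon).abs()                    # single-step reconstruction difference
    attn = torch.softmax(diff.mean(-1, keepdim=True), dim=1)   # token-wise attention
    return feats + attn * feats                     # emphasize forgery-related tokens

feats = torch.randn(4, 49, 128)                     # 7x7 token grid per face image
ae = TinyAutoencoder()
print(enhance_with_srd(feats, ae).shape)            # torch.Size([4, 49, 128])
```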